# When data begins to speak

In this tutorial you will see when data begins to speak -- this is an example of web scraping and of turning data into something useful. You will learn how to grab data, transform it, and analyse it.

## Data source characteristics

This year, again, I had a chance to participate in the Poland Business Run charity run. The formula of this run is quite simple:

• Companies submit four-person teams to the competition, paying a starting fee for each participant -- this is how funds are collected.
• Because of the pandemic, this year, similarly to 2020, you could cover the distance (4 km) anywhere in the world, and you could do this on your own, with your teammates or any other group, according to your preferences.
• The route must be completed between 00:00 and 18:00 Central European Summer Time (CEST).
• After finishing the race, participants individually submit their results via the mobile application or the website.
• Finally, when the run ends, you can visit the results web page to check how fast you were.

Visiting the results web page is how you can check your personal or team results. It is also what you can try to use to make simple data speak to you and to turn it into something valuable.

## Examine the data source

This year 26799 runners finished the Poland Business Run. A lot of people, a lot of data. Let's see how you can get it.

Visiting the results web page, you should see something like this:

After scrolling the table down, you will notice that the data is paginated, divided into 50-record chunks:

At the bottom you can find two buttons to move through the results in 50-record steps. If you hover over one of them, you will notice that the hyperlink does not change:

Now there are two options:

Let's verify which option is used in our case. Opening the developer tools in my browser (in Safari select Develop/Show Page Source or press Option-Command-U; alternatively, you can select Develop/Show Web Inspector, or press Option-Command-I, and then select Sources) for the results web page, I get the following result:

If you prefer, you can use another view showing the webpage elements:

Whatever you select, you will notice that record number 50 is the last record. There are no more records at this moment, and there are no hidden elements that would be displayed by JavaScript. Based on this, you can infer that the second option, with asynchronous loading, is used in this case.

Now you have to investigate what happens when you press the Next button.

I selected the Network tab in my browser and then pressed the Next button. The result is given below:

As you can see, there was a call to

file. If you hover over the file name, you will see the full request:

which in this case takes the following form:

Full request information is available if you click on the file name:

If you click the Next button again, the request will take the form:

Open a new tab in your browser and use the following URL:

This should load results 151 to 200, and this is exactly what you get (pbr_2021_00150.json).

Perfect! Now you know the general request pattern, so all you need is a piece of code to grab all the data.

## Getting raw data

To grab all the data you can use the short Python script given below. Before you run this code, be sure to create a raw_data directory in the same location where you keep the script file.
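The download loop might be sketched as below. The endpoint URL is a placeholder (an assumption, not the real address) -- substitute the request you captured in the Network tab; the offsets run from 0 to 5400 in steps of 50, matching the file names seen later.

```python
import os
import urllib.request

# Hypothetical endpoint -- replace with the URL from your Network tab.
BASE_URL = "https://example.com/api/results?startAt={offset}"

def file_name(offset: int) -> str:
    # Local file name with a zero-padded offset, e.g. pbr_2021_00150.json.
    return f"pbr_2021_{offset:05d}.json"

def grab_all(last_offset: int = 5400, step: int = 50,
             out_dir: str = "raw_data") -> None:
    # Walk the offsets 0, 50, 100, ... and save each response to disk.
    for offset in range(0, last_offset + step, step):
        url = BASE_URL.format(offset=offset)
        path = os.path.join(out_dir, file_name(offset))
        with urllib.request.urlopen(url) as response, open(path, "wb") as f:
            f.write(response.read())
```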

Note:
If your directory has a longer path, you can build it as shown below:

The * operator unpacks a container by producing individual elements of the container.
End note
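For example, with a made-up list of path components:

```python
import os

# * unpacks the list, so each element becomes a separate argument
# to os.path.join (the directory names here are invented).
parts = ["home", "user", "pbr_2021", "raw_data"]
path = os.path.join(*parts)
print(path)
```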

Note:
If you work in Google Colaboratory, remember to mount your drive:

and then you can use your folder (in the following example my 'base folder' is bd):

End note

It will take a few seconds to get all the data. After that you can check the contents of the raw_data directory:

As you can see, the pbr_2021_00000.json file has 34050 bytes (approximately 33.25 KiB), the next file, pbr_2021_00050.json, has 33798 bytes (approximately 33.0 KiB), and so on, up to pbr_2021_04650.json, which has much less than 30 kB -- to be exact, it has 8875 bytes (approximately 8.66 KiB). All the other files, from pbr_2021_04700.json to pbr_2021_05400.json, have exactly 23 bytes, which suggests that all of them are identical and have no usable contents. To be sure, you can check the contents of a few files:

• The pbr_2021_04600.json file has the following contents:

Data runs from 4601 to 4650, so there are 50 records inside.
• The pbr_2021_04650.json file, which looks like this:

Data runs from 4651 to 4663, so there are fewer records than the expected 50.
• The pbr_2021_04700.json file, which is really simple:

Yes, there is nothing interesting in these files.
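If you prefer Python to eyeballing a directory listing, a small sketch like this will report the sizes and let you spot the 23-byte files:

```python
import os

def list_sizes(directory: str = "raw_data") -> dict:
    # Map each file name to its size in bytes; the 23-byte entries are
    # the responses with no usable contents.
    sizes = {}
    for name in sorted(os.listdir(directory)):
        sizes[name] = os.path.getsize(os.path.join(directory, name))
    return sizes

# for name, size in list_sizes().items():
#     print(f"{name}\t{size} bytes")
```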

When you have all the raw data, it's time to extract the HTML code. You can do this with the simple code given below (extract_html.py):
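A minimal sketch of such a script is shown below; the name of the JSON field holding the HTML fragment (here "html") is an assumption -- check the actual key in your raw files.

```python
import json
import os

def extract_all(raw_dir: str = "raw_data", out_dir: str = "html") -> None:
    # Read every raw JSON file and save its HTML fragment as a .htm file.
    # The "html" key below is assumed, not confirmed by the source.
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(raw_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(raw_dir, name), encoding="utf-8") as f:
            data = json.load(f)
        html = data.get("html", "")
        out_name = name.replace(".json", ".htm")
        with open(os.path.join(out_dir, out_name), "w", encoding="utf-8") as f:
            f.write(html)
```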

Unfortunately, at some point you will get a result similar to the following:

This means that there is something wrong with the pbr_2021_02250.json file. If you take a closer look at its contents, you will notice an unexpected message inside:

Remove the non-JSON part:

from the pbr_2021_02250.json file and run the extract_html.py script again. This time your run shouldn't be interrupted by unexpected data.

Finally, at this step you should have the HTML code (html.zip) extracted from the raw data you grabbed at the beginning (raw_data_original.zip).

## Save data in a handy data structure

HTML is not a bad data format, but it is a little difficult for further processing, so it would be nice to have all the data in a handier form, for example CSV.

To make this transformation, you have to examine the HTML content first. You can select any HTML file, for example pbr_2021_00000.htm:

Unfortunately, it is one very long line. Rearranging the HTML for better visualisation, you will see:

Now you can see that this HTML is not perfectly correct, because the closing tag for <td class="text-left"> is missing. To parse this HTML you could use a ready-made library like BeautifulSoup. For production purposes this is a good approach, but here I want to show you how you can do it with powerful regular expressions.

First, you will find all the rows -- the groups of characters between the <tr> and </tr> tags. You can use the following regular expression:

Why .*? instead of simply .*, where . matches any character (except line terminators) and * matches the previous token between zero and unlimited times, as many times as possible?
Because * is a greedy quantifier, while *? is a lazy quantifier. The difference is shown in the images below:
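You can also see the difference in a short snippet:

```python
import re

html = "<tr>first row</tr><tr>second row</tr>"

# Greedy: .* expands as far as possible, swallowing both rows at once.
greedy = re.findall(r"<tr>(.*)</tr>", html)
# Lazy: .*? stops at the earliest possible closing tag, one row per match.
lazy = re.findall(r"<tr>(.*?)</tr>", html)

print(greedy)  # ['first row</tr><tr>second row']
print(lazy)    # ['first row', 'second row']
```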

You can use a positive lookbehind (?<=...) and a positive lookahead (?=...) to extract what is between the <tr> and </tr> tags, skipping the tags themselves:

This way you extract exactly one row of data. Now it's time to parse the row itself.
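A quick check of the lookaround version on a minimal example:

```python
import re

row_html = "<tr><td>1</td></tr>"

# (?<=<tr>) requires <tr> just before the match, (?=</tr>) requires
# </tr> just after it; neither tag ends up in the matched text.
match = re.search(r"(?<=<tr>).*?(?=</tr>)", row_html)
print(match.group())  # <td>1</td>
```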

With the expression

you can extract information about the team's position:

Similarly the team number:

the team name:

the person order and time:

and the total team time:
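Put together, the extraction steps might look like the sketch below. Note that both the sample row and the exact patterns are invented for illustration -- the real markup and the author's expressions will differ in detail.

```python
import re

# An invented sample row -- the real markup differs in detail.
row = ('<td>1</td><td>123</td><td class="text-left">Team X'
       '<td>1. 00:15:30</td><td>01:02:03</td>')

numbers = re.findall(r"<td>(\d+)</td>", row)
position, team_number = numbers[0], numbers[1]
# The closing </td> is missing after the team name, so stop at the next '<'.
team_name = re.search(r'<td class="text-left">(.*?)<', row).group(1)
times = re.findall(r"\d{2}:\d{2}:\d{2}", row)

print(position, team_number, team_name, times)
```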

Below you have the full code to complete this task:

When tested on the first two records (saved in the html_record_1_2.htm file in the html directory):

it returns the following result:

If everything is correct, you can remove the html_record_1_2.htm file from the html directory and put all your HTML files from the previous step there.

In case of any problems, you can download my data.csv file.

At this moment, you can do any data analysis you want. The next part of this material is devoted to some analysis, but of course you are free to squeeze out as much information as you can.

## Data analysis

In this part I will use Google Colab(oratory). I'm not a big fan of such solutions, but for didactic purposes it is really perfect.

Start by uploading the data.csv file to your Google Drive; I saved it in the pbr_2021 directory:

Next you have to log in to Google Colab and start a new notebook, either just after login (left image below) or later by selecting the right option from the top menu (right image below):

 Starting a new notebook just after login (left) or later by selecting the right option from the top menu (right)

Name it pbr_2021:

When you see your Google Drive files in the left pane (file explorer), right-click on the file that you need to import and select Copy path. Then import it as usual in pandas, using the copied path:

and run it. Oops...

Checking, you will see:

Funny! As you can see, in this case the team's name is Run, Forecast, run!. The commas in the team's name are interpreted as field separators!

Go back to the htm2csv.py file and replace , with ;. Find the following lines:

and make the replacement six times:

After this modification you can run your script again and check the results:

In case of any problems, you can download my data.csv file.

Upload the new CSV file to Google Drive and try to load the data again. Now you encounter a similar problem, but in line 2168:

Let's check what is wrong in this case.

Hmm... this line looks good. I will try to load less data:

As you can see, the last loaded row is for team AssecoBS 4, but when you check this line in the terminal, you will see something different:

I have no idea why the row number reported in Google Colab is different from the line number in the CSV file. It's strange, but you have to find the correct line. I will use grep with the -n option to find the line number of the last correctly loaded line:

OK, so you have to check line number 2810:

And you have the culprit: this time the team's name is Running is my life ;-) with a ; character inside!

Once again you have to change the field separator. This time I will use the ^ character.

Anyway, I was curious about this difference in line numbers, so I loaded the CSV with commas into a spreadsheet and found this:

What?!

The problem starts in row number 418:

and ends in row number 1060:

Making some quick computations:

If you look carefully at the team's name in row 418: "Znakomicie, wszyscy!” you will see that two different characters are used to delimit the string: the opening one is " while the closing one is ”, and this is why all the rows up to row number 1060 are treated as one long string.

To fix this, you have to manually replace ” with " to get the form:
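The one-off fix can also be scripted; a small sketch, assuming the file only needs the typographic closing quote replaced:

```python
def fix_quotes(path: str = "data.csv") -> None:
    # Replace the typographic closing quote (U+201D) with a plain ASCII
    # quote so the CSV quoting is balanced again.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace("\u201d", '"'))
```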

Hooray! 4663 rows are loaded successfully, which agrees with the total number of rows in this file:

## Very basic analysis

• Display some information about the data types used to represent the data:

• Extract the column at index 3, which is the column with the teams' total time:

• Convert the data to the timedelta type to be able to perform time calculations:

• Convert the timedelta values to numbers of seconds:

• Draw some simple histograms:

• Calculate mean:

• Convert seconds back to a timedelta:

• Calculate mode:
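The steps above can be sketched on a tiny in-memory sample; with the real file you would start from pd.read_csv("data.csv", sep="^") instead, and the column name total_time used below is made up:

```python
import pandas as pd

# Tiny invented sample standing in for the total-team-time column.
df = pd.DataFrame({"total_time": ["01:02:03", "01:02:03", "00:58:10"]})

print(df.dtypes)                                     # data types (object here)
td = pd.to_timedelta(df["total_time"])               # strings -> timedelta
seconds = td.dt.total_seconds()                      # timedelta -> seconds
mean_td = pd.to_timedelta(seconds.mean(), unit="s")  # mean back as timedelta
mode_td = td.mode()                                  # most frequent total time
print(mean_td)
print(mode_td.iloc[0])
# seconds.hist() would draw the histogram in a notebook
```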