When data begins to speak

In this tutorial you will see when data begins to speak -- this is an example of web scrapping and turning data into something useful. You will learn how to grab data, transform them and analyse.

Table of contents:

Data source characteristic
Examine data source
Getting raw data
Save data in handy data structure
Data analysis
- Loading data
- Very basic analysis

Data source characteristic

This year, again, I had a chance to participate in Poland Business Run charity run. Formula of this run is quite simple:

Companies submit four-person teams to take part in the competition paying for each the starting fee - this is how funds are collected.
Because of the pandemia, this year, similarly to 2020, you could cover the distance (4km) anywhere in the world and you could do this on your own, with your teammates or any other group, according to your preferences.
The route must be completed between 00:00 and 18:00 Central European Summer Time (CEST).
After finishing the race, participants individually submit their results via the mobile application or a website.
Finally, when run ends, you can check how fast you were visiting results web page.

Visiting results web page is how you can check your personal or your team results. This is also what you can try to use to make simple date to speak to you and to turn simple data into something valuable.

Examine data source

This year 26799 runners has finished Poland Business Run. A lot of people, a lot of data. Let's see how you can get them.

Visiting results web page you should see something like this:

After scroling the table down, you will notice that data is paginated, divided into 50-records chunks:

At the bottom you can find two buttons to move over results with 50-records steps. If you hover over one of them, you will notice unchanged hyper reference:

https://live.sts-timing.pl/businessrun2021/

1	https://live.sts-timing.pl/businessrun2021/

Now there are two options:

All data are now in your web browser and only a small part is presented with JavaScript.
Only 50 records are now in your web browser and other chunks are loaded asynchronously with JavaScript.

Let's verify which option was used in our case. Opening developer tools in my browser (in Safari select: Programowanie/Pokaż kod źródłowy strony or press Option-Command-U; alternatively you can select Programowanie/Pokaż inspektora www, or press Option-Command-I; then select Sources) for results web page I get the following result:

If you prefer you can use another view showing webpage elements:

Whatever you select, you can notice that record numer 50 is the last record. There is no more records at this moment and there are no hidden elements which are displayed by JavaScript. Base on this, you can infer that second option, with asynchronous load is used in this case.

Now you have to investigate what happens when you press Next button.

I selected Network tab in my browser and then pressed Next button. The result is given below:

As you can see there was a call to

api.php

api.php

file. If you hover over file name, you will get a full request:

which in this case takes the following form:

https://live.sts-timing.pl/businessrun2021/api.php?action=load-more-results&i=50&d=0&p=false&filtr=false

1	https://live.sts-timing.pl/businessrun2021/api.php?action=load-more-results&i=50&d=0&p=false&filtr=false

A full request information are available if you click on file name:

If you click again the Next button, request will take the form:

https://live.sts-timing.pl/businessrun2021/api.php?action=load-more-results&i=100&d=0&p=false&filtr=false

1	https://live.sts-timing.pl/businessrun2021/api.php?action=load-more-results&i=100&d=0&p=false&filtr=false

Open a new tab in your browser and use the following URL:

https://live.sts-timing.pl/businessrun2021/api.php?action=load-more-results&i=150&d=0&p=false&filtr=false

1	https://live.sts-timing.pl/businessrun2021/api.php?action=load-more-results&i=150&d=0&p=false&filtr=false

This should load results from 151 to 200 and this is what you really get (pbr_2021_00150.json).

Perfect! Now you know general request pattern, so everything you need is a piece of code to grab all data.

Getting raw data

To grab all data you can use short Python script given below. Before you run this code be sure to create raw_data directory in the same location where you keep script file.

import os
import requests

DIR_RAW_DATA = 'raw_data'

def get_file_name(sequence_start):
	return "pbr_2021_{sequence_start:05d}.json".format(sequence_start=sequence_start)

# There are 50 teams per page and 5 runners in each team
# 26799 / 5 = 5359,8 = 5360
for sequence_start in range(0, 5360+50, 50):
	request_params = {'action': 'load-more-results',
	'i': sequence_start,
	'd': 0,
	'p': False,
	'filtr': False}

	result = requests.get('https://live.sts-timing.pl/businessrun2021/api.php',
	params = request_params)

	file_name = get_file_name(sequence_start)
	
	if result.status_code == 200:	
		print("Done sequence starting from {sequence_start}, file name={file_name}".format(sequence_start=sequence_start, file_name=file_name))
	    
		with open(os.path.join(DIR_RAW_DATA, file_name), 'wb') as file_out:
			file_out.write(result.content)
	else:
		print("Problem with {file_name} file".format(file_name=file_name))

import os

import requests

DIR_RAW_DATA = 'raw_data'

def get_file_name(sequence_start):

return "pbr_2021_{sequence_start:05d}.json".format(sequence_start=sequence_start)

# There are 50 teams per page and 5 runners in each team

# 26799 / 5 = 5359,8 = 5360

for sequence_start in range(0, 5360+50, 50):

request_params = {'action': 'load-more-results',

'i': sequence_start,

'd': 0,

'p': False,

'filtr': False}

result = requests.get('https://live.sts-timing.pl/businessrun2021/api.php',

params = request_params)

file_name = get_file_name(sequence_start)

if result.status_code == 200:

print("Done sequence starting from {sequence_start}, file name={file_name}".format(sequence_start=sequence_start, file_name=file_name))

with open(os.path.join(DIR_RAW_DATA, file_name), 'wb') as file_out:

file_out.write(result.content)

else:

print("Problem with {file_name} file".format(file_name=file_name))

Note:
If your directory has longer path, you can build it as given below:

a_list = ["directory", "path", "to", "file"]
file_path = os.path.join(*a_list)
print(file_path)

OUTPUT

directory/path/to/file

a_list = ["directory", "path", "to", "file"]

file_path = os.path.join(*a_list)

print(file_path)

OUTPUT

directory/path/to/file

The * operator unpacks a container by producing individual elements of the container.
End note

Note:
If you work in Googl Colaboratory remember to mount your drive:

from google.colab import drive
drive.mount('/content/drive')

1 2	from google.colab import drive drive.mount('/content/drive')

and next you can use your folder (in the following example my 'base folder' is bd):

pathComponents = ['/content','drive','MyDrive','bd',DIR_RAW_DATA]
path = os.path.join(*pathComponents, fileName)

with open(path, 'wb') as fileOut:
  fileOut.write(result.content)

pathComponents = ['/content','drive','MyDrive','bd',DIR_RAW_DATA]

path = os.path.join(*pathComponents, fileName)

with open(path, 'wb') as fileOut:

fileOut.write(result.content)

End note

It will take a few seconds to get all data. After that you can check contents of raw_data directory:

(base) fulmanp@fulmanp-T500:~/pbr2021$ ls -la raw_data/
razem 3452
drwxrwxr-x 2 fulmanp fulmanp 24576 wrz 22 14:32 .
drwxrwxr-x 3 fulmanp fulmanp  4096 wrz 22 14:31 ..
-rw-rw-r-- 1 fulmanp fulmanp 34050 wrz 22 14:32 pbr_2021_00000.json
-rw-rw-r-- 1 fulmanp fulmanp 33798 wrz 22 14:32 pbr_2021_00050.json
[...]
-rw-rw-r-- 1 fulmanp fulmanp 34021 wrz 22 14:32 pbr_2021_04550.json
-rw-rw-r-- 1 fulmanp fulmanp 34149 wrz 22 14:32 pbr_2021_04600.json
-rw-rw-r-- 1 fulmanp fulmanp  8875 wrz 22 14:32 pbr_2021_04650.json
-rw-rw-r-- 1 fulmanp fulmanp    23 wrz 22 14:32 pbr_2021_04700.json
[...]
-rw-rw-r-- 1 fulmanp fulmanp    23 wrz 22 14:32 pbr_2021_05400.json

(base) fulmanp@fulmanp-T500:~/pbr2021$ ls -la raw_data/

razem 3452

drwxrwxr-x 2 fulmanp fulmanp 24576 wrz 22 14:32 .

drwxrwxr-x 3 fulmanp fulmanp 4096 wrz 22 14:31 ..

-rw-rw-r-- 1 fulmanp fulmanp 34050 wrz 22 14:32 pbr_2021_00000.json

-rw-rw-r-- 1 fulmanp fulmanp 33798 wrz 22 14:32 pbr_2021_00050.json

[...]

-rw-rw-r-- 1 fulmanp fulmanp 34021 wrz 22 14:32 pbr_2021_04550.json

-rw-rw-r-- 1 fulmanp fulmanp 34149 wrz 22 14:32 pbr_2021_04600.json

-rw-rw-r-- 1 fulmanp fulmanp 8875 wrz 22 14:32 pbr_2021_04650.json

-rw-rw-r-- 1 fulmanp fulmanp 23 wrz 22 14:32 pbr_2021_04700.json

[...]

-rw-rw-r-- 1 fulmanp fulmanp 23 wrz 22 14:32 pbr_2021_05400.json

As you can see, pbr_2021_00000.json file has 34050 bytes (approximately 33.25)KiB, next file, pbr_2021_00050.json file has 33798 bytes (approximately 33.0KiB), and so on, till pbr_2021_04650.json file which has much less than 30kB -- to be exact, it has 8875 bytes (approximately 8.66 KiB). All the other files, starting from pbr_2021_04700.json to pbr_2021_05400.json have exactly 23 bytes which suggest that all of them are identical and no usable contents. To be sure, you can check contents of few files:

pbr_2021_04600.json file, has the following contents:

{"html":"<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">4601<\/td>[...]<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">4650<\/td><td>2728<\/td><td style=\"font-weight:bold;font-size:14px;\"><b>Z zadyszk\u0105 w stron\u0119 s\u0142o\u0144ca (2728)<\/b><\/td><td class=\"text-left\">(2728\/1) <a href=\"person.php?n=2728\/1\">Kr\u00f3l - Maziarz Karolina<\/a> - 01:30:53<br>(2728\/2) <a href=\"person.php?n=2728\/2\">Kokoszka Kamil<\/a> - 00:23:00<br>(2728\/3) <a href=\"person.php?n=2728\/3\">Kad\u0142ubowska-Mruk Sylwia<\/a> - 00:44:01<br>(2728\/4) <a href=\"person.php?n=2728\/4\">Kuchta Bartosz<\/a> - 01:59:58<br>(2728\/5) <a href=\"person.php?n=2728\/5\">Szymczak Adam<\/a> - 00:42:14<br><td class=\"t\">05:20:06<\/td><td class=\"st\">-<\/td><\/tr>","last":4650}

{"html":"<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">4601<\/td>[...]<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">4650<\/td><td>2728<\/td><td style=\"font-weight:bold;font-size:14px;\">Z zadyszk\u0105 w stron\u0119 s\u0142o\u0144ca (2728)<\/b><\/td><td class=\"text-left\">(2728\/1) <a href=\"person.php?n=2728\/1\">Kr\u00f3l - Maziarz Karolina<\/a> - 01:30:53 (2728\/2) <a href=\"person.php?n=2728\/2\">Kokoszka Kamil<\/a> - 00:23:00 (2728\/3) <a href=\"person.php?n=2728\/3\">Kad\u0142ubowska-Mruk Sylwia<\/a> - 00:44:01 (2728\/4) <a href=\"person.php?n=2728\/4\">Kuchta Bartosz<\/a> - 01:59:58 (2728\/5) <a href=\"person.php?n=2728\/5\">Szymczak Adam<\/a> - 00:42:14 <td class=\"t\">05:20:06<\/td><td class=\"st\">-<\/td><\/tr>","last":4650}

Data starts from 4601 to 4650, so there are 50 records inside.

pbr_2021_04650.json file, which looks like below:

{"html":"<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">4651<\/td>[...]<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">4663<\/td><td>1027<\/td><td style=\"font-weight:bold;font-size:14px;\"><b>Levi's Wroc\u0142aw (1027)<\/b><\/td><td class=\"text-left\">(1027\/1) <a href=\"person.php?n=1027\/1\">Czerniak Adam<\/a> - 01:59:58<br>(1027\/2) <a href=\"person.php?n=1027\/2\">Szyma\u0144ski Dominik<\/a> - 01:54:20<br>(1027\/3) <a href=\"person.php?n=1027\/3\">Kasprzyk Patrycja<\/a> - 01:05:47<br>(1027\/4) <a href=\"person.php?n=1027\/4\">\u0141acina Alicja<\/a> - 01:25:53<br>(1027\/5) <a href=\"person.php?n=1027\/5\">Smog\u00f3r Micha\u0142<\/a> - 01:15:25<br><td class=\"t\">07:41:23<\/td><td class=\"st\">-<\/td><\/tr>","last":4663}

{"html":"<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">4651<\/td>[...]<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">4663<\/td><td>1027<\/td><td style=\"font-weight:bold;font-size:14px;\">Levi's Wroc\u0142aw (1027)<\/b><\/td><td class=\"text-left\">(1027\/1) <a href=\"person.php?n=1027\/1\">Czerniak Adam<\/a> - 01:59:58 (1027\/2) <a href=\"person.php?n=1027\/2\">Szyma\u0144ski Dominik<\/a> - 01:54:20 (1027\/3) <a href=\"person.php?n=1027\/3\">Kasprzyk Patrycja<\/a> - 01:05:47 (1027\/4) <a href=\"person.php?n=1027\/4\">\u0141acina Alicja<\/a> - 01:25:53 (1027\/5) <a href=\"person.php?n=1027\/5\">Smog\u00f3r Micha\u0142<\/a> - 01:15:25 <td class=\"t\">07:41:23<\/td><td class=\"st\">-<\/td><\/tr>","last":4663}

Data starts from 4651 to 4663, so the difference is less than expected 50.

pbr_2021_04700.json file, which is really simple:

{"html":"","last":4700}

1

{"html":"","last":4700}

Yes, there is nothing interesting in these files.

When you have all raw data, it's time to extract html code. You can complete this with simple code given below (extract_html.py):

import json
import os

DIR_RAW_DATA = r'raw_data'
DIR_HTML = 'html'

directory = DIR_RAW_DATA

for fileName in os.listdir(directory):
	if fileName.endswith(".json"):
		print('Read file {fileName}'.format(fileName=fileName))
		with open(os.path.join(DIR_RAW_DATA, fileName)) as file_in:
			
			
			fileNameNew = fileName.replace(".json", ".htm")
			
			with open(os.path.join(DIR_HTML, fileNameNew), 'wt') as file_out:
				dictionary = json.load(file_in)
				file_out.write(dictionary["html"])

import json

import os

DIR_RAW_DATA = r'raw_data'

DIR_HTML = 'html'

directory = DIR_RAW_DATA

for fileName in os.listdir(directory):

if fileName.endswith(".json"):

print('Read file {fileName}'.format(fileName=fileName))

with open(os.path.join(DIR_RAW_DATA, fileName)) as file_in:

fileNameNew = fileName.replace(".json", ".htm")

with open(os.path.join(DIR_HTML, fileNameNew), 'wt') as file_out:

dictionary = json.load(file_in)

file_out.write(dictionary["html"])

Unfortunately at some point you will get result similar to the following:

Read file pbr_2021_02250.json
Traceback (most recent call last):
  File "extract_html.py", line 14, in <module>
    dictionary = json.load(file_in)
  File "/home/fulmanp/anaconda3/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/home/fulmanp/anaconda3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/fulmanp/anaconda3/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/fulmanp/anaconda3/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Read file pbr_2021_02250.json

Traceback (most recent call last):

File "extract_html.py", line 14, in <module>

dictionary = json.load(file_in)

File "/home/fulmanp/anaconda3/lib/python3.7/json/__init__.py", line 296, in load

parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

File "/home/fulmanp/anaconda3/lib/python3.7/json/__init__.py", line 348, in loads

return _default_decoder.decode(s)

File "/home/fulmanp/anaconda3/lib/python3.7/json/decoder.py", line 337, in decode

obj, end = self.raw_decode(s, idx=_w(s, 0).end())

File "/home/fulmanp/anaconda3/lib/python3.7/json/decoder.py", line 355, in raw_decode

raise JSONDecodeError("Expecting value", s, err.value) from None

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

This means that there is something wrong with pbr_2021_02250.json file. If you take a closer look into its contents, you will notice unexpected message inside:

<br />
<b>Notice</b>:  Trying to get property of non-object in <b>/var/www/html/businessrun2021/api.php</b> on line <b>161</b><br />
{"html":"<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">2251<\/td><td>4591<\/td>[...]<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">2300<\/td><td>5278<\/td><td style=\"font-weight:bold;font-size:14px;\"><b>Bee Talents (5278)<\/b><\/td><td class=\"text-left\">(5278\/1) <a href=\"person.php?n=5278\/1\">Domachowska Dominka<\/a> - 00:20:12<br>(5278\/2) <a href=\"person.php?n=5278\/2\">Durka Natalia<\/a> - 00:34:39<br>(5278\/3) <a href=\"person.php?n=5278\/3\">Grzesiak Olga<\/a> - 00:25:05<br>(5278\/4) <a href=\"person.php?n=5278\/4\">Ganowicz Igor<\/a> - 00:20:06<br>(5278\/5) <a href=\"person.php?n=5278\/5\">Szulakiewicz Maciej<\/a> - 00:23:48<br><td class=\"t\">02:03:50<\/td><td class=\"st\">-<\/td><\/tr>","last":2300}

Notice: Trying to get property of non-object in /var/www/html/businessrun2021/api.php on line 161

{"html":"<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">2251<\/td><td>4591<\/td>[...]<tr><td data-title=\"Miejsce\" data-breakpoints=\"xs\" style=\"white-space:nowrap;\">2300<\/td><td>5278<\/td><td style=\"font-weight:bold;font-size:14px;\">Bee Talents (5278)<\/b><\/td><td class=\"text-left\">(5278\/1) <a href=\"person.php?n=5278\/1\">Domachowska Dominka<\/a> - 00:20:12 (5278\/2) <a href=\"person.php?n=5278\/2\">Durka Natalia<\/a> - 00:34:39 (5278\/3) <a href=\"person.php?n=5278\/3\">Grzesiak Olga<\/a> - 00:25:05 (5278\/4) <a href=\"person.php?n=5278\/4\">Ganowicz Igor<\/a> - 00:20:06 (5278\/5) <a href=\"person.php?n=5278\/5\">Szulakiewicz Maciej<\/a> - 00:23:48 <td class=\"t\">02:03:50<\/td><td class=\"st\">-<\/td><\/tr>","last":2300}

Remove non-json part:

<br />
<b>Notice</b>:  Trying to get property of non-object in <b>/var/www/html/businessrun2021/api.php</b> on line <b>161</b><br />

1 2	<br /> <b>Notice</b>: Trying to get property of non-object in <b>/var/www/html/businessrun2021/api.php</b> on line <b>161</b><br />

from pbr_2021_02250.json file and again run extract_html.py script. This time your run shouldn't be interrupted by unexpected data.

Finally at this step you should get html code (html.zip) extracted from raw data you grabbed at the beginning (raw_data_original.zip).

Save data in handy data structure

HTML file is not bad data format, but a little bit difficult to further processing, so it would be nice to have all of them in more handy form, like for example csv.

To make this transformation, you have to examine HTML content first. You can select any HTML file, for example pbr_2021_00000.htm:

<tr><td data-title="Miejsce" data-breakpoints="xs" style="white-space:nowrap;">1</td><td>245</td><td style="font-weight:bold;font-size:14px;"><b>#AdamCzerwińskiTeam (245)</b></td><td class="text-left">(245/1) <a href="person.php?n=245/1">Czerwiński Adam</a> - 00:11:35<br>(245/2) <a href="person.php?n=245/2">Piwowarczyk Dominik</a> - 00:12:25<br>(245/3) <a href="person.php?n=245/3">Gajdowski Oliver</a> - 00:12:31<br>(245/4) <a href="person.php?n=245/4">Wrona Daniel</a> - 00:13:16<br>(245/5) <a href="person.php?n=245/5">Ogorzały Tomasz</a> - 00:12:47<br><td class="t">01:02:34</td><td class="st">-</td></tr> [...]

Unfortunatley it is a one very long line. Rearanging HTML for better visualisation you will see:

<tr>
  <td data-title="Miejsce" data-breakpoints="xs" style="white-space:nowrap;">1</td>
  <td>245</td>
  <td style="font-weight:bold;font-size:14px;"><b>#AdamCzerwińskiTeam (245)</b></td>
  <td class="text-left">
    (245/1)<a href="person.php?n=245/1">Czerwiński Adam</a> - 00:11:35<br>
    (245/2)<a href="person.php?n=245/2">Piwowarczyk Dominik</a> - 00:12:25<br>
    (245/3)<a href="person.php?n=245/3">Gajdowski Oliver</a> - 00:12:31<br>
    (245/4)<a href="person.php?n=245/4">Wrona Daniel</a> - 00:13:16<br>
    (245/5)<a href="person.php?n=245/5">Ogorzały Tomasz</a> - 00:12:47<br>
  <td class="t">01:02:34</td>
  <td class="st">-</td>
</tr>
[...]

<tr>

<td style="font-weight:bold;font-size:14px;">#AdamCzerwińskiTeam (245)</td>

(245/1)<a href="person.php?n=245/1">Czerwiński Adam</a> - 00:11:35

(245/2)<a href="person.php?n=245/2">Piwowarczyk Dominik</a> - 00:12:25

(245/3)<a href="person.php?n=245/3">Gajdowski Oliver</a> - 00:12:31

(245/4)<a href="person.php?n=245/4">Wrona Daniel</a> - 00:13:16

(245/5)<a href="person.php?n=245/5">Ogorzały Tomasz</a> - 00:12:47

</tr>

[...]

Now you can see, that this HTML is not perfectly correct, because the closing tag for <td class="text-left"> is missing. To parse this HTML you can use ready to use libraries like BeautifulSoup. For production purposes this is a good approach, but now I want to show you how you can complete this with powerful regular expressions.

First, you will find all rows -- group of characters between <tr> and <tr/> tag. You can use the following regular expression:

(?P<row><tr>.*?</tr>)

1	(?P<row><tr>.*?</tr>)

Why .*? instead of simply .* where . matches any character (except for line terminators) and * matches the previous token between zero and unlimited times, as many times as possible?
Because * is a greede quantifier, while *? is a lazy quantifier. Difference is given on the images below:

You can use Positive Lookbehind (?<=...) and Positive Lookahead (?=...) to extract what is between <tr> and <tr/> tag skiping this tag itself:

(?P<row>(?<=<tr>).*?(?=</tr>))

1	(?P<row>(?<=<tr>).*?(?=</tr>))

This way you extrac exactly one row od data. Now it's time to parse this row itself.

With expression

(?<=<td data).*?>(?P<position>\d*)

1	(?<=<td data).?>(?P<position>\d)

you can extract information about team's position:

Similarly team number:

(?<=<td>)(?P<teamNumber>\d*)

1	(?<=<td>)(?P<teamNumber>\d*)

team name:

(?<=<td style).*?<b>(?P<teamName>.*?)\s\(\d*\)</b>

1	(?<=<td style).?<b>(?P<teamName>.?)\s$\d*$</b>

person order and time:

\(\d*/(?P<personNumber>\d*)\).*?<a.*?>(?P<personName>.*?)</a>.*?(?P<personTime>\d{2}:\d{2}:\d{2})

1	$\d/(?P<personNumber>\d)$.?<a.?>(?P<personName>.?)</a>.?(?P<personTime>\d{2}:\d{2}:\d{2})

and total team time:

(?<=<td class=\"t\">)(?P<totalTime>\d{2}:\d{2}:\d{2})

1	(?<=<td class=\"t\">)(?P<totalTime>\d{2}:\d{2}:\d{2})

Below you have a full code to complete this task:

import os
import re

currentDirectory = os.path.abspath(os.path.dirname(__file__))
currentDirectory = os.path.join(currentDirectory, "html")
    

def processFile(filePath, fileToSaveData):
  
  reRow = re.compile("(?P<row>(?<=<tr>).*?(?=</tr>))")
  rePosition = re.compile("(?<=<td data).*?>(?P<position>\d*)")
  reTeamNumber = re.compile("(?<=<td>)(?P<teamNumber>\d*)")
  reTeamName = re.compile("(?<=<td style).*?<b>(?P<teamName>.*?)\s\(\d*\)</b>")
  rePersonalData = re.compile("\(\d*/(?P<personNumber>\d*)\).*?<a.*?>(?P<personName>.*?)</a>.*?(?P<personTime>\d{2}:\d{2}:\d{2})")
  reTotalTime = re.compile("(?<=<td class=\"t\">)(?P<totalTime>\d{2}:\d{2}:\d{2})")
  
  with open(filePath) as fileIn:
    lines = fileIn.readlines()
    text = "".join(lines)
    result = re.findall(reRow, text)
    
    if result:      
      for row in result:
        position = re.findall(rePosition, row)
        teamNumber = re.findall(reTeamNumber, row)
        teamName = re.findall(reTeamName, row)        
        totalTime = re.findall(reTotalTime, row)
        personalData = re.findall(rePersonalData, row)
        r = f"{position[0]},{teamNumber[0]},{teamName[0]},{totalTime[0]}"
        
        person = []
        time = []
        
        for _person in personalData:
          _number, _name, _time = _person
          person.append((_number, _name))
          time.append(_time)
          
        for t in time:
          r += f",{t}"
          
        for p in person:
          r += f",{p[0]},{p[1]}"
        
        fileToSaveData.write(r + "\n")  


with open("data.csv", 'w') as fileToSaveData:  
  for fileName in os.listdir(currentDirectory):
    if fileName.endswith(".htm"):
      filePath = os.path.join(currentDirectory, fileName)
      size = os.path.getsize(filePath)
      if size == 0:
        print("Skip file {} because of its size {}B".format(fileName, str(size)))
      else:
        processFile(filePath, fileToSaveData)
    else:
      print("Skip file {} as non-HTML file".format(fileName))

import os

import re

currentDirectory = os.path.abspath(os.path.dirname(__file__))

currentDirectory = os.path.join(currentDirectory, "html")

def processFile(filePath, fileToSaveData):

reRow = re.compile("(?P<row>(?<=<tr>).*?(?=</tr>))")

rePosition = re.compile("(?<=<td data).*?>(?P<position>\d*)")

reTeamNumber = re.compile("(?<=<td>)(?P<teamNumber>\d*)")

reTeamName = re.compile("(?<=<td style).*?(?P<teamName>.*?)\s$\d*$")

rePersonalData = re.compile("$\d*/(?P<personNumber>\d*)$.*?<a.*?>(?P<personName>.*?)</a>.*?(?P<personTime>\d{2}:\d{2}:\d{2})")

reTotalTime = re.compile("(?<=<td class=\"t\">)(?P<totalTime>\d{2}:\d{2}:\d{2})")

with open(filePath) as fileIn:

lines = fileIn.readlines()

text = "".join(lines)

result = re.findall(reRow, text)

if result:

for row in result:

position = re.findall(rePosition, row)

teamNumber = re.findall(reTeamNumber, row)

teamName = re.findall(reTeamName, row)

totalTime = re.findall(reTotalTime, row)

personalData = re.findall(rePersonalData, row)

r = f"{position[0]},{teamNumber[0]},{teamName[0]},{totalTime[0]}"

person = []

time = []

for _person in personalData:

_number, _name, _time = _person

person.append((_number, _name))

time.append(_time)

for t in time:

r += f",{t}"

for p in person:

r += f",{p[0]},{p[1]}"

fileToSaveData.write(r + "\n")

with open("data.csv", 'w') as fileToSaveData:

for fileName in os.listdir(currentDirectory):

if fileName.endswith(".htm"):

filePath = os.path.join(currentDirectory, fileName)

size = os.path.getsize(filePath)

if size == 0:

print("Skip file {} because of its size {}B".format(fileName, str(size)))

else:

processFile(filePath, fileToSaveData)

else:

print("Skip file {} as non-HTML file".format(fileName))

When tested on first two records (saved in html_record_1_2.htm file and html directory):

<tr><td data-title="Miejsce" data-breakpoints="xs" style="white-space:nowrap;">1</td><td>245</td><td style="font-weight:bold;font-size:14px;"><b>#AdamCzerwińskiTeam (245)</b></td><td class="text-left">(245/1) <a href="person.php?n=245/1">Czerwiński Adam</a> - 00:11:35<br>(245/2) <a href="person.php?n=245/2">Piwowarczyk Dominik</a> - 00:12:25<br>(245/3) <a href="person.php?n=245/3">Gajdowski Oliver</a> - 00:12:31<br>(245/4) <a href="person.php?n=245/4">Wrona Daniel</a> - 00:13:16<br>(245/5) <a href="person.php?n=245/5">Ogorzały Tomasz</a> - 00:12:47<br><td class="t">01:02:34</td><td class="st">-</td></tr><tr><td data-title="Miejsce" data-breakpoints="xs" style="white-space:nowrap;">2</td><td>3887</td><td style="font-weight:bold;font-size:14px;"><b>MANUFAKTURA MIESZKAŃ (3887)</b></td><td class="text-left">(3887/1) <a href="person.php?n=3887/1">Sobczyk Patryk</a> - 00:13:17<br>(3887/2) <a href="person.php?n=3887/2">Jastrzębski Sebastian</a> - 00:12:32<br>(3887/3) <a href="person.php?n=3887/3">Gorzelańczyk Paweł</a> - 00:14:32<br>(3887/4) <a href="person.php?n=3887/4">Graczyk Adrian</a> - 00:12:30<br>(3887/5) <a href="person.php?n=3887/5">Mikołajczak Hubert</a> - 00:14:29<br><td class="t">01:07:20</td><td class="st">-</td></tr>

<tr><td data-title="Miejsce" data-breakpoints="xs" style="white-space:nowrap;">1</td><td>245</td><td style="font-weight:bold;font-size:14px;">#AdamCzerwińskiTeam (245)</td><td class="text-left">(245/1) <a href="person.php?n=245/1">Czerwiński Adam</a> - 00:11:35 (245/2) <a href="person.php?n=245/2">Piwowarczyk Dominik</a> - 00:12:25 (245/3) <a href="person.php?n=245/3">Gajdowski Oliver</a> - 00:12:31 (245/4) <a href="person.php?n=245/4">Wrona Daniel</a> - 00:13:16 (245/5) <a href="person.php?n=245/5">Ogorzały Tomasz</a> - 00:12:47 <td class="t">01:02:34</td><td class="st">-</td></tr><tr><td data-title="Miejsce" data-breakpoints="xs" style="white-space:nowrap;">2</td><td>3887</td><td style="font-weight:bold;font-size:14px;">MANUFAKTURA MIESZKAŃ (3887)</td><td class="text-left">(3887/1) <a href="person.php?n=3887/1">Sobczyk Patryk</a> - 00:13:17 (3887/2) <a href="person.php?n=3887/2">Jastrzębski Sebastian</a> - 00:12:32 (3887/3) <a href="person.php?n=3887/3">Gorzelańczyk Paweł</a> - 00:14:32 (3887/4) <a href="person.php?n=3887/4">Graczyk Adrian</a> - 00:12:30 (3887/5) <a href="person.php?n=3887/5">Mikołajczak Hubert</a> - 00:14:29 <td class="t">01:07:20</td><td class="st">-</td></tr>

it returns the following result:

macbook-piotr:talk_with_data fulmanp$ ls -l
total 24
-rw-r--r--@ 1 fulmanp  staff  1045 31 paź 08:29 data_grabber.py
-rw-r--r--@ 1 fulmanp  staff   493 31 paź 08:29 extract_html.py
-rw-r--r--@ 1 fulmanp  staff  2027  4 sty 10:43 htm2csv.py
drwxr-xr-x@ 3 fulmanp  staff    96  4 sty 10:47 html
macbook-piotr:talk_with_data fulmanp$ python3 htm2csv.py 
macbook-piotr:talk_with_data fulmanp$ ls -l
total 32
-rw-r--r--  1 fulmanp  staff   362  4 sty 10:49 data.csv
-rw-r--r--@ 1 fulmanp  staff  1045 31 paź 08:29 data_grabber.py
-rw-r--r--@ 1 fulmanp  staff   493 31 paź 08:29 extract_html.py
-rw-r--r--@ 1 fulmanp  staff  2027  4 sty 10:43 htm2csv.py
drwxr-xr-x@ 3 fulmanp  staff    96  4 sty 10:47 html
macbook-piotr:talk_with_data fulmanp$ wc data.csv 
       2      13     362 data.csv
macbook-piotr:talk_with_data fulmanp$ cat data.csv 
1,245,#AdamCzerwińskiTeam,01:02:34,00:11:35,00:12:25,00:12:31,00:13:16,00:12:47,1,Czerwiński Adam,2,Piwowarczyk Dominik,3,Gajdowski Oliver,4,Wrona Daniel,5,Ogorzały Tomasz
2,3887,MANUFAKTURA MIESZKAŃ,01:07:20,00:13:17,00:12:32,00:14:32,00:12:30,00:14:29,1,Sobczyk Patryk,2,Jastrzębski Sebastian,3,Gorzelańczyk Paweł,4,Graczyk Adrian,5,Mikołajczak Hubert

macbook-piotr:talk_with_data fulmanp$ ls -l

total 24

-rw-r--r--@ 1 fulmanp staff 1045 31 paź 08:29 data_grabber.py

-rw-r--r--@ 1 fulmanp staff 493 31 paź 08:29 extract_html.py

-rw-r--r--@ 1 fulmanp staff 2027 4 sty 10:43 htm2csv.py

drwxr-xr-x@ 3 fulmanp staff 96 4 sty 10:47 html

macbook-piotr:talk_with_data fulmanp$ python3 htm2csv.py

macbook-piotr:talk_with_data fulmanp$ ls -l

total 32

-rw-r--r-- 1 fulmanp staff 362 4 sty 10:49 data.csv

-rw-r--r--@ 1 fulmanp staff 1045 31 paź 08:29 data_grabber.py

-rw-r--r--@ 1 fulmanp staff 493 31 paź 08:29 extract_html.py

-rw-r--r--@ 1 fulmanp staff 2027 4 sty 10:43 htm2csv.py

drwxr-xr-x@ 3 fulmanp staff 96 4 sty 10:47 html

macbook-piotr:talk_with_data fulmanp$ wc data.csv

2 13 362 data.csv

macbook-piotr:talk_with_data fulmanp$ cat data.csv

1,245,#AdamCzerwińskiTeam,01:02:34,00:11:35,00:12:25,00:12:31,00:13:16,00:12:47,1,Czerwiński Adam,2,Piwowarczyk Dominik,3,Gajdowski Oliver,4,Wrona Daniel,5,Ogorzały Tomasz

2,3887,MANUFAKTURA MIESZKAŃ,01:07:20,00:13:17,00:12:32,00:14:32,00:12:30,00:14:29,1,Sobczyk Patryk,2,Jastrzębski Sebastian,3,Gorzelańczyk Paweł,4,Graczyk Adrian,5,Mikołajczak Hubert

If everything is correct, you can remove html_record_1_2.htm file from the html directory and put there all your html file from previous step.

macbook-piotr:talk_with_data fulmanp$ python3 htm2csv.py 
Skip file pbr_2021_04700.htm because of its size 0B
Skip file pbr_2021_05350.htm because of its size 0B
Skip file pbr_2021_05150.htm because of its size 0B
Skip file pbr_2021_04850.htm because of its size 0B
Skip file .DS_Store as non-HTML file
Skip file pbr_2021_05050.htm because of its size 0B
Skip file pbr_2021_04950.htm because of its size 0B
Skip file pbr_2021_05250.htm because of its size 0B
Skip file pbr_2021_05300.htm because of its size 0B
Skip file pbr_2021_04750.htm because of its size 0B
Skip file pbr_2021_05100.htm because of its size 0B
Skip file pbr_2021_04800.htm because of its size 0B
Skip file pbr_2021_05400.htm because of its size 0B
Skip file pbr_2021_05000.htm because of its size 0B
Skip file pbr_2021_04900.htm because of its size 0B
Skip file pbr_2021_05200.htm because of its size 0B
macbook-piotr:talk_with_data fulmanp$ wc data.csv 
    4663   36473  812210 data.csv

macbook-piotr:talk_with_data fulmanp$ python3 htm2csv.py

Skip file pbr_2021_04700.htm because of its size 0B

Skip file pbr_2021_05350.htm because of its size 0B

Skip file pbr_2021_05150.htm because of its size 0B

Skip file pbr_2021_04850.htm because of its size 0B

Skip file .DS_Store as non-HTML file

Skip file pbr_2021_05050.htm because of its size 0B

Skip file pbr_2021_04950.htm because of its size 0B

Skip file pbr_2021_05250.htm because of its size 0B

Skip file pbr_2021_05300.htm because of its size 0B

Skip file pbr_2021_04750.htm because of its size 0B

Skip file pbr_2021_05100.htm because of its size 0B

Skip file pbr_2021_04800.htm because of its size 0B

Skip file pbr_2021_05400.htm because of its size 0B

Skip file pbr_2021_05000.htm because of its size 0B

Skip file pbr_2021_04900.htm because of its size 0B

Skip file pbr_2021_05200.htm because of its size 0B

macbook-piotr:talk_with_data fulmanp$ wc data.csv

4663 36473 812210 data.csv

In case of any problems, you can download my data.csv file.

At this moment, you can make any data analysis you want. Next part of this material is devoted to some analysis, but of course you are free to squeeze as much information as you only can.

Data analysis

Loading data

In this part I will use Google Colab(oratory). I'm not a big fan of a such solutions, but for didactic purposes it is realy perfect.

Start with uploading data.csv file to your Google Drive; I saved it in pbr_2021 directory:

Next you have to login to your Google Colab and start new notebook either just afer login (left image below) or later selecting right option from top menu (right image below):


Starting new notebook just after login (left) or later selecting right option from top menu (right)

Name it pbr_2021:

First, most obvious task, is loading our data. Display all your files:

Connect to your Google Drive:

When you see your Google Drive files in the left pane (file explorer), right click on the file that you need to import and select Copy path. Then import as usual in pandas, using this copied path:

import pandas as pd 
df=pd.read_csv('/content/drive/MyDrive/pbr_2021/data.csv')

1 2	import pandas as pd df=pd.read_csv('/content/drive/MyDrive/pbr_2021/data.csv')

and run it. Ups...

ParserError: Error tokenizing data. C error: Expected 19 fields in line 61, saw 21

1	ParserError: Error tokenizing data. C error: Expected 19 fields in line 61, saw 21

Checking you will see:

2611,3405,Run, Forecast, run!,02:07:57,00:16:05,00:17:51,00:48:27,00:21:53,00:23:41,1,Błasiak Jacek,2,Wzgarda Marcin,3,Skałban Anna,4,Małgorzaciak Liliana,5,Nowak Paweł

1	2611,3405,Run, Forecast, run!,02:07:57,00:16:05,00:17:51,00:48:27,00:21:53,00:23:41,1,Błasiak Jacek,2,Wzgarda Marcin,3,Skałban Anna,4,Małgorzaciak Liliana,5,Nowak Paweł

Funny! As you can see, in this case team's name is Run, Forecast, run!. Commas in team's name are interpreted as field separator!

Go back to htm2csv.py file and replace , with ;. Find following lines:

r = f"{position[0]},{teamNumber[0]},{teamName[0]},{totalTime[0]}"
[...]
r += f",{t}"
[...]
r += f",{p[0]},{p[1]}"

r = f"{position[0]},{teamNumber[0]},{teamName[0]},{totalTime[0]}"

[...]

r += f",{t}"

[...]

r += f",{p[0]},{p[1]}"

and do replacement six times:

r = f"{position[0]};{teamNumber[0]};{teamName[0]};{totalTime[0]}"
[...]
r += f";{t}"
[...]
r += f";{p[0]};{p[1]}"

r = f"{position[0]};{teamNumber[0]};{teamName[0]};{totalTime[0]}"

[...]

r += f";{t}"

[...]

r += f";{p[0]};{p[1]}"

After this modification you can run your script again and check results:

macbook-piotr:talk_with_data fulmanp$ python3 htm2csv.py 
Skip file pbr_2021_04700.htm because of its size 0B
Skip file pbr_2021_05350.htm because of its size 0B
Skip file pbr_2021_05150.htm because of its size 0B
Skip file pbr_2021_04850.htm because of its size 0B
Skip file .DS_Store as non-HTML file
Skip file pbr_2021_05050.htm because of its size 0B
Skip file pbr_2021_04950.htm because of its size 0B
Skip file pbr_2021_05250.htm because of its size 0B
Skip file pbr_2021_05300.htm because of its size 0B
Skip file pbr_2021_04750.htm because of its size 0B
Skip file pbr_2021_05100.htm because of its size 0B
Skip file pbr_2021_04800.htm because of its size 0B
Skip file pbr_2021_05400.htm because of its size 0B
Skip file pbr_2021_05000.htm because of its size 0B
Skip file pbr_2021_04900.htm because of its size 0B
Skip file pbr_2021_05200.htm because of its size 0B
macbook-piotr:talk_with_data fulmanp$ head data.csv 
3251;910;Arjo Finansjera;02:17:45;00:21:56;00:28:03;00:30:59;00:24:49;00:31:58;1;Morkowski Maciej;2;Półrolniczak Jarosław;3;Juśkiewicz Sylwia;4;Świercz Marta;5;Łechtańska Klaudia
3252;2711;KIUSIU;02:17:45;00:37:27;00:18:20;00:24:15;00:26:00;00:31:43;1;Górska Weronika;2;Łysowski Damian;3;Kitowski Kamil;4;Kowalczyk Maria;5;Sprawka Agnieszka
3253;3413;Zabiegani;02:17:45;00:26:50;00:28:32;00:26:14;00:29:00;00:27:09;1;Michałek Daniel;2;Tok Monika;3;Kostrzewska Agata;4;Ryczek Małgorzata;5;Radkowska Magdalena
3254;5475;Ferrero 10;02:17:45;00:44:33;00:24:22;00:23:01;00:18:39;00:27:10;1;Jagiełło Edyta;2;Szymańska Aleksandra;3;Chmielewski Piotr;4;Chmielewski Robert;5;Chmielewska Wioleta
3255;705;Laguny z finansów;02:17:47;00:34:47;00:19:31;00:27:49;00:28:36;00:27:04;1;Broniarz Paulina;2;Strupiechowska Marlena;3;Tarkowska Katarzyna;4;Banat Alina;5;Kwapisiewicz Sylwia
3256;3044;ForLogistic.com;02:17:47;00:33:17;00:30:59;00:20:59;00:21:07;00:31:25;1;Lis Albert;2;Mikołajczyk Zlatina;3;Gasiul Damian;4;Bohoriev Anatolii;5;Woźniak Maciej
3257;3383;PreZero Recyklersi;02:17:47;00:48:02;00:24:30;00:24:48;00:20:30;00:19:57;1;Krasicka Adrianna;2;Majerczak Kamil;3;Dudziak Michał;4;Grzelczyk Krzysztof;5;Gierczak Marek
3258;4182;Team Jerome;02:17:47;00:21:10;00:27:29;00:24:40;00:39:42;00:24:46;1;Lemparty Mariusz;2;Kulik-Kiernoz Katarzyna;3;Lafuite Jerome;4;Czemerska Anna;5;Pykacz Aleksandra
3259;794;HR Beat;02:17:48;00:26:15;00:25:10;00:30:15;00:28:00;00:28:08;1;Styperek Przemysław;2;Murawski Marcin;3;Sołowiński Adrian;4;Mierzbiczak Karina;5;Piotrowski Bartłomiej
3260;4054;Stretto Team STRACIŁEM WĘCH;02:17:49;00:17:51;00:27:55;00:34:52;00:30:43;00:26:28;1;Kmieć Gerard;2;Cichocki Andrzej;3;Ciesielska Renata;4;Kwiatkowski Jakub;5;Góral Grzegorz
macbook-piotr:talk_with_data fulmanp$ cat data.csv | grep Forecast
2611;3405;Run, Forecast, run!;02:07:57;00:16:05;00:17:51;00:48:27;00:21:53;00:23:41;1;Błasiak Jacek;2;Wzgarda Marcin;3;Skałban Anna;4;Małgorzaciak Liliana;5;Nowak Paweł

macbook-piotr:talk_with_data fulmanp$ python3 htm2csv.py

Skip file pbr_2021_04700.htm because of its size 0B

Skip file pbr_2021_05350.htm because of its size 0B

Skip file pbr_2021_05150.htm because of its size 0B

Skip file pbr_2021_04850.htm because of its size 0B

Skip file .DS_Store as non-HTML file

Skip file pbr_2021_05050.htm because of its size 0B

Skip file pbr_2021_04950.htm because of its size 0B

Skip file pbr_2021_05250.htm because of its size 0B

Skip file pbr_2021_05300.htm because of its size 0B

Skip file pbr_2021_04750.htm because of its size 0B

Skip file pbr_2021_05100.htm because of its size 0B

Skip file pbr_2021_04800.htm because of its size 0B

Skip file pbr_2021_05400.htm because of its size 0B

Skip file pbr_2021_05000.htm because of its size 0B

Skip file pbr_2021_04900.htm because of its size 0B

Skip file pbr_2021_05200.htm because of its size 0B

macbook-piotr:talk_with_data fulmanp$ head data.csv

3251;910;Arjo Finansjera;02:17:45;00:21:56;00:28:03;00:30:59;00:24:49;00:31:58;1;Morkowski Maciej;2;Półrolniczak Jarosław;3;Juśkiewicz Sylwia;4;Świercz Marta;5;Łechtańska Klaudia

3252;2711;KIUSIU;02:17:45;00:37:27;00:18:20;00:24:15;00:26:00;00:31:43;1;Górska Weronika;2;Łysowski Damian;3;Kitowski Kamil;4;Kowalczyk Maria;5;Sprawka Agnieszka

3253;3413;Zabiegani;02:17:45;00:26:50;00:28:32;00:26:14;00:29:00;00:27:09;1;Michałek Daniel;2;Tok Monika;3;Kostrzewska Agata;4;Ryczek Małgorzata;5;Radkowska Magdalena

3254;5475;Ferrero 10;02:17:45;00:44:33;00:24:22;00:23:01;00:18:39;00:27:10;1;Jagiełło Edyta;2;Szymańska Aleksandra;3;Chmielewski Piotr;4;Chmielewski Robert;5;Chmielewska Wioleta

3255;705;Laguny z finansów;02:17:47;00:34:47;00:19:31;00:27:49;00:28:36;00:27:04;1;Broniarz Paulina;2;Strupiechowska Marlena;3;Tarkowska Katarzyna;4;Banat Alina;5;Kwapisiewicz Sylwia

3256;3044;ForLogistic.com;02:17:47;00:33:17;00:30:59;00:20:59;00:21:07;00:31:25;1;Lis Albert;2;Mikołajczyk Zlatina;3;Gasiul Damian;4;Bohoriev Anatolii;5;Woźniak Maciej

3257;3383;PreZero Recyklersi;02:17:47;00:48:02;00:24:30;00:24:48;00:20:30;00:19:57;1;Krasicka Adrianna;2;Majerczak Kamil;3;Dudziak Michał;4;Grzelczyk Krzysztof;5;Gierczak Marek

3258;4182;Team Jerome;02:17:47;00:21:10;00:27:29;00:24:40;00:39:42;00:24:46;1;Lemparty Mariusz;2;Kulik-Kiernoz Katarzyna;3;Lafuite Jerome;4;Czemerska Anna;5;Pykacz Aleksandra

3259;794;HR Beat;02:17:48;00:26:15;00:25:10;00:30:15;00:28:00;00:28:08;1;Styperek Przemysław;2;Murawski Marcin;3;Sołowiński Adrian;4;Mierzbiczak Karina;5;Piotrowski Bartłomiej

3260;4054;Stretto Team STRACIŁEM WĘCH;02:17:49;00:17:51;00:27:55;00:34:52;00:30:43;00:26:28;1;Kmieć Gerard;2;Cichocki Andrzej;3;Ciesielska Renata;4;Kwiatkowski Jakub;5;Góral Grzegorz

macbook-piotr:talk_with_data fulmanp$ cat data.csv | grep Forecast

2611;3405;Run, Forecast, run!;02:07:57;00:16:05;00:17:51;00:48:27;00:21:53;00:23:41;1;Błasiak Jacek;2;Wzgarda Marcin;3;Skałban Anna;4;Małgorzaciak Liliana;5;Nowak Paweł

In case of any problems, you can download my data.csv file.

Upload new csv file to Google drive and again try to load data. Now you encounter similar problem but in line 2168:

ParserError: Error tokenizing data. C error: Expected 19 fields in line 2168, saw 20

1	ParserError: Error tokenizing data. C error: Expected 19 fields in line 2168, saw 20

Let's check what is wrong in this case.

macbook-piotr:talk_with_data fulmanp$ sed -n '2168p' data.csv 
3818;132;ToŁosieSuperktosieCzyNie?;02:33:48;00:19:55;00:23:34;00:41:06;00:29:46;00:39:27;1;Navratil Jan;2;Gbur Daniel;3;Szleger Michał;4;Kukucz-Marszalkowska Anna;5;Guja Natalia

1 2	macbook-piotr:talk_with_data fulmanp$ sed -n '2168p' data.csv 3818;132;ToŁosieSuperktosieCzyNie?;02:33:48;00:19:55;00:23:34;00:41:06;00:29:46;00:39:27;1;Navratil Jan;2;Gbur Daniel;3;Szleger Michał;4;Kukucz-Marszalkowska Anna;5;Guja Natalia

Hmm... this line looks good. I will try to load less data

import pandas as pd 
df=pd.read_csv('/content/drive/MyDrive/pbr_2021/data.csv', sep=";", header = None, nrows = 2167)
df

import pandas as pd

df=pd.read_csv('/content/drive/MyDrive/pbr_2021/data.csv', sep=";", header = None, nrows = 2167)

As you can see, last loaded row is for team AssecoBS 4, but when you check this line in terminal, you will see something different:

macbook-piotr:talk_with_data fulmanp$ sed -n '2167p' data.csv 
3817;3918;Simply the best;02:33:46;00:27:55;00:33:56;00:24:27;00:34:36;00:32:52;1;Wierzbicka Emilia;2;Stojak Małgorzata;3;Bareja Agnieszka;4;Buczak Marta;5;Rumin Anna

1 2	macbook-piotr:talk_with_data fulmanp$ sed -n '2167p' data.csv 3817;3918;Simply the best;02:33:46;00:27:55;00:33:56;00:24:27;00:34:36;00:32:52;1;Wierzbicka Emilia;2;Stojak Małgorzata;3;Bareja Agnieszka;4;Buczak Marta;5;Rumin Anna

I have no idea, why row number reported in Google Colab is different than line number in csv file. It's strange but you have to find correct line. I will use grep with -n option too find line number of last correctly loaded line:

macbook-piotr:talk_with_data fulmanp$ cat data.csv | grep -n 'AssecoBS 4'
2809:659;1677;AssecoBS 4;01:44:07;00:21:09;00:15:54;00:21:54;00:19:09;00:26:01;1;Depta Konrad;2;Domeracki Damian;3;Maj Błażej;4;Skowroński Michał;5;Żeligowska Michalina

1 2	macbook-piotr:talk_with_data fulmanp$ cat data.csv \| grep -n 'AssecoBS 4' 2809:659;1677;AssecoBS 4;01:44:07;00:21:09;00:15:54;00:21:54;00:19:09;00:26:01;1;Depta Konrad;2;Domeracki Damian;3;Maj Błażej;4;Skowroński Michał;5;Żeligowska Michalina

Ok, so you have to check line number 2810:

macbook-piotr:talk_with_data fulmanp$ sed -n '2810p' data.csv
660;839;Running is my life ;-);01:44:09;00:19:58;00:19:54;00:20:03;00:21:27;00:22:47;1;Foremski Dawid;2;Michalik Krzysztof;3;Reducha Radosław;4;Klasik-Koscielny Sabina;5;Lincoln Thomas

1 2	macbook-piotr:talk_with_data fulmanp$ sed -n '2810p' data.csv 660;839;Running is my life ;-);01:44:09;00:19:58;00:19:54;00:20:03;00:21:27;00:22:47;1;Foremski Dawid;2;Michalik Krzysztof;3;Reducha Radosław;4;Klasik-Koscielny Sabina;5;Lincoln Thomas

And you have the culprit: this time team's name is Running is my life ;-) with ; character inside!

Once again you have to change field separator. This time I will use ^ character.

Anyway, I was curious this line number differences, so I loaded csv with commas into spredsheet and I have found this:

What?!

The problem starts in row number 418:

1918;5211;"Znakomicie, wszyscy!”;01:59:50;00:25:11;00:24:55;00:25:02;00:19:09;00:25:33;1;Ewa Czaja;2;Ewa Lipka;3;Parkita Justyna;4;Maciej Stachura;5;Michał Mendala

1	1918;5211;"Znakomicie, wszyscy!”;01:59:50;00:25:11;00:24:55;00:25:02;00:19:09;00:25:33;1;Ewa Czaja;2;Ewa Lipka;3;Parkita Justyna;4;Maciej Stachura;5;Michał Mendala

and ends in row number 1060:

3660;2102;INGowe "pędziwiatry";02:27:43;00:22:04;00:46:01;00:25:25;00:23:32;00:30:41;1;Szymonowicz Nikola;2;Fabian Angelika;3;Erdmański Igor;4;Kowolik Adrianna;5;Juszczyk Bartosz

1	3660;2102;INGowe "pędziwiatry";02:27:43;00:22:04;00:46:01;00:25:25;00:23:32;00:30:41;1;Szymonowicz Nikola;2;Fabian Angelika;3;Erdmański Igor;4;Kowolik Adrianna;5;Juszczyk Bartosz

Making fast computations:

2810 - 2167 = 643
1060 - 418 = 642

1 2	2810 - 2167 = 643 1060 - 418 = 642

If you look carefully into team's name in row 418: "Znakomicie, wszyscy!” you will see that two different chars are used to delimit string: opening is " while closing is ”, so this is why all next rows till row number 1060 is treated as one long string.

To fix this, you have to manually replace ” with " to the form:

1918^5211^"Znakomicie, wszyscy!"^01:59:50^00:25:11^00:24:55^00:25:02^00:19:09^00:25:33^1^Ewa Czaja^2^Ewa Lipka^3^Parkita Justyna^4^Maciej Stachura^5^Michał Mendala

1	1918^5211^"Znakomicie, wszyscy!"^01:59:50^00:25:11^00:24:55^00:25:02^00:19:09^00:25:33^1^Ewa Czaja^2^Ewa Lipka^3^Parkita Justyna^4^Maciej Stachura^5^Michał Mendala

Now you can upload fixed file to Google drive and check if it loads:

import pandas as pd 
df=pd.read_csv('/content/drive/MyDrive/pbr_2021/data.csv', sep="^", header = None)
df

import pandas as pd

df=pd.read_csv('/content/drive/MyDrive/pbr_2021/data.csv', sep="^", header = None)

Hooray! 4663 rows are loaded successfully which agrees with total number of rows in this file:

macbook-piotr:talk_with_data fulmanp$ wc data.csv 
    4663   36473  812208 data.csv

1 2	macbook-piotr:talk_with_data fulmanp$ wc data.csv 4663 36473 812208 data.csv

Very basic analysis

Display some informations about data types used to represent the data:

df.info()

1

df.info()
Extract column at index 3 which is a column with total teams' time:

df[3]

1

df[3]
Convert data to timedelta type to be able to perform time calculations:

s = pd.to_timedelta(df[3]) s

1
2

s = pd.to_timedelta(df[3])
s
Convert timedelta to number of seconds:

ts = s.dt.total_seconds() ts

1
2

ts = s.dt.total_seconds()
ts
Do some simple histograms:

ts.hist()

1

ts.hist()

ts.hist(bins=50)

1

ts.hist(bins=50)

ts.hist(bins=500)

1

ts.hist(bins=500)
Calculate mean:

secondsMean = ts.mean() secondsMean

1
2

secondsMean = ts.mean()
secondsMean
Convert seconds to time delta:

pd.to_timedelta(secondsMean, unit='s')

1

pd.to_timedelta(secondsMean, unit='s')
Calculate mode:

ts.mode()

1

ts.mode()

secondsMode = ts.mode()[0] pd.to_timedelta(secondsMode, unit='s')

1
2

secondsMode = ts.mode()[0]
pd.to_timedelta(secondsMode, unit='s')