Parsing Common Crawl in 4 plain scripts in python

Parsing Common Crawl in 4 plain scripts in python (not 2)

After starting the CC mini-project in our last post, we ran into several challenges, all of which we more or less resolved (or avoided altogether).

In the end, the full pipeline looks like this (see detailed explanations below):

python3 parse_cc_index.py
python3 save_cc_indexes.py
python3 prepare_wet_indexes.py
python3 process_wet_files.py

New challenges:

How we solved them:

As usual, the new scripts are available here.

What the end result looks like

Yes, you can drop the url / domain / tld columns.
But disk space is essentially free nowadays.

You can see that there are a lot of short texts from website navigation, but they are easily filtered by length.
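For illustration, such a length filter is a couple of lines with pandas; the texts.feather file name, the text column name and the 100-character cutoff below are assumptions for the sake of the example, not the exact values used in the scripts.

import pandas as pd

# Hypothetical file and column names - adjust to whatever your final table uses
df = pd.read_feather('texts.feather')

MIN_CHARS = 100  # arbitrary cutoff, tune it on your own sample

# Short strings are mostly menus, button labels and other navigation
df = df[df['text'].str.len() > MIN_CHARS].reset_index(drop=True)
print('{} texts left after the length filter'.format(len(df)))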

Sentence length distribution on a small sample

I have no intention whatsoever of checking this scientifically, but my rough guess is that 1–5% of websites are dedicated to pornography / prostitution (just a guess from looking at some random data samples).

Some more insights into the CC structure

This table shows, per top-level domain zone, how many unique URLs with Russian texts ended up in our final table and how many unique WET files they were contained in. You can see that all popular domain zones are spread evenly across all of the 71,520 WET files.

Unique Russian URLs
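A table like this can be put together with a single pandas group-by; the column names url, tld and wet_file below are assumptions about how the final table is laid out.

import pandas as pd

# Hypothetical layout of the final table: one row per extracted text
df = pd.read_feather('final_table.feather')  # columns: url, tld, wet_file, text, ...

# Unique Russian URLs and unique WET files per top-level domain zone
per_zone = (
    df.groupby('tld')
      .agg({'url': 'nunique', 'wet_file': 'nunique'})
      .rename(columns={'url': 'unique_urls', 'wet_file': 'unique_wet_files'})
      .sort_values('unique_urls', ascending=False)
)
print(per_zone.head(10))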

This is the distribution of the number of times a given WET file appears in the above URL list:

We can see that there are several distinct peaks, but obviously you should start your CC crawl processing with the most frequent files. My guess is that they will contain much more relevant data, i.e. there is some underlying logic to how the WET/WARC files are ordered domain-zone wise, but it is not apparent to me yet.
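If you want to reproduce that ordering, counting how often each WET file shows up in your URL index and going through the most frequent ones first is trivial with collections.Counter; the wet_file column name here is again an assumption about the final table.

from collections import Counter
import pandas as pd

# Hypothetical index table: one row per Russian URL, with the WET file it lives in
df = pd.read_feather('final_table.feather')
counts = Counter(df['wet_file'])

# Process the most "Russian-heavy" WET files first
processing_order = [wet_file for wet_file, _ in counts.most_common()]
print(processing_order[:5])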

Overall pipeline walk-through

python3 parse_cc_index.py # (1)
python3 save_cc_indexes.py # (2)
python3 prepare_wet_indexes.py # (3)
python3 process_wet_files.py # (4)

Pipeline explanation:
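As a rough illustration of what step (4) boils down to, here is a minimal sketch that streams a single WET file with warcio and keeps the records that look Russian. The Cyrillic-ratio heuristic, the 0.5 threshold and the placeholder file name are assumptions for the sake of the example, not necessarily what process_wet_files.py actually does.

from warcio.archiveiterator import ArchiveIterator

def cyrillic_ratio(text):
    # Crude language heuristic: share of Cyrillic characters among all letters
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum('\u0400' <= c <= '\u04FF' for c in letters) / len(letters)

# Placeholder local file name - download a WET file from Common Crawl first
# (the full list of WET paths is published as wet.paths.gz for each crawl)
wet_path = 'some-wet-file.warc.wet.gz'

russian_docs = []
with open(wet_path, 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'conversion':  # WET text records have this type
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        text = record.content_stream().read().decode('utf-8', errors='ignore')
        if cyrillic_ratio(text) > 0.5:
            russian_docs.append((url, text))

print('{} Russian documents in this WET file'.format(len(russian_docs)))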

Further optimization, processing speed

Obviously it will differ hugely depending on your Internet connection and CPU power, but in our case:

  • Downloading the whole index took approximately 1 week at an average speed of around 300–500 kb/s. It was done in the background at the office, with no hurry;
  • Downloading and processing ~1,000 WET files (those with the largest share of Russian content) took ~15 hours and produced ~25–30 GB of texts on 3 physical cores of my Intel® Core™ i7-6800K CPU @ 3.40GHz. So I guess it is safe to say that 1 day roughly equals a 40–50 GB corpus on half of my home PC. Also, for this task the bandwidth was almost a non-issue in my case;

How the script can be improved:

  • Download files in advance and / or add a queue to collect the data and a separate queue to post-process it. This will help you utilize your bandwidth and CPU power at close to 100% of their capacity at all times (a rough sketch follows right after this list). I personally decided not to invest in this, as it is easier just to wait a bit, and my CPU will not just suddenly grow in size;
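Here is a rough shape of that download / post-process split with two multiprocessing queues; this is only a sketch of the idea, not code from the repo, and extract_texts is a hypothetical stub (e.g. the warcio loop above).

import multiprocessing as mp
import os
import urllib.request

CC_PREFIX = 'https://data.commoncrawl.org/'  # base URL for Common Crawl data over HTTPS

def extract_texts(local_file):
    # Hypothetical helper: parse the WET file and return the texts you care about
    return []

def downloader(wet_paths, download_q):
    # Producer: fetch WET files ahead of time so the CPU never waits on the network
    for path in wet_paths:
        local_file = os.path.basename(path)
        urllib.request.urlretrieve(CC_PREFIX + path, local_file)
        download_q.put(local_file)
    download_q.put(None)  # poison pill for the consumer

def processor(download_q, result_q):
    # Consumer: CPU-bound post-processing runs in its own process
    while True:
        local_file = download_q.get()
        if local_file is None:
            break
        result_q.put(extract_texts(local_file))
        os.remove(local_file)  # keep the disk footprint small
    result_q.put(None)

if __name__ == '__main__':
    wet_paths = []  # your prioritized list of WET paths from wet.paths.gz
    download_q, result_q = mp.Queue(maxsize=10), mp.Queue()
    mp.Process(target=downloader, args=(wet_paths, download_q)).start()
    mp.Process(target=processor, args=(download_q, result_q)).start()
    for texts in iter(result_q.get, None):
        pass  # write texts to disk here

The maxsize on the download queue bounds how many files sit on disk waiting to be processed, so a fast connection cannot outrun a slow CPU.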


Originally published at spark-in.me on October 8, 2018.
