Parsing Common Crawl in 4 plain scripts in python (not 2)
After starting the CC mini-project in our last post, we ran into several challenges, all of which we more or less resolved (or avoided altogether).
In the end, the list of challenges looked like this (see detailed explanations below):
- The amount of data turned out to be ~10–20x larger than expected;
- The structure of the indexes — the CC index looks alphabetically ordered, but the domains in the WET / WARC files are not (see some charts below), i.e. the data in question is more or less “uniformly” distributed across the WET files (there are no huge WET blobs with 100% Russian sites);
- Concurrency / download speed;
- CPU-bound pre-processing — the pre-processing turned out to be ~4x more resource-intensive than expected;
How we solved them:
- Well, since such a corpus is to be used in combination with other corpora, there is no need to download terabytes of text; I guess ~100–200 GB of text will suffice;
- This is just how CC works, but this can be mitigated by processing the most “important” WET files first;
- Just order the fattest download speed you can get from your ISP and load files in as many threads as you can. Funnily enough, in Moscow this is not a bottleneck at all. Just look at the tariffs available — 500 Mbit/s connections are available for essentially USD 10–15 per month (and you would pay USD 5–10 for your Internet anyway, so this is not a real sunk cost);
- This is a tougher one — you need a lot of CPU cores. But in my case the pipeline's output was ~30–50 GB of text per day on 50% of my home machine (~3 physical cores of my Intel® Core™ i7-6800K CPU @ 3.40GHz), which is sufficient, I guess. I underestimated the post-processing, but I underestimated the amount of data even more (texts compress well);
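The multi-threaded download part can be sketched roughly like this (a minimal sketch only; the `download_all` helper, its parameters and the `fetch` callback are illustrative, not the actual pipeline code):

```python
# Minimal sketch of concurrent downloads; `fetch` is a placeholder
# for whatever function retrieves a single URL (e.g. via requests).
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, fetch, workers=16):
    """Fetch many URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

With a fat enough connection, `workers` can simply be raised until either the bandwidth or the CPU becomes the bottleneck.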
As usual, the new scripts are available here.
What the end result looks like
Yes, you can drop the url / domain / tld column.
But disk space is essentially free nowadays )
You can see that there are a lot of short texts from website navigation, but they are easily filtered by length.
Sentence length distribution on a small sample
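The length filter mentioned above can be as simple as this (a sketch; the 40-character threshold is an assumption for illustration, not the value used in the pipeline):

```python
# Drop short navigation-like snippets; min_len is an assumed threshold.
def filter_short(texts, min_len=40):
    return [t for t in texts if len(t) >= min_len]
```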
I have no intention whatsoever of checking this scientifically, but my rough guess is that 1–5% of websites are dedicated to pornography / prostitution (just a guess from looking at some random data samples).
Some more insights into the CC structure
This table shows how many unique URLs with Russian texts from our final table were contained in how many unique WET files, per top-level domain zone. You can see that all popular domain zones are distributed evenly across all of the 71,520 WET files.
This is the distribution of the number of times a given WET file appeared in the above URL list.
We can see that there are several distinct peaks, but obviously you should start your CC crawl processing with the more frequent files. My guess is that they will contain much more relevant data, i.e. there is some underlying logic to the way the WET/WARC files are ordered domain-zone wise, but this is not apparent to me now.
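Prioritizing the most frequent WET files boils down to a frequency count over the URL index. A sketch (the `(url, wet_filename)` pair format is an assumption about the index layout, not the exact schema used by the scripts):

```python
from collections import Counter

def rank_wet_files(index_rows):
    """index_rows: iterable of (url, wet_filename) pairs.
    Returns WET filenames, most frequently referenced first."""
    counts = Counter(wet for _, wet in index_rows)
    return [wet for wet, _ in counts.most_common()]
```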
Overall pipeline walk-through
python3 parse_cc_index.py # (1)
python3 save_cc_indexes.py # (2)
python3 prepare_wet_indexes.py # (3)
python3 process_wet_files.py # (4)
- (1) downloads all of the 299 CC indexes and filters out the URLs by language;
- (2) just takes these 299 files and saves them into 10 feather files for convenience;
- (3) calculates 2 things — a set of unique URLs for later use and a list of WET file URLs;
- (4) just goes through the WET files one by one (in a multi-processed fashion, of course), downloads them, parses and cleans them, splits the texts into sentences and saves the results into feather files;
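For step (4), a decompressed WET file is just a concatenation of records: each starts with a `WARC/1.0` line, then headers, a blank line, and the extracted plain text. A simplified splitter (a sketch only; real code should use a proper WARC library and handle gzip):

```python
# Simplified WET record splitter; assumes the file is already
# decompressed into a single string. Real WET files are gzipped
# and better handled by a WARC parsing library.
def split_wet_records(wet_text):
    records = []
    for chunk in wet_text.split("WARC/1.0")[1:]:
        headers, _, body = chunk.partition("\r\n\r\n")
        if "WARC-Type: conversion" in headers:  # plain-text records only
            records.append(body.strip())
    return records
```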
Further optimization, processing speed
Obviously it will differ hugely depending on your Internet connection and CPU power, but in our case:
- Downloading the whole index took approximately 1 week at an average speed of around 300–500 kb/s. It was done in the background at the office, with no hurry;
- Downloading and processing ~1,000 WET files (those with the largest Russian content ratio) took ~15 hours and produced ~25–30 GB of texts on 3 physical cores of my Intel® Core™ i7-6800K CPU @ 3.40GHz. So I guess it is safe to say that 1 day roughly equals a 40–50 GB corpus on half of my home PC. Also, for this task the bandwidth was almost a non-issue in my case;
How the script can be improved:
- Download files in advance and / or add a queue to collect the data and a separate queue to post-process it. This would help you utilize your bandwidth and CPU power at 100% capacity at all times. I personally decided not to invest in this, as it is easier to just wait a bit; also, my CPU will not just suddenly grow in size;
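The queue idea above can be sketched with a bounded queue feeding worker threads (illustrative only; `fetch` and `process` are placeholders for the download and post-processing steps, and the queue size is an arbitrary choice):

```python
# Sketch of a download queue feeding post-processing workers.
import queue
import threading

def run_pipeline(items, fetch, process, workers=4):
    q = queue.Queue(maxsize=workers * 2)  # bounded: downloads stay a bit ahead
    results = []
    lock = threading.Lock()

    def consumer():
        while True:
            item = q.get()
            if item is None:  # sentinel: no more work
                break
            out = process(item)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()
    for it in items:
        q.put(fetch(it))  # produce downloads while workers process
    for _ in threads:
        q.put(None)       # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

The bounded queue keeps downloads slightly ahead of processing without buffering the whole crawl in memory.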
Previous articles in this mini-series:
- Parsing Wikipedia in 4 plain commands in python;
- Previous article about parsing the Common crawl;
- A gist with the scripts;
A list of useful Common Crawl starter links:
- Getting links to WET files;
- Reanimated python3 WARC file library;
Originally published at spark-in.me on October 8, 2018.