Parsing Common Crawl in 2 plain scripts in Python

TLDR

  • Go to the Common Crawl website;
  • Download the index (~200 GB);
  • Choose domains in your country / language (the index now also includes language detection =) ) — see the filtering sketch after this list;
  • Download only the plain-text WET files you need (I would suspect that since websites are crawled alphabetically, single-language websites will cluster into several hundred or thousand large files);
  • Profit;

The two scripts:

parse_cc_index.py
process_wet_files.py
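To make the index-filtering step concrete, here is a minimal sketch (not the actual parse_cc_index.py, just an illustration) of scanning downloaded cdx-NNNNN.gz index shards for a given language / TLD. It assumes the usual index line layout of a SURT key, a timestamp and a JSON blob with fields such as url, languages, filename, offset and length; the shard glob and the rus / .ru filters are purely illustrative.

```python
import glob
import gzip
import json
from urllib.parse import urlsplit


def iter_index_records(shard_path):
    """Yield the JSON part of every line in one gzipped cdx index shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8", errors="ignore") as f:
        for line in f:
            # Each line looks like: "<SURT key> <timestamp> <JSON blob>"
            try:
                _, _, payload = line.split(" ", 2)
                yield json.loads(payload)
            except ValueError:
                continue  # skip the occasional malformed line


def filter_index(shards_glob="cdx-*.gz", languages=("rus",), tlds=(".ru",)):
    """Keep records whose detected language or host TLD matches the filters."""
    for shard in sorted(glob.glob(shards_glob)):
        for rec in iter_index_records(shard):
            url = rec.get("url", "")
            host = urlsplit(url).netloc
            langs = rec.get("languages", "")  # e.g. "rus" or "rus,eng"
            if any(lang in langs for lang in languages) or any(
                host.endswith(tld) for tld in tlds
            ):
                # filename points at the WARC file; offset / length locate the record
                yield rec.get("filename"), rec.get("offset"), rec.get("length"), url


if __name__ == "__main__":
    for i, row in enumerate(filter_index()):
        print(row)
        if i >= 10:
            break
```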

Motivation

In case you have not seen the previous post in the series, about mining Wikipedia for an NLP corpus in 4 commands in Python, check it out.

Common Crawl structure

When I began thinking about implementing something like this, I assumed it would be prohibitively difficult and / or expensive. After all, you do not see Common Crawl pop up very often in research / blogs / news nowadays (I guess this is partially due to copyright concerns, or to the fact that the biggest NLP research is sponsored by Google / FAIR, which can afford to collect their own data). In practice, though, the structure turned out to be quite simple:

  • WARC files with full HTML and headers (~68TB total, ~71k files);
  • WET files with plain text only (~7TB, ~71k files);
  • To get a WET file path from the corresponding WARC path, add .wet before the .gz suffix (see the sketch after this list);
  • Literally that is it!
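A minimal sketch of that path manipulation, assuming the usual CC-MAIN layout in which the directory segment also changes from /warc/ to /wet/ and the files are served from the public https://data.commoncrawl.org/ endpoint (verify both against the paths in your copy of the index):

```python
from urllib.parse import urljoin

# Assumption: public Common Crawl HTTP endpoint; adjust if you pull from S3 instead.
CC_BASE_URL = "https://data.commoncrawl.org/"


def warc_path_to_wet_url(warc_path: str) -> str:
    """Turn a WARC path from the index into a WET (plain-text) download URL.

    Illustrative layout (check against your index):
      crawl-data/CC-MAIN-2018-17/segments/<seg>/warc/<name>.warc.gz
      -> crawl-data/CC-MAIN-2018-17/segments/<seg>/wet/<name>.warc.wet.gz
    """
    wet_path = warc_path.replace("/warc/", "/wet/").replace(".warc.gz", ".warc.wet.gz")
    return urljoin(CC_BASE_URL, wet_path)


if __name__ == "__main__":
    # Purely illustrative path, not a real segment name.
    print(warc_path_to_wet_url(
        "crawl-data/CC-MAIN-2018-17/segments/000/warc/CC-MAIN-000-00000.warc.gz"
    ))
```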
To read the WARC / WET format from Python 3, I used the warc3 fork of the warc library:

git clone https://github.com/erroneousboat/warc3.git
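Once a WET file is on disk, pulling the plain text out of it takes only a few lines. The snippet below is a sketch rather than the actual process_wet_files.py, and it assumes the warc3 fork keeps the API of the original warc package (warc.open, iteration over records, record.header, record.payload); double-check against the fork's README.

```python
import warc  # provided by the warc3 checkout above


def iter_plain_text(wet_path):
    """Yield (url, text) pairs from one downloaded *.warc.wet.gz file."""
    with warc.open(wet_path) as f:
        for record in f:
            # WET files contain one "conversion" record per page plus a file-level header record.
            if record.header.get("warc-type") != "conversion":
                continue
            url = record.header.get("warc-target-uri")
            text = record.payload.read().decode("utf-8", errors="ignore")
            yield url, text


if __name__ == "__main__":
    # Hypothetical local file name, just for illustration.
    for url, text in iter_plain_text("CC-MAIN-2018-17-00000.warc.wet.gz"):
        print(url, len(text))
```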
