Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval

Nowadays this is as easy as:

git clone https://github.com/attardi/wikiextractor
cd wikiextractor
wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2 -P ../data/
python WikiExtractor.py -o ../data/wiki/ --no-templates --processes 8 ../data/ruwiki-latest-pages-articles.xml.bz2

Follow it up with this plain post-processing script [5] (the script is published as a gist; it is kept functional and concise for easy extension).


This will produce a plain .csv file with your corpus.


  • The commands can be tweaked to download the dump for your language; for details see [4];
  • For WikiExtractor params, simply check its help output (I suspect that even its official docs are out of date);

The post-processing script turns Wikipedia files into a table like this:

article_uuid is pseudo-unique, and sentence order within each article is supposed to be preserved.


Arguably, the state of current ML instruments enables practitioners [8] to build and deliver scalable NLP pipelines within days. Problems arise only if you do not have a trustworthy public dataset / pre-trained embeddings / language model. This article aims to make this a bit easier by illustrating that preparing a Wikipedia corpus (the most common corpus for word vector training in NLP) is feasible in just a couple of hours. So, if it takes 2 days to build a model, why spend much more time engineering a crawler to get the data?

High-level script walk-through

The WikiExtractor tool saves Wikipedia articles in plain-text format, separated into <doc> blocks. This can easily be leveraged with the following logic:

  • Obtain the list of all output files;
  • Split the files into articles;
  • Remove any remaining HTML tags and special characters;
  • Use nltk.sent_tokenize to split into sentences;
  • Assign a uuid to each article, which keeps the code simple;

For text pre-processing I used just the following (change it to fit your needs):

  • Removing non-alphanumeric characters;
  • Removing stop words;
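A minimal version of this pre-processing might look like the following. The stop-word list here is a tiny illustrative stub; in practice you would use a proper list for your language, e.g. nltk's stopwords corpus:

```python
import re

# Illustrative stub; replace with e.g. nltk.corpus.stopwords.words('russian')
STOP_WORDS = {'a', 'an', 'the', 'and', 'or', 'of', 'in', 'to', 'is'}

def preprocess(sentence):
    # Remove non-alphanumeric characters, keeping word boundaries
    sentence = re.sub(r'[^\w\s]', ' ', sentence.lower())
    # Remove stop words
    tokens = [t for t in sentence.split() if t not in STOP_WORDS]
    return ' '.join(tokens)
```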

I have the dataset, what to do next?

Usually people use one of the following instruments for the most common applied NLP task — embedding a word:

  • Word vectors / embeddings [6];
  • Some internal states of CNNs / RNNs pre-trained on some tasks like fake sentence detection / language modeling / classification [7];
  • A combination of the above;

It has also been shown multiple times [9] that plain averaged word embeddings (with minor details we will omit for now) are also a great baseline for the sentence / phrase embedding task.
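That baseline can be sketched in a few lines. The toy 4-dimensional vectors below stand in for real pre-trained embeddings such as the fastText vectors from [1] or [2]:

```python
import numpy as np

# Toy vectors for illustration; in practice load pre-trained embeddings
word_vectors = {
    'cat': np.array([1.0, 0.0, 0.0, 0.0]),
    'sat': np.array([0.0, 1.0, 0.0, 0.0]),
    'mat': np.array([0.0, 0.0, 1.0, 0.0]),
}

def sentence_embedding(tokens, vectors, dim=4):
    # Plain average of the word vectors of the known tokens;
    # out-of-vocabulary tokens are simply skipped
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)
```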

Some other common use cases

  • Using random sentences from Wikipedia for negatives mining while using triplet loss;
  • Training sentence encoders via fake sentence detection [10];
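For the triplet-loss case, drawing negatives from the corpus can be as simple as sampling sentences from other articles. A sketch over the (article_uuid, sentence) pairs from the .csv produced above (it assumes the corpus contains more than one article):

```python
import random

def sample_negative(rows, anchor_uuid, rng=random):
    # rows: list of (article_uuid, sentence) pairs, e.g. read from the .csv.
    # Returns a random sentence drawn from a different article than the anchor.
    while True:
        uuid_, sentence = rng.choice(rows)
        if uuid_ != anchor_uuid:
            return sentence
```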

Some charts for Russian Wikipedia

[Chart: sentence length distribution in Russian Wikipedia, in natural terms (truncated) and in log10 scale]


  1. fastText word vectors pre-trained on Wikipedia;
  2. fastText and Word2Vec models for the Russian language;
  3. Awesome wiki extractor library for python;
  4. Official Wikipedia dumps page;
  5. Our post-processing script;
  6. Seminal word embedding papers: Word2Vec, fastText, further tuning;
  7. Some of the current SOTA CNN-based approaches: InferSent / generative pre-training of CNNs / ULMFiT / Deep contextualized word representations (ELMo);
  8. Imagenet moment in NLP?
  9. Sentence embedding baselines 1, 2, 3, 4;
  10. Fake Sentence Detection as a Training Task for Sentence Encoding;

Originally published on October 3, 2018.

