Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval
Nowadays this is as easy as:
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
wget http://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2
python3 WikiExtractor.py -o ../data/wiki/ --no-templates --processes 8 ../data/ruwiki-latest-pages-articles.xml.bz2
following it up with this plain post-processing script (the script is published as a gist; it is kept functional and concise so that it is easy to extend).
This will produce a plain .csv file with your corpus.
- The URL http://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2 can be tweaked to obtain a dump for your language (e.g. enwiki instead of ruwiki); for details see the official Wikipedia dumps page linked below;
- For wikiextractor params simply see its man page (I suspect that even its official docs are out of date);
The post-processing script turns Wikipedia files into a table like this:
article_uuid is pseudo-unique and sentence order is supposed to be preserved.
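For a quick sanity check you can load the resulting file with pandas. A minimal sketch, assuming the corpus was saved as ../data/wiki_corpus.csv with sentence and proc_sentence columns (hypothetical names; only article_uuid comes from the description above):

```python
import pandas as pd

# hypothetical file and column names; adapt to your output
df = pd.read_csv('../data/wiki_corpus.csv')

print(df.columns.tolist())            # e.g. ['article_uuid', 'sentence', 'proc_sentence']
print(df['article_uuid'].nunique())   # roughly one uuid per article
# sentences of a single article keep their original order in the file
first_uuid = df['article_uuid'].iloc[0]
print(df.loc[df['article_uuid'] == first_uuid, 'sentence'].head())
```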
Arguably, the current state of ML instruments enables practitioners to build and deliver scalable NLP pipelines within days. The problem arises only if you do not have a trustworthy public dataset / pre-trained embeddings / language model. This article aims to make this a bit easier by illustrating that preparing a Wikipedia corpus (the most common corpus for word vector training in NLP) is feasible in just a couple of hours. So, if you spend 2 days building a model, why spend much more time engineering some crawler to get the data?
High-level script walk-through
The WikiExtractor tool saves Wikipedia articles in plain-text format, separated into <doc> blocks. This can be easily leveraged using the following logic:
- Obtain the list of all output files;
- Split the files into articles;
- Remove any remaining HTML tags and special characters;
- Use nltk.sent_tokenize to split articles into sentences;
- To avoid code bulk, keep things simple by assigning a uuid to each article;
For text pre-processing I just used the following (change it to fit your needs); a rough sketch of the whole post-processing step is shown after this list:
- Removing non-alphanumeric characters;
- Removing stop words;
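To make the walk-through concrete, here is a minimal sketch of such a post-processing step. The published gist is the authoritative version; the output path, the sentence / proc_sentence column names and the exact cleaning rules below are illustrative assumptions, so adapt them to your data.

```python
import glob
import re
import uuid

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

# requires: nltk.download('punkt') and nltk.download('stopwords')
stop_words = set(stopwords.words('russian'))

def preprocess(text):
    # drop non-alphanumeric characters and stop words
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return ' '.join(w for w in text.split() if w not in stop_words)

rows = []
# assumption: WikiExtractor wrote its output under ../data/wiki/ in the default AA/wiki_00 layout
for path in sorted(glob.glob('../data/wiki/*/wiki_*')):
    with open(path, encoding='utf-8') as f:
        raw = f.read()
    # every article sits inside a <doc ...> ... </doc> block
    for article in re.findall(r'<doc.*?>(.*?)</doc>', raw, flags=re.DOTALL):
        # strip any remaining HTML tags
        article = re.sub(r'<[^>]+>', ' ', article)
        article_uuid = uuid.uuid4()
        for sentence in sent_tokenize(article):
            rows.append((article_uuid, sentence.strip(), preprocess(sentence)))

df = pd.DataFrame(rows, columns=['article_uuid', 'sentence', 'proc_sentence'])
df.to_csv('../data/wiki_corpus.csv', index=False)
```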
I have the dataset, what to do next?
Usually people use one of the following instruments for the most common applied NLP task — embedding a word:
- Word vectors / embeddings;
- Some internal states of CNNs / RNNs pre-trained on some tasks like fake sentence detection / language modeling / classification;
- A combination of the above;
It has also been shown multiple times that plain averaged word embeddings (with minor details we will omit for now) are also a great baseline for the sentence / phrase embedding task.
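As an illustration, such a baseline takes only a few lines of code. A minimal sketch, assuming the pre-trained 300-dimensional fastText .vec file for your language (in word2vec text format) is available locally:

```python
import numpy as np
from gensim.models import KeyedVectors

# assumption: wiki.ru.vec is the pre-trained fastText file in word2vec text format
vectors = KeyedVectors.load_word2vec_format('wiki.ru.vec')

def sentence_embedding(sentence, dim=300):
    # plain average of word vectors; a simple yet strong baseline
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

print(sentence_embedding('пример предложения из википедии').shape)  # (300,)
```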
Some other common use cases
- Using random sentences from Wikipedia for negatives mining while using triplet loss (a tiny sketch follows this list);
- Training sentence encoders via fake sentence detection;
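For the first use case, negatives mining can be as simple as sampling sentences from other articles; the file and column names below are the same hypothetical ones as in the sketch above:

```python
import pandas as pd

# hypothetical path and column names from the post-processing sketch above
df = pd.read_csv('../data/wiki_corpus.csv')

def sample_negatives(anchor_uuid, n=5):
    # random sentences from other articles act as negatives for a triplet loss
    other = df.loc[df['article_uuid'] != anchor_uuid, 'sentence']
    return other.sample(n).tolist()
```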
Some charts for Russian Wikipedia
[Chart: sentence length distribution in Russian Wikipedia, in natural terms (truncated)]
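If you want to reproduce a similar chart for your own dump, a minimal sketch (again assuming the hypothetical CSV and column names from above) is:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('../data/wiki_corpus.csv')  # hypothetical path / column names
lengths = df['sentence'].str.split().str.len()

# truncate the long tail so the bulk of the distribution is visible
lengths[lengths <= 50].hist(bins=50)
plt.xlabel('sentence length, words')
plt.ylabel('number of sentences')
plt.show()
```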
- fastText word vectors pre-trained on Wikipedia;
- fastText and Word2Vec models for the Russian language;
- The awesome WikiExtractor library for Python;
- Official Wikipedia dumps page;
- Our post-processing script;
- Seminal word embedding papers: Word2Vec, fastText, further tuning;
- Some of the current SOTA approaches: InferSent / generative pre-training / ULMFiT / Deep contextualized word representations (ELMo);
- ImageNet moment in NLP?
- Sentence embedding baselines 1, 2, 3, 4;
- Fake Sentence Detection as a Training Task for Sentence Encoding;
Originally published at spark-in.me on October 3, 2018.