Deep FastText. You can also use this trick with transformers

Comparing complex NLP models for complex languages on a set of real tasks

Does it make sense to pre-train transformer-based models? Can you do better than BPE? Which architecture is better for which task?

Alexander Veysov
Towards Data Science
13 min read · Mar 7, 2019


This will be one of those articles where we do not publish "plug and play" code, but instead ramble endlessly about what we tried and where we failed (or maybe not?), and, most importantly, how this fits into the broader picture.

Over the course of the last 3–6 months we have tried various models for basic NLP tasks (classification, sequence-to-sequence modeling, machine comprehension) for one morphologically rich language: Russian. This is kind of cool, because we can inherit the vast literature and codebase of modern NLP methods (RNNs, LSTMs, embeddings, PyTorch, FastText, LASER, to name a few).

But there are usually 3 pain points:

  • Russian is morphologically rich and much more complex than English;
  • Public benchmarks / datasets / corpora are very scarce;
  • NLP research on Russian is mostly non-existent (by which I primarily mean applied benchmarks done by real practitioners). AFAIK, the state of the art in Russian-English NMT is held by a Chinese team;

These are our conclusions so far:

  • Easy tasks. Despite the widely advertised improvements made by transformers on a plethora of NLP tasks (I would consider 10-class classification on an academic dataset with 5–10k rows an "easy task"), we failed to reap REAL benefits from transformers on such "easy" tasks, given the computational cost, added complexity, and inference time. Proper RNN / CNN based models with attention quickly become capped by the bias in the data, not by task complexity;
  • There was one case where, on a 30-class dialogue intent classification task, the "baseline" model gave 51% top-1 accuracy whereas the transformer gave around 57%. But after further investigation and cleaning (once we removed the bias from the data and reduced the number of categories to ~15–20), this boost evaporated, i.e. the transformer was simply better at fitting the biased data.
  • Also, remarkably, generative pre-training (a la BERT) in our setting actually performs worse than simply initializing an embedding bag layer with FastText n-gram vectors (given the time, effort and computational resources required, it is a no-brainer that you should not pre-train such models yourself).
  • By the way, in my opinion such "easy" tasks mostly include: classification, tagging, intent classification, search classification, POS tagging, and NER (provided, of course, you have a proper dataset).
  • Sequence-to-sequence tasks. I guess NMT belongs here, but it is a topic in itself that we did not even tackle. On a semi-difficult task of typo correction in human names in the CIS states (if a fat LSTM takes a couple of days to converge, that counts, right? And do not dismiss the task, read the post, it is not as easy as it sounds), we simply did not have enough patience to wait for the transformers to converge. I would estimate they converged ~5x slower. We did not do any pre-training there, because the domain is very different;
  • Really difficult tasks. Here, for Russian, I could only dig up the so-called SberSQUAD. Note that on SQUAD itself transformers now hold the best EM (exact match) scores and exceed human performance, floating around 90% EM. The state-of-the-art model custom-built for this particular task gives around 60% EM according to DeepPavlov (of course, we do not know whether that figure is train or validation, or whether validation was handled correctly). But here it gets interesting:
  • A back-of-the-envelope baseline model built in one day could achieve around 15% EM;
  • I could fit a transformer with an embedding bag (initialized with FastText) up to 37–40% EM with a high learning rate of 1e-3 (45 hours of training on a single 1080 Ti);
  • When we added generative pre-training, the model started converging several times faster, but no matter what we tried we could not beat 30% EM (the model just overfitted eventually). So pre-training with our embedding bag approach works, but plain FastText initialization is better in the end;
  • Inference benchmarks. Simply benchmarking decently-sized transformers (300–400 hidden size, 12 attention heads) on CPU inference gives, as a rule of thumb, around 10x the inference time of LSTMs on the same data, without speed optimizations (just padding). A rough sketch of this kind of timing comparison follows this list.
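To make the rule of thumb above concrete, here is a minimal, purely illustrative timing sketch in PyTorch; the layer counts, head counts and batch shapes are stand-ins, not our exact benchmark setup:

```python
# A rough, illustrative CPU timing comparison of a transformer encoder vs. an LSTM.
# Sizes are stand-ins (e.g. 8 heads instead of 12, so that 400 divides evenly).
import time
import torch
import torch.nn as nn

BATCH, SEQ_LEN, HIDDEN = 32, 128, 400

encoder_layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, dim_feedforward=4 * HIDDEN)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=6).eval()
lstm = nn.LSTM(input_size=HIDDEN, hidden_size=HIDDEN, num_layers=2, bidirectional=True).eval()

x = torch.randn(SEQ_LEN, BATCH, HIDDEN)  # (seq_len, batch, features), already padded

def seconds_per_batch(model, inputs, runs=10):
    with torch.no_grad():
        model(inputs)                    # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(inputs)
    return (time.perf_counter() - start) / runs

print('transformer, s / batch:', seconds_per_batch(transformer, x))
print('lstm,        s / batch:', seconds_per_batch(lstm, x))
```

The exact gap depends heavily on sequence length and CPU threading, which is why we only quote a rough 10x figure.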

Naturally, take all of the conclusions above with a grain of salt. And yes, probably all of this just means one of the following things:

  • We made some bugs in the code (though we took our time and checked everything thoroughly);
  • Since we mostly used 1m 150-dimensional embeddings (to fit into memory) with a transformer with a hidden size of 400, size probably matters;
  • Most likely, pre-training with the equivalent of 99 days on a 1080 Ti also matters;
  • Also, probably due to an imbalanced model (1m n-grams, a shallow transformer) and the lack of an attention layer in our embedding bag, our model is essentially bottlenecked and/or somehow overfits during pre-training;

Recent trends in NLP models

You may not know that there are 2 large groups of "complicated" languages: agglutinative and fusional. English itself is mostly an analytic language (in English, most meaning is conveyed by way of helper words: particles, prepositions, etc.).

I am no linguistics expert, but Russian has a lot of agglutinative and fusional features. We have inflection, conjugation, morphemes, and a lot of ways to mix all of this together!

So, on a very basic level every word in Russian can be split like this:

Subword units in Russian

And yes, especially with verbs: for popular roots there exist verbs with almost every popular prefix, and they, surprise-surprise, have very different meanings. If you know German, you will instantly understand me.

And of course, on top of word formation, we also have inflection and conjugation:

Inflection in Russian

Word representations in modern NLP

Word vectors revolutionized the NLP scene back when the original Word2Vec paper came out. But they became truly usable for morphologically rich languages only when FastText and its awesome implementation were introduced.

If you are not familiar with Word2Vec: it essentially learns N-dimensional (usually N=300) word vectors from raw text, namely from the words that surround the word in question. FastText takes this idea a bit further and learns subword vectors instead of word vectors, so a word is just a weighted average of its subwords. If you are more interested in the technical side: nowadays you do not even need custom engineering to implement a basic w2v model, it can easily be done with the embedding layers of modern deep learning frameworks, as sketched below.
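To make the "it is just embedding layers" point concrete, here is a toy skip-gram-with-negative-sampling sketch; the vocabulary size, dimensions and the random (center, context, negative) ids are purely illustrative:

```python
# Toy skip-gram with negative sampling: a word2vec-style model is just two embedding
# layers plus dot products. Data pipeline (windowing, subsampling) is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, vocab_size=50_000, dim=300):
        super().__init__()
        self.center = nn.Embedding(vocab_size, dim)   # "input" vectors
        self.context = nn.Embedding(vocab_size, dim)  # "output" vectors

    def forward(self, center_ids, context_ids, negative_ids):
        c = self.center(center_ids)                                # (batch, dim)
        pos = (c * self.context(context_ids)).sum(-1)              # (batch,)
        neg = torch.bmm(self.context(negative_ids),
                        c.unsqueeze(-1)).squeeze(-1)               # (batch, k)
        return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

model = SkipGram()
loss = model(torch.randint(0, 50_000, (32,)),      # center words
             torch.randint(0, 50_000, (32,)),      # observed context words
             torch.randint(0, 50_000, (32, 5)))    # 5 negative samples each
loss.backward()
```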

But all of this is fine and dandy in theory. In reality, the closest word to cat is ... dog. Coffee is related to cup because coffee is a beverage often drunk from a cup, but coffee is not similar to cup: coffee is a beverage and a cup is a container. That is because these vectors DO NOT CAPTURE MEANING. They capture CONTEXT. Of course, you can use approaches like this one, where you essentially train something similar to an auto-encoder or a GAN to encode dictionary word definitions into vectors. But this seems a bit high-maintenance for most cases.

Also, prior to FastText, doing anything practical and deployable with Word2Vec was a pain for languages like Russian because:

  • For a simple task you may have a vocabulary of 50-100k;
  • For real production tasks, your vocabulary will most likely be in the 1m–10m range;
  • The whole Russian internet contains ~100m unique words;

I did not run any benchmarks for English, but I would guess the same figures would be around 5x–10x lower, which still does not make certain tasks tractable. In the mainstream NLP literature this particular vocabulary problem is somewhat alleviated by so-called byte pair encoding (BPE), which essentially learns how to split words efficiently into chunks.

Theoretically, BPE should work nicely with agglutinative languages because their morphemes do not change. The only problem is that Russian has a lot of fusional features as well. It makes sense that BPE might not work well for NMT in such a language, but what about more down-to-earth tasks?

I tried BPE on a variety of tasks (usually with several vocabulary sizes like 30k / 100k / 500k), and the best I got was a small decrease in accuracy, around 1–3pp, compared to char-level and embedding bag approaches. I asked around in the community, and, surprise-surprise, I am not alone in observing that BPE kind of does not work for Russian. Of course, maybe one should also balance the model size in this case, but in my case the only two criteria were: (i) the model should be fast at runtime and (ii) the model should converge quickly during training. With modern GPUs and RAM there is little difference between a 50k and a 250k vocabulary, unless you are doing NMT or language modelling, of course.
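For reference, this is roughly how one would train and apply such a BPE vocabulary with the sentencepiece library; the corpus path and vocabulary size are placeholders, and the example split is not guaranteed, since it depends on the training corpus:

```python
# Sketch of training / applying a BPE vocabulary with the sentencepiece library.
# 'corpus_ru.txt' and the 30k vocab size are illustrative placeholders.
import sentencepiece as spm

# Train a 30k BPE model on a raw-text corpus (one sentence per line).
spm.SentencePieceTrainer.Train(
    '--input=corpus_ru.txt --model_prefix=bpe_ru --vocab_size=30000 --model_type=bpe'
)

sp = spm.SentencePieceProcessor()
sp.Load('bpe_ru.model')

# Words get split into frequency-based chunks that need not align with morphemes.
print(sp.EncodeAsPieces('передумала'))  # e.g. ['▁пере', 'дума', 'ла'] -- splits vary by corpus
```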

What is also puzzling: apart from FastText, of course, I have rarely seen mainstream papers that try to build word embedding solutions that somehow take rich morphology into consideration. This Russian-Chinese NMT paper is an exception, but they only predict word endings separately.

Attention, transformers, generative pre-training

All of these are very important concepts.

But I will not bother copy-pasting the same text and illustrations; I will just redirect you to the best pages I know of that explain these concepts:

In essence, to solve applied NLP tasks nowadays, you have the following options to choose from:

  • An embedding layer to represent words or chars is a must-have;
  • Vocabulary: char-level / Fixed / BPE-based approaches / Embedding bag / FastText-based approaches;
  • Embedding initialization: Word2Vec or GloVe / FastText / a mixture of the above;
  • The model itself:
  • (1) TCN / RNN + simple self-attention;
  • (2) Encoder-decoder based sequence to sequence model;
  • (3) Transformer;

Just select the correct combination and you are good to go!
Our task was mainly to understand how transformers (3) fit into this ecosystem: BPE or not, how to initialize the embeddings, how benchmark tasks are solved, etc. (A minimal sketch of option (1) follows below.)
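For example, option (1) can be as small as this; a minimal sketch with arbitrary sizes, where the token ids are assumed to come from whichever vocabulary option you picked above:

```python
# Minimal sketch of option (1): an RNN encoder with a simple self-attention pooling
# layer on top, for classification. Sizes are arbitrary stand-ins.
import torch
import torch.nn as nn

class AttentionRNNClassifier(nn.Module):
    def __init__(self, vocab_size=100_000, emb_dim=300, hidden=256, num_classes=20):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # scalar score per time step
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                   # (batch, seq_len)
        states, _ = self.rnn(self.emb(token_ids))   # (batch, seq_len, 2 * hidden)
        weights = torch.softmax(self.attn(states), dim=1)
        pooled = (weights * states).sum(dim=1)      # attention-weighted average
        return self.head(pooled)

model = AttentionRNNClassifier()
logits = model(torch.randint(1, 100_000, (8, 50)))  # batch of 8 sequences, 50 tokens each
print(logits.shape)                                  # torch.Size([8, 20])
```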

Our solution to the word representation problem

Our approach to word representations for Russian in a nutshell

So, now you see that working with a morphologically rich language is a chore. By "working" I mean producing models that you are confident will generalize well to moderately unfamiliar situations.

But can't we combine the two best instruments we have, FastText and deep learning models? It turns out we can, and this can easily be done using, for example, an EmbeddingBag layer in PyTorch (see the sketch below). You could also build a small model with attention and a mixture of words / n-grams / chars, but it will most likely run slower than the low-level C++ implementation.
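Here is a minimal sketch of the idea (not our production code): an EmbeddingBag over character n-grams in 'mean' mode, with its weight matrix initialized from pre-trained FastText n-gram vectors. The tiny n-gram inventory and the random matrix below are stand-ins for the ~1m n-grams and the exported FastText vectors you would use in practice:

```python
# "Deep FastText" in miniature: an EmbeddingBag over character n-grams,
# initialized from pre-trained FastText n-gram vectors (stand-ins below).
import numpy as np
import torch
import torch.nn as nn

def char_ngrams(word, n_min=3, n_max=6):
    padded = f'<{word}>'
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

# Stand-in n-gram inventory; a real setup uses the ~1m most frequent n-grams.
ngram2idx = {}
for w in ['кот', 'котик', 'пересмотреть']:
    for g in char_ngrams(w):
        ngram2idx.setdefault(g, len(ngram2idx))

dim = 300
bag = nn.EmbeddingBag(len(ngram2idx), dim, mode='mean')
pretrained = np.random.randn(len(ngram2idx), dim).astype('float32')  # stand-in for FastText vectors
with torch.no_grad():
    bag.weight.copy_(torch.from_numpy(pretrained))

def words_to_bag_input(words):
    """Flatten words into known n-gram indices + offsets, as EmbeddingBag expects."""
    indices, offsets = [], []
    for word in words:
        offsets.append(len(indices))
        indices.extend(ngram2idx[g] for g in char_ngrams(word) if g in ngram2idx)
    return torch.tensor(indices), torch.tensor(offsets)

indices, offsets = words_to_bag_input(['котик', 'пересмотреть'])
word_vectors = bag(indices, offsets)  # (2, dim): one vector per word, no OOV possible
print(word_vectors.shape)
```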

This approach works fantastically with plain models like TCNs / RNNs / CNNs: it provides generalization and totally eliminates OOV cases. No more UNK tokens or similar clunky solutions!

But can we generalize this to a transformer model pre-trained on a large corpus of text?

There are only 3 fundamental problems with this approach:

  • Adding attention to an embedding bag layer is a problem not likely to be solved;
  • To guarantee generalization you need at least around 500k-1000k ngrams;
  • You face technical issues: language modeling labels are no longer usable for the LM task (you have to use the model's embeddings as ground truth), and standard DataParallel wrappers become slow in PyTorch because of the large embedding layer; to achieve true parallelism you probably have to use DistributedDataParallel;
  • You can get away with just running the embedding bag on one GPU and then reshaping the resulting bagged tensors, but with a multi-GPU setup it becomes much more difficult, because each token turns into a padded sequence of n-grams instead of a single id;
Typical vocabulary coverage (% of a word's n-grams covered) when your n-gram set is properly chosen. With 1m 2–6-grams you can cover even 100m-word dictionaries.
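In spirit, the coverage metric above is computed like this; the tiny n-gram set here is a toy stand-in for the real ~1m most frequent 2–6-grams selected from a corpus:

```python
# Per-word n-gram coverage against a fixed n-gram set (toy stand-in set below).
def char_ngrams(word, n_min=2, n_max=6):
    padded = f'<{word}>'
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def coverage(word, ngram_set):
    grams = char_ngrams(word)
    return sum(g in ngram_set for g in grams) / len(grams)

ngram_set = set(char_ngrams('пересмотреть')) | set(char_ngrams('посмотреть'))
print(round(coverage('рассмотреть', ngram_set), 2))  # partial coverage via shared morphemes
```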

Is Transformer / BERT / GPT better for you?

TLDR: unless you can use a pre-trained one, most likely not; that is, unless this is the only architecture powerful enough to handle your task.

Seq2seq benchmark

Please refer to this post.
TLDR: on a complicated sequence-to-sequence benchmark, the transformer-based models seem to have converged much more slowly.

Shall we wait one more week?

Generative pre-training done 10–100x faster?

We will not release our whole pre-training code (and, more importantly, the data scraping, pre-processing, and our Dataset, for obvious reasons), but you can just plug these classes into a huggingface model to:

  • Start with pre-trained BPE vectors;
  • Start with the old version of the embedding bag (which works better, lol);
  • Start with the padded version of the embedding bag (to easily use multi-GPU wrappers);

In a nutshell, the reported US$50–100k of computational resources required to pre-train a transformer seems totally unfeasible. Can we somehow make it faster for common folk like us?

(Me from the future: yeah, do not pre-train, just train with an embedding bag initialized with FastText!)

In purely American / Silicon Valley elitist fashion, papers like BERT do not run any poor man's ablation tests, such as: what would happen to convergence if you initialized your model with FastText?

But anyway, how can we speed up convergence? Let's assume that pre-training takes 400 days on one 1080 Ti and work from there:

  • Starting from pre-trained vectors / n-grams: maybe 10x faster?;
  • Not using a large softmax layer (even if it is tied to your embedding layer) but a cosine loss or something inspired by this (by the way, these guys also start from FastText): maybe 2x faster? See the sketch right after this list;
  • A lighter model: maybe 4x faster?;
  • Using an embedding bag layer that works well for the Russian language;
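Here is a rough sketch of the "no giant softmax" point from the list above: instead of predicting a token id over a huge vocabulary, the model predicts the embedding of the target word and is trained with a cosine loss against frozen FastText vectors. All shapes and the frozen target matrix are illustrative:

```python
# Replacing the vocabulary-sized softmax with a cosine loss against frozen
# FastText target vectors. Shapes and the target matrix are stand-ins.
import torch
import torch.nn as nn

dim, vocab = 300, 100_000
target_vectors = nn.Embedding(vocab, dim)      # frozen FastText word / n-gram vectors
target_vectors.weight.requires_grad_(False)    # used only as ground truth

project = nn.Linear(400, dim)                  # hidden state -> embedding space
cosine_loss = nn.CosineEmbeddingLoss()

hidden = torch.randn(32, 400)                  # transformer outputs at masked positions
target_ids = torch.randint(0, vocab, (32,))    # which words were masked

predicted = project(hidden)
loss = cosine_loss(predicted, target_vectors(target_ids), torch.ones(32))
loss.backward()
```

This removes the vocabulary-sized output layer entirely, which is where most of the softmax cost lives.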

All in all, with all of these "optimizations" (400 days / (10 × 2 × 4) is roughly 5 days), it seems feasible to pre-train / tune a transformer in a week or so. And it is real; the only problem is that the actual pre-trained model did not really seem to beat a model simply initialized with FastText.

Pre-training experiments

* We used a 2-GPU setup for each model, but in the end we found that the newer version of the embedding bag was roughly 25% slower, partly due to the large embedding bag size;
** Classification task from the BERT paper;

Other "failed" approaches we tested:

  • All models trained from scratch converged much more slowly and plateaued quickly;
  • All BPE-based models initialized with FastText converged much more slowly and quickly plateaued at around 65% sequential-task accuracy;
  • FastText + embedding freeze: minus 5pp of sequential-task accuracy;
(Charts: L2 embedding loss; cosine embedding loss.)

Actually trying out the pre-trained model

This was by far the most disappointing part of this whole exercise.

As mentioned in the intro, no sort of transformer (trained from scratch, pre-trained, or initialized from FastText) helped in our "easy" classification task on a complex domain (but FastText was the best).

On the challenging SberSQUAD task, we had the following results:

  • A FastText-initialized model trained with a high LR of 1e-3 reached about 37–40% EM. More could probably be achieved with LR decay. Remarkably, the model diverged frequently and seemed to "jump" on each restart;
  • When we tried the pre-trained model with a high LR of 1e-3, it trained much faster than the FastText one, but overfitted heavily;
  • If we started with a lower LR, somewhere around 5e-4, the pre-trained model also trained much faster than the FastText one, but overfitted at around 30% EM;

I suppose that if we invested 10x the resources into actually tuning the hyper-parameters, we would achieve a higher result. But you see, generative pre-training IS NOT A SILVER BULLET, especially for non-generative tasks.

On any SANE task, conventional RNNs / CNNs / TCNs blow transformers out of the water.

Top performance of the FastText-initialized transformer
Some comparisons
Low learning rate: pre-trained vs. FastText

Embedding bag code

Just use our code, stick it here and add water. No, we are not stupid, and we use version control.

Improvements or how to make our idea mainstream

People from OpenAI, Google and FAIR, if you are reading this, you can do the following:

  • Solve the attention problem within the embedding bag layer;
  • Add more compute to train a larger transformer with larger embedding bags;
  • Test such generative pre-training on other benchmarks for morphologically rich languages if you have them;
  • Invest time and effort into proper sub-word splitting techniques and pass bags corresponding to different kinds of sub-words separately;

References

Popular tools

Corpora / datasets / benchmarks in Russian:

Simple sentence embedding baselines:

Word embeddings explained:

Original word embedding papers:

  • Word2Vec: Distributed Representations of Words and Phrases and their Compositionality;
  • FastText: Enriching Word Vectors with Subword Information.

Illustrated state-of-the-art NLP models and approaches:

Other links

Originally published at spark-in.me on March 1, 2019.
