A Pre-trained Model to Restore Punctuation and Capital Letters in 4 Languages

4 min readOct 6, 2021

Originally published at https://habr.com on October 6, 2021.

Working with speech recognition models we often encounter misconceptions among potential customers and users (mostly related to the fact that people have a hard time distinguishing substance over form). People also tend to believe that punctuation marks and spaces are somehow obviously present in spoken speech, when in fact real spoken speech and written speech are entirely different beasts.

Of course you can just start each sentence with a capital letter and put a full stop at the end. But it is preferable to have some relatively simple and universal solution for “restoring” punctuation marks and capital letters in sentences that our speech recognition system generates. And it would be really nice if such a system worked with any texts in general.

For this reason, we would like to share a system that:

Inserts capital letters and basic punctuation marks (dot, comma, hyphen, question mark, exclamation mark, dash for Russian);
Works for 4 languages (Russian, English, German, Spanish) and can be extended;
By design is domain agnostic and is not based on any hard-coded rules;
Has non-trivial metrics and succeeds in the task of improving text readability;

To reiterate — the purpose of such a system is only to improve the readability of the text. It does not add information to the text that did not originally exist.

The Problem and The Solution

Let’s assume that the input is a sentence in lowercase letters without any punctuation, i.e. similar to the outputs of any speech recognition systems. We ideally need a model that makes texts proper, i.e. restores capital letters and punctuation marks. A set of punctuation marks .,-!?- was chosen based on characters most used on average.

Also we embraced the assumption that the model should insert only one symbol after each token (a punctuation mark or a space). This automatically rules out complex punctuation cases (i.e. direct speech). This is an intentional simplification, because the main task is to improve readability as opposed to achieving grammatically ideal text.

The solution had to be universal enough to support several key languages. By design we can easily extend our system to an arbitrary number of languages should need arise.

Initially we envisaged the solution to be small BERT-like model with some classifiers on top. We used internal text corpora for training.

We acquainted ourselves with a solution of a similar problem from here. However, for our purposes we needed:

A lighter model with a more general specialization;
An implementation that does not directly use extrnal APIs and a lot of third-party libraries;

As a result, our solution mostly depends only on PyTorch.

Model Compression

After extensive experiments we eventually settled on the simplest possible architecture, and the final weight of the model was still 520 megabytes.

So we tried to compress the model. The simplest option is of course quantization (particularly a combination of static and dynamic as described here).

As a result, the model was compressed to around 130 megabytes without significant loss of quality. Also we reduced the redundant vocabulary by throwing out tokens for non-used languages. This allowed us to compress the embedding from 120,000 tokens to 75,000 tokens.

Provided that at that moment the model was smaller than 100 megabytes, we decided against investing time in more sophisticated compression techniques (i.e. dropping less used tokens or model factorization). All of the metrics below are calculated with this small quantized model.

Metrics

Contrary to the popular trends we aim to provide as detailed, informative and honest metrics as possible. In this particular case, we used the following datasets for validation:

Validation subsets of our private text corpora (5,000 sentences per language);
Audiobooks, we use the caito dataset, which has texts in all the languages the model was trained on (20,000 random sentences for each language);

We use the following metrics:

WER (word error rate) as a percentage: separately calculated for repunctuation WER_p (both sentences are transformed to lowercase) and for recapitalization WER_c (here we throw out all punctuation marks);
Precision / recall / F1 to check the quality of classification (i) between the space and the punctuation marks mentioned above .,-!?-, and (ii) for the restoration of capital letters - between classes a token of lowercase letters / a token starts with a capital / a token of all caps. Also we provide confusion matrices for visualization;

Results

For the correct and informative metrics calculation, the following transformations were applied to the texts beforehand:

Punctuation characters other than .,-!?- were removed;
Punctuation at the beginning of a sentence was removed;
In case of multiple consecutive punctuation marks we keep only the first one;
For Spanish ¿¡ were discarded from the model predictions, because they aren't in the texts of the books, but in general the model places them as well;

WER

WER_p / WER_c are specified in the cells below. The baseline metrics are calculated for a naive algorithm that starts the sentence with a capital letter and ends it with a full stop.

Examples

Try it Yourself

The model is published in the repository silero-models. And here is a simple snippet for model usage (more detailed examples can be found in colab):

Limitations and Future Work

We had to put a full stop somewhere (pun intended), so the following ideas were left for future work:

Support inputs consisting of several sentences;
Try model factorization and pruning (i.e. attention head pruning);
Add some relevant meta-data from the spoken utterances, i.e. pauses or intonations (or any other embedding);