# Speeding up word distance calculation 100x

Originally published at spark-in.me on August 12, 2018.

A standard Word2Vec illustration from a seminal paper (I will do a paper review on the topic some time in the future)

Using some form of word vectors / embeddings is considered nowadays to be one ofstandard approaches for practical NLP related tasks.

In my opinion, the most useful in practice are these 2 approaches:

• Using fast-text (notice there is a python version now) vectors (expecially for “complicated” languages with cases);
• Using end-to-end CNN-based approaches and maybe using CNN’s embeddings to get vectors (e.g. consider this transformer tutorial);

In this article we will just assume that you have your word vectors and you need to use them for some downstream task.

# So what is the fuss all about?

Imagine that you have a reasonably large dataset, for example with 1–10m queries / phrases / sentences. Ideally, to provide value for business / applied tasks, you need to:

• Investigate the data structure;
• Build a streamlined annotation process (data exploration => annotation => models => distillation => even more annotation => better models etc);
• Be able to match internal datasets with external ones;

In practice matching usually implies calculating some kind of distance in the most simplistic case, e.g. cosine disntance Well, 1–10m * 1–10m equals a LOT of operations.

In real applied tasks you may have little to no annotation. Hence you will have to be creative and do something of this sort:

• Find some annotation;
• Enrich it with some external sources;
• Manually check / correct it;
• Repeat;

An the problem is — calculating 10⁶ * 10⁶ distances is very slow.

# The solution

In a nutshell — understand the internal structure of your data and exploit some optimization tricks:

1. 4x speed-up — cluster your data and dispose of the rubbish clusters;
2. 10x speed-up — use some form of multiprocessing / multiple cores for calculation (or maybe do them in batch form on the GPU);
3. 5x speed-up — use numba to compile your distance function into C;
4. 10x speed up — if you do even more proper clustering, you can dispose of very distant clusters that are most likely not relevant;

In on applied case I manage to speed up my calculation from 27 days to 4–5 hours.

## Clustering

In my view, UMAP + HDBSCAN work reasobably well for word vectors. UMAP even has a cosine distance option (though everything is not 100% smooth there).

Also HDBSCAN has a great perk — it produces a -1 “rubbish” cluster. In practice you may end up with something like this, which has reasobably good local structure

So, using UMAP + HDBSCAN you can remove ~50% of rubbish data. But you can also cluster your dataset into a reasobable amount of clusters (notice — you do not set number of clusters as a parameter of HDBSCAN, which arguably is its best advantage) — 5k — 10k. Then you can just calculate distances between your target phrases and these clusters and throw away the distant ones.

In plain terms — 10³ x 10⁶ << 10⁶ x 10⁶. Also cosine distance on such large scale is usually normally distributed, so you can get away with keeping all relations with distance being smaller than mean — standard deviation.

## Numba snippet that worked for me

`@numba.jit(target='cpu', nopython=True) def fast_cosine(u, v): m = u.shape udotv = 0 u_norm = 0 v_norm = 0 for i in range(m): if (np.isnan(u[i])) or (np.isnan(v[i])): continue udotv += u[i] * v[i] u_norm += u[i] * u[i] v_norm += v[i] * v[i] u_norm = np.sqrt(u_norm) v_norm = np.sqrt(v_norm) if (u_norm == 0) or (v_norm == 0): ratio = 1.0 else: ratio = udotv / (u_norm * v_norm) return 1-ratio`

borrowed from here

Originally published at spark-in.me on August 12, 2018.

Data Scientist