Calmcode - embeddings: language is hard

Similarity between embeddings is hard to interpret, especially when you confuse it with word meaning.


Natural language is hard, so it helps to think about which clusters we might expect when we train word embeddings. Let's consider a few tricky aspects of mapping words to a numerical space.

Fast and Slow

The word "fast" is similar to the word "slow" in the sense that both words describe the speed of things. When you consider how both words are used in sentences, you can also see that they tend to be surrounded by similar words (a car can be fast, a car can be slow, etc.).

But there's also an awkward thing: fast and slow are words with opposite meanings. So we need to accept that similarity in numeric space does not imply that two words have the same meaning.
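To make this concrete, here is a minimal sketch that computes cosine similarity on toy vectors. The context words and the co-occurrence counts are invented for illustration and do not come from a trained model. The point is only that words used in the same contexts end up close together, whether or not they mean the same thing.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical co-occurrence counts with the context words
# ["car", "train", "runner", "fruit", "peel"]. The numbers are made up.
context_counts = {
    "fast":   [9, 7, 8, 0, 0],
    "slow":   [8, 8, 6, 0, 0],
    "banana": [0, 0, 0, 9, 7],
}

sim_fast_slow = cosine_similarity(context_counts["fast"], context_counts["slow"])
sim_fast_banana = cosine_similarity(context_counts["fast"], context_counts["banana"])

print(f"fast vs slow:   {sim_fast_slow:.3f}")
print(f"fast vs banana: {sim_fast_banana:.3f}")
```

In this toy setup "fast" and "slow" score as highly similar because they share contexts, even though they are antonyms.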

What is "Brussels"?

Similarly, what do you make of "Brussels"? Is that a city? Or might it refer to political decision making in the European Union?

Depending on how the word is used, it could mean either thing! So how might that influence its position in numeric space? Would that word be part of two clusters? Or none?
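One way to build intuition for this question is a small sketch in two dimensions. It assumes, as a simplification, that a word with a single vector lands somewhere between the contexts it appears in. The two cluster directions and the 50/50 usage split are hypothetical values chosen for illustration, not measurements.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mix(a, b, weight_a):
    """Weighted average of two vectors; weight_a is the share given to a."""
    return [weight_a * x + (1 - weight_a) * y for x, y in zip(a, b)]


# Hypothetical directions for two clusters (invented, not learned).
city_direction = [1.0, 0.0]
politics_direction = [0.0, 1.0]

# Assumption: "Brussels" is used half the time as a city and half the time
# as shorthand for EU politics, so its single vector is an even mix.
brussels = mix(city_direction, politics_direction, 0.5)

sim_city = cosine_similarity(brussels, city_direction)
sim_politics = cosine_similarity(brussels, politics_direction)

print(f"Brussels vs city cluster:     {sim_city:.3f}")
print(f"Brussels vs politics cluster: {sim_politics:.3f}")
```

Under these assumptions the word sits halfway between both clusters and is not a clean member of either, which is one reason a single vector per word can be awkward.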

Beware bias

Another awkward side effect is that the numeric representation of words might encode bias. Just consider words like "he", "she", "doctor" and "nurse". Statistically, when you train on Wikipedia, you might have a corpus that associates "he" with "doctor" and "she" with "nurse". But that's a pattern that might be harmful. There's no reason why a man couldn't be a nurse or why a woman couldn't be a doctor. But if you end up with a numeric representation where the distance between words implies predictive power ... you should remain mindful.
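A rough way to check for this kind of pattern is to compare how close a word sits to "he" versus "she". The sketch below uses invented two-dimensional vectors that deliberately contain the bias described above; `toy_vectors` and `gender_lean` are hypothetical names for this example, and the numbers do not come from a real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented vectors that mimic a biased corpus; not taken from a trained model.
toy_vectors = {
    "he":     [1.0, 0.2],
    "she":    [-1.0, 0.2],
    "doctor": [0.4, 1.0],
    "nurse":  [-0.5, 0.9],
}


def gender_lean(word):
    """Positive: closer to 'he'. Negative: closer to 'she'."""
    vector = toy_vectors[word]
    to_he = cosine_similarity(vector, toy_vectors["he"])
    to_she = cosine_similarity(vector, toy_vectors["she"])
    return to_he - to_she


for word in ["doctor", "nurse"]:
    print(f"{word}: {gender_lean(word):+.3f}")
```

With real embeddings you would load the trained vectors instead of the toy dictionary. A lean far from zero for an occupation word is a hint that the corpus associations have leaked into the representation.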