Embeddings
An interactive tutorial
In this chapter, we’ll build a semantic search engine. We’ll use distributional semantics to represent words as embedding vectors. Then we’ll use vector arithmetic to represent all text as vectors. Finally, we’ll use normalization and cosine similarity to compare those embeddings. Along the way, you’ll learn the amazing power of vectors, and even how to calculate spider − eight + six.
Searching the Congo
Ahh, another day working at Congo.com, the world’s largest online marketplace. This morning, your boss is complaining: lots of users’ product searches have no results. Here are some books on the warehouse shelves:
Grease, Axel. The Complete Idiot’s Guide to Auto Repair. 3rd ed.
Shaft, Camilla. Truckin’ Hell: An Intro to Pickup Upkeep. 2019.
Stands, Jack. Truckers’ Secret Handbook. Revised ed.
...
And yet, if you search for “beginner truck maintenance” – nada. As the resident data scientist, you’ve been tasked with fixing the problem. Here’s the current search algorithm:
query_words = query.lower().split()

def is_match(product):
    # Keep only products whose description contains every query word.
    return all(word in product.description.lower()
               for word in query_words)

results = list(filter(is_match, products))
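To see the failure concretely, here’s a minimal sketch you could run; the Product type and the use of the book titles as descriptions are assumptions for illustration:

from collections import namedtuple

# Hypothetical stand-in for Congo.com's product records.
Product = namedtuple("Product", ["description"])

products = [
    Product("The Complete Idiot's Guide to Auto Repair"),
    Product("Truckin' Hell: An Intro to Pickup Upkeep"),
    Product("Truckers' Secret Handbook"),
]

query = "beginner truck maintenance"
query_words = query.lower().split()

def is_match(product):
    return all(word in product.description.lower()
               for word in query_words)

print(list(filter(is_match, products)))  # prints [] -- no results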
Why does the search for beginner truck maintenance return no results?
The current search is doing pure keyword search: it demands that the description contain every query word, including the word maintenance, but what if it instead contains a synonym like repair or upkeep?
We don’t just need a keyword search; we need a semantic search. Which of these would be most useful to build a simple semantic search engine?
Well, kind of. A dictionary shows the definition of each word. If we looked up the word “maintenance”, we might see the definition “keeping something in a working state”. But to find all the other words to search for, we’d have to read through the entire dictionary to find every other word with a similar definition.
No, a more efficient choice is a thesaurus. When we have a word to search for, we can look it up in the thesaurus, then search for all of its synonyms.
Right! When we have a word to search for, we can look it up in the thesaurus, then search for all of its synonyms.
I’m not sure a rhyming dictionary would be useful here. A rhyming dictionary shows words that rhyme, e.g. given “maintenance”, it might list “flatulence” or “sentence”. Rhyming doesn’t correlate with semantic relationship.
No, we need a thesaurus. When we have a word to search for, we can look it up in the thesaurus, then search for all of its synonyms.
Unfortunately, the Merriam-Webster is hopelessly out of date, and Congo.com has books in 1,000+ languages on every conceivable topic. No, to build our semantic search, we first need to build our own thesaurus.
Distributional semantics
Indeed, it seems hard to get a computer to measure the meaning of a given word. But perhaps we can instead get the computer to measure whether two given words have similar meaning.
Let’s play a game. I’ve picked two secret words W1 and W2. Your job is to guess whether W1 and W2 have the same meaning. To help make your guess, you can choose just one of the following clues:
Well ... okay, the length of W1 is 5 characters, and the length of W2 is also 5 characters. But I suspect this clue doesn’t help you much. I think the most useful clue is the common words that follow W1 and W2.
Okay, that sounds useful!
Well ... okay, W1 is 0.002757% of all words, and W2 is 0.008093% of all words. But I suspect this clue doesn’t help you much. I think the most useful clue is the common words that follow W1 and W2.
Well ... okay, the first two letters of W1 are be, and the first two letters of W2 are st. But I suspect this clue doesn’t help you much. I think the most useful clue is the common words that follow W1 and W2.
To see how this works, we’ll work with this 1GB sample of text from Wikipedia. Data scientists call big collections like this a corpus. In this corpus:
Ten words that often follow W1 are again, airing, construction, operation, play, playing, production, until, work, and working.
Ten words that often follow W2 are again, date, off, out, playing, point, program, until, up, and working.
Now, do you think W1 and W2 are close in meaning?
Good guess. The words were W1 = begin and W2 = start. I’d say they’re close in meaning! And this is why they share four following-words.
The words were W1 = begin and W2 = start. I’d say begin and start are close in meaning. And this is why they share these four following-words: again, playing, until, and working.
It seems reasonable to say: if two words are close in meaning, it causes them to have similar following-words. But does this work in reverse? If two words have similar context-words, does that cause them to be close in meaning?
Here’s a thought experiment. Suppose you start a new job. Your co-workers keep using the strange word grond. “Work gronds at 9am. When it finishes, we’ll grond partying.” Over time, you discover that the word grond has exactly the same distribution of context-words as the word start. Does this mean the word grond means the same as the word start?
Then you agree with what’s called the distributional hypothesis. This hypothesis says that “a word is characterized by the company it keeps.”
Interesting! Then you disagree with what’s called the distributional hypothesis. This hypothesis says that “a word is characterized by the company it keeps.”
There’s evidence that children acquire much of their vocabulary using this method. According to the distributional hypothesis, the list of words surrounding begin is a sort of “fingerprint” of the word begin. A fingerprint that the computer can measure, and perhaps use to acquire language like children do!
What does the fingerprint measure?
Let’s see how far the distributional hypothesis goes, by playing the word-guessing game once more. I’ve got two new words, W3 and W4.
Ten words often following W3 are 1960s, 1970s, 1980s, 1990s, 19th, 2000s, 20th, days, medieval, and years.
Ten words often following W4 are 18th, 1940s, 1950s, 1960s, 1970s, 1980s, 1990s, 19th, cretaceous, and night.
Do you think W3 and W4 are close in meaning?
The words were W3 = early and W4 = late. Would you still say these are close in meaning?
Yes, in a sense, the words early and late are close in meaning! They are antonyms.
Really, it depends on what we mean by “close”. The words early and late are antonyms.
The words were W3 = early and W4 = late. They are antonyms.
We commonly say antonyms are “opposite in meaning”. Yet they share many concepts, and therefore have similar word-contexts. For example, the words early and late share concepts like time, comparisons, and events, and therefore include similar following-words, like 1960s.
So our “word fingerprint” idea is not doomed. It’s just that these fingerprints don’t quite measure synonymousness. They instead measure something more like number of shared concepts. But that’s exactly what we want for our task of finding products relevant to a search query!
Word fingerprints are lists of numbers
To build our thesaurus, we’ll first take our big Wikipedia text corpus, and count up which words follow each other. We end up with a big table like this:
We’ve considered the most frequent 10,000 words in the corpus. The table above is a sample from that giant table. It shows four current words, truck, car, build, and manufacture, and two following words, the and is. How many times in the corpus is the word truck followed by the word is?
No, truck is followed by is 70 times. It’s followed by the 86 times.
It’s shown in the top-right corner. Look for the cell where truck is on its left, and is above it. truck is followed by is 70 times.
Above, we said that a word’s “fingerprint” is a list of words that commonly follow it. In the full table, how big is one word’s fingerprint?
The full table has 10,000 columns, so a single word fingerprint is actually a list of 10,000 numbers: one for each possible word that follows it.
No, a single word fingerprint will be a list of 10,000 numbers: one for each possible word that follows it.
That’s pretty big! To keep things manageable for now, let’s just consider two of those following-words, the and is. What is the word fingerprint of the word truck?
No, the word fingerprint of truck is [86, 70]. It’s followed by the 86 times, and is 70 times. The fingerprint [86, 435, 759, 115] describes the previous-words of the word the. We won’t consider those fingerprints here.
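Here’s a minimal sketch of how a following-word table like the one above could be counted from a plain-text corpus; the file name and the exact tokenization are assumptions:

from collections import Counter, defaultdict

# Read the corpus and split it into lowercase word tokens (a simplification).
with open("wikipedia_sample.txt") as f:  # assumed corpus file name
    tokens = f.read().lower().split()

# Keep only the 10,000 most frequent words.
vocab = {word for word, _ in Counter(tokens).most_common(10_000)}

# counts[current][following] = how often `following` appears right after `current`.
counts = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    if current in vocab and following in vocab:
        counts[current][following] += 1

# A word's "fingerprint" is its row of 10,000 following-word counts.
def fingerprint(word):
    return [counts[word][w] for w in sorted(vocab)]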
Vectors are points in space 🪐
Suppose we’re searching in a document for words related to truck. Use your intuition: which of the following words is most related to the word truck?
Yes, that seems pretty close in shared concepts.
Interesting. I’d say the word car is most related. If I search for the word truck, I’d want to see results with car before results with the word build.
Interesting. I’d say the word car is most related. If I search for the word truck, I’d want to see results with car before results with the word manufacture.
Let’s see if we can measure that. How could we measure how related any two words are, as a number?
First, some terminology. A vector is a list of numbers. A word fingerprint like [86, 70] is an example of a vector. When we use the word “vector”, it’s usually because we’re visualizing the number-list as the coordinates of a point in space. For example, we can visualize our word fingerprints as points in a 2D space:
These words are embedded in a 2D space. For this reason, word vectors are usually called embeddings.
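If you want to reproduce a plot like this, here’s a minimal matplotlib sketch using the two fingerprints we’ve seen exactly (truck and build):

import matplotlib.pyplot as plt

# The two fingerprints given above: [count of "the", count of "is"].
embeddings = {"truck": [86, 70], "build": [759, 27]}

for word, (x, y) in embeddings.items():
    plt.scatter(x, y)
    plt.annotate(word, (x, y))

plt.xlabel('times followed by "the"')
plt.ylabel('times followed by "is"')
plt.show()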
Consider embedding C in the plot. Which word does C represent?
No, it’s the word build, which has vector [759, 27], and so is plotted at x=759 and y=27.
No, it’s the word build, which has vector [759, 27], and so is plotted at x=759 and y=27.
Right, it’s the word build, which has vector [759, 27], and so is plotted at x=759 and y=27.
No, it’s the word build, which has vector [759, 27], and so is plotted at x=759 and y=27.
Euclidean distance 📐
A key benefit of thinking about points in space instead of lists of numbers is that we can intuitively measure distances between points. Here’s Euclid’s formula for measuring the distance between two points in 2D space:
We can use the Euclidean distance formula to find the word with the closest embedding to truck. We’re effectively measuring the lengths of these lines:
Using this formula, which word embedding is closest to the word truck?
That might be what we want, but by the formula, the closest word is actually manufacture! Look at the lengths of the dotted lines.
No, by the formula, the closest word is actually manufacture. Look at the lengths of the dotted lines.
Right! That’s odd, isn’t it?
Earlier, we said the closest word to truck should be car. But our distance formula says the closest word embedding is manufacture. Our algorithm is not aligned with our desires.
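In code, the Euclidean distance is just the square root of the summed squared coordinate differences. A minimal sketch, using the two fingerprints we know exactly:

import math

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

truck = [86, 70]   # fingerprint from the table above
build = [759, 27]  # fingerprint from the table above

print(euclidean_distance(truck, build))  # ≈ 674.4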
To “fix” this, which of the following could we change?
Either of these can work! Later we’ll see a solution that changes how we measure distance. But first, let’s see a solution that changes the points of the words on the plot.
Normalization 🏐
What does our space of the vs. is actually represent? Suppose I pick a new word W5, and show you its word vector, [c_the, c_is] = [13, 123]. What does this vector tell you about the word W5?
Right!
That’s tempting, but our 2D space doesn’t have enough information for us to say that W5 is related to trucks. But the vector [13, 123] can tell us that W5 is a noun, and here’s why.
The number 13 is just the count of the times the followed W5 in the corpus. The vector [13, 123] can actually tell us that W5 is a noun, and here’s why.
The word is commonly follows nouns, as in “the truck is red”. The word the commonly follows verbs, as in “manufacture the truck”. The unknown word is mostly followed by is, so it’s probably a noun. It turns out our plot is mostly a plot of nouns vs. verbs:
Let’s make this more precise. Say there’s some word embedding [c_the, c_is]. Which of the following values would be most helpful to determine whether that word is a noun?
The ratio c_is / c_the is the most helpful. Nouns will have a high value, significantly above 1.0.
That ratio c_is / c_the is the slope of the line from the origin to the point, which we can visualize like this:
Above, instead of plotting each word as a point, each word is shown as an arrow. This is yet another meaning of the word vector – an arrow that has two things:
A direction: where it “points”.
A magnitude: how long it is.
Given a word vector in our 2D space, the direction of that vector measures the word’s noun-ness. But what does its magnitude measure?
No, the word’s length in characters is not clearly related to the magnitude/length of its word vector. See for example the word car, which is only 3 characters long, but whose vector is pretty long (i.e., has a large magnitude).
The word vector’s magnitude here represents how frequently the word is used.
I’m not sure what “importance” would mean here. No, the word vector’s magnitude here actually represents how frequently the word is used.
Right!
So, when we looked for the closest word to truck, why did we get manufacture rather than car?
No, manufacture and truck have similar frequency. But the word car is much more frequent than either. This is why the word car is far away from the other two.
Right, the words car and truck were pulled away from each other due to their differing usage frequency.
For our purposes, we don’t care about word usage frequency. One way we can fix this is to set all magnitudes to 1. This is called normalization, and it looks like this:
Earlier, when searching for words related to truck, the Euclidean distance gave us the word manufacture. After we’ve normalized the vectors, what’s the closest word to truck, using the Euclidean distance?
We’ve made good progress!
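Normalization just divides each coordinate by the vector’s magnitude (its Euclidean length), so the result has length 1. A minimal sketch:

import math

def normalize(v):
    # Divide each coordinate by the vector's length, so the result has magnitude 1.
    magnitude = math.sqrt(sum(x ** 2 for x in v))
    return [x / magnitude for x in v]

print(normalize([86, 70]))   # truck -> ≈ [0.78, 0.63]
print(normalize([759, 27]))  # build -> ≈ [1.00, 0.04]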
Angles 📐
Notice how all the arrow-heads of the normalized vectors lie on the unit circle. There’s a natural way to talk about the position of a point on a circle: the angle. Let’s measure the angle between the word truck and each of the other words.
Which word has the closest angle to truck?
Earlier, we saw the Euclidean distance. We’ve found a different kind of distance: interpret the points as vectors, then take the angle between them. This is called the angular distance between two vectors. In our plot of the vs. is, what is the largest possible angular distance between two word vectors?
This one is tricky! If the vectors could point in any direction, the largest possible angular distance is 180°. But all the values in our word vectors are positive counts, so they all point up and to the right. So the largest possible angular distance is only 90°.
In data science, we don’t often use the angular distance. Instead, we use something called cosine similarity, written as S(a, b). This is the cosine of the angle between the two vectors.
The cosine similarity between two normalized vectors has a formula that’s easy to calculate, called the dot product:
Suppose the normalized vector for the word phone is p = [0.6, 0.8]. What is S(p, p), i.e., the cosine similarity between p and itself? (Hint: cos(0°) = 1.)
Right, it’s (0.6)² + (0.8)² = 0.36 + 0.64 = 1.
It’s actually 1, which we can calculate as:
It’s no coincidence that this is 1. If two vectors point in the same direction, their cosine similarity is 1. If they point in opposite directions, their cosine similarity is −1. And if they are orthogonal, their cosine similarity is 0. Which of the following has the largest value?
The vector for truck points in the same direction as itself, so the cosine similarity is 1. The vector for car points in a slightly different direction, so the cosine similarity is a little less: 0.998.
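In code, cosine similarity is the dot product of the two normalized vectors. A minimal, self-contained sketch:

import math

def normalize(v):
    magnitude = math.sqrt(sum(x ** 2 for x in v))
    return [x / magnitude for x in v]

def cosine_similarity(a, b):
    # Dot product of the normalized vectors = cosine of the angle between them.
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

phone = [0.6, 0.8]
print(cosine_similarity(phone, phone))         # ≈ 1.0
print(cosine_similarity([86, 70], [759, 27]))  # truck vs. build ≈ 0.80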
Word arithmetic 🧮
As a brief interlude, here’s a word puzzle: actor - man + woman ≈ ?
Considered purely as a word puzzle, the answer is actress.
But it turns out that, when we use our full 10,000-dimensional word vectors, this crazy arithmetic actually works! When we replace the words with their word vectors and do the arithmetic, we get a vector which is very close to the vector for actress.
Specifically, we’re using vector addition, which is defined as:
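In coordinates, addition is component-wise: [a1, a2] + [b1, b2] = [a1 + b1, a2 + b2]. A minimal sketch:

def add(a, b):
    # Add corresponding coordinates.
    return [x + y for x, y in zip(a, b)]

print(add([1, 2], [3, 4]))  # [4, 6]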
Try it yourself: what is [−1, 2] + [1, 3]?
Here’s another: spider - eight + six ≈ ?
Why, it’s ant of course! By the logic of word vectors, an ant is a six-legged spider.
Sentence vectors 🗣
Alright, let’s get back to it. We’d found a way to improve on keyword search: use our word vectors to search for words similar to the query words, then include those in our keyword search.
That’s a valid approach, but let’s go all-in on vectors! You’ve seen that we can encode words as vectors. But, considering our new skill of adding vectors, what else could we encode as vectors?
All of these are possible! You can encode almost anything as a vector! For our purposes, it will be useful to encode sentences and queries. Then we could find the most relevant sentences to our query using the same technology we’ve developed above: cosine similarity between a query vector and a product vector.
But how can we get a vector for a sentence or for the query? One way is just to add together all the word vectors in the sentence. Consider the following two sentences:
S1: These terms will be modified frequently without notice.
S2: Notice modified terms will be frequently without these.
Will these sentences get different sentence vectors?
Because addition is commutative, these sentences will get the same sentence vector. Even though this approach is naïve, addition can still give us a useful summary of a text! Using this technique, we end up with this search engine algorithm:
word_vectors = {
    "truck": [86, 70, ..., 42],  # 10,000 dimensions
    "car":   [92, 65, ..., 38],  # 10,000 dimensions
    # ... 9,997 more words ...
}

def embed(sentence):
    words = sentence.lower().split()
    # Skip words outside the 10,000-word vocabulary.
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    # Add the word vectors element-wise to get one sentence vector.
    return [sum(components) for components in zip(*vectors)]

query_vector = embed(query)

def product_rank(product):
    product_vector = embed(product.description)
    # Cosine similarity S(a, b), as defined above.
    return cosine_similarity(query_vector, product_vector)

# Highest similarity first.
results = sorted(products, key=product_rank, reverse=True)
Wrapping up
In this chapter, we built a semantic search engine. Our goal was to find products that are most similar to a query string. We did that by converting all the text to vectors, called embeddings. To find word embeddings, we started with a large corpus, and characterized each word by the distribution of words that follow it. To find embeddings for queries and products, we just added up all their word vectors. To compare those vectors, we settled on normalization plus cosine similarity.
End-notes
The approach we took here is a simplified version of word2vec, a classic algorithm for getting word vectors.
To build the vector for a word, we considered just the immediately following words in the corpus. But word2vec considers more “context words”. For example, the two words after the current word, and the two words before it.
We built word vectors with 10,000 dimensions. Real word embeddings are typically smaller, perhaps ~300 dimensions. This can be achieved by dimensionality reduction. word2vec uses a neural network for this, although there are many other methods.
To get sentence embeddings, we just added up the words’ embeddings. This doesn’t allow words to change their meaning based on context. We could have used an “attention” mechanism for this.
Next in Everyday Data Science:
An Everyday Look At Synthetic Data 🆕
Or, Grow Your Own
The word ‘data’ derives from the Latin for ‘something given to you’. But what if you don’t have friends to give you data for your birthday? Why, you can just create some new data yourself, of course!
In this chapter, we’ll use simulation to train a model without ever dirtying our hands in the real world. It’s just like the Construct in The Matrix.