On Similarity Search
A Primer on the Math that Powers SOTA Language Models like GPT-4, Gemini, and Claude
Introduction
A common use case for many applications is similarity search. This is the task of finding items that are similar to a given query item and displaying them (generally in descending order of relevance*).
*Relevance is a measure of how similar the items are to the query item. For example, in LinkedIn’s search engine, if you wanted to find jobs for “software engineer”, the most relevant results would be those that contain the term “software engineer” BUT ALSO those that contain related terms such as “developer”, “programmer”, “software developer”, etc.
While feedback-based mechanisms are the secret sauce behind sophisticated recommender systems like TikTok and Instagram, similarity is a fundamental component that makes them work.
What are Embeddings?
So how do we know if two different words (or sentences, images, audio, video, etc.) are similar? One common approach to similarity search is to use embeddings.
An embedding might look something like this: [0.1, 0.5, -0.3, 0.8, …], a list of n numbers, where n is the number of dimensions.
Embeddings are high-dimensional vectors (basically a long array or list of floating-point numbers) that represent the meaning of a piece of text. These vectors are obtained from pre-trained models that have been trained on large corpora of text data.
Transformers
Transformer-based models are the current SOTA for text embeddings. Many of these models are trained to predict the next token in a sequence given the previous ones (others, like BERT, learn by filling in masked tokens). They were first proposed by Vaswani et al., 2017, popularized by the BERT model (Devlin et al., 2018), and are the basis for many other models such as GPT-4. The weights (parameters) are learned by passing byte-pair encoded integers, aka tokens, through a series of linear layers and non-linear activation functions and repeatedly adjusting those weights to reduce prediction error. We call this the training process.
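To make the tokenization step concrete, here is a minimal sketch using Hugging Face's transformers library with the GPT-2 tokenizer. The choice of GPT-2 is purely illustrative; the exact splits depend on whichever BPE vocabulary the tokenizer learned.

```python
# A minimal sketch of byte-pair encoding: text in, integer token ids out.
# GPT-2's tokenizer is used here only as an illustrative BPE example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 uses a BPE vocabulary

text = "Similarity search powers modern language models."
token_ids = tokenizer.encode(text)                    # the integers the model actually sees
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # the corresponding subword strings

print(token_ids)  # a list of integer token ids
print(tokens)     # a list of subwords that map back to those ids
```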
The pre-trained weights (or parameters) of the model are then used to obtain the embeddings. These correspond to the last hidden state of the model, just before the final softmax layer, which produces the probability distribution over the vocabulary and lets us predict a sequence of tokens given an input sequence (where tokens can be mapped back to words or subwords in the vocabulary).
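As a rough sketch of what that looks like in code, here is one way to pull a sentence embedding out of a pre-trained model's last hidden state with the transformers library. The mean-pooling strategy and the choice of thenlper/gte-small (discussed later in the Inference section) are illustrative assumptions, not the only options.

```python
# A sketch of extracting a sentence embedding from a transformer's
# last hidden state via mean pooling.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "thenlper/gte-small"  # illustrative choice, covered later
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

batch = tokenizer(
    ["How do I get downtown from Logan Airport?"],
    padding=True, truncation=True, return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**batch)

# outputs.last_hidden_state has shape (batch, sequence_length, hidden_size).
# Averaging over the sequence dimension gives one vector per input text
# (a production implementation would mask padding tokens before averaging).
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # e.g. torch.Size([1, 384]) for gte-small
```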
Mathematically
To be more precise, the model is a function that takes a sequence of tokens as input and returns a sequence of embeddings as output. To better understand the math behind the transformer model, let’s take a look at the key components of the model:
Softmax
We’ll start at the end. Softmax represents the final layer of the neural network which converts logits into probabilities. Softmax is super flexible and can be used in a variety of tasks, from simple classification to the complex decision-making needed in transformer models for language understanding and generation.
softmax(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j}
where:
- z_i is the input to the i-th unit,
- K is the number of units in the layer,
- e is the base of the natural logarithm.
In the training phase, softmax plays nicely with backpropagation — a method used to train neural networks — by providing smooth gradients. This means it helps in adjusting the model’s parameters more efficiently, leading to better learning. Softmax outputs are often paired with a cross-entropy loss function during training, which measures the difference between the predicted probabilities and the actual distribution. This combo is like peanut butter and jelly for training classification models — it just works really well.
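Here is a minimal NumPy sketch of the softmax formula above and its usual pairing with cross-entropy loss. The logits are made-up values for illustration.

```python
# A small NumPy sketch of softmax and the cross-entropy loss it is
# typically paired with during training.
import numpy as np

def softmax(z):
    """Convert a vector of logits z into probabilities that sum to 1."""
    z = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -np.log(probs[target_index])

logits = np.array([2.0, 1.0, 0.1])   # raw scores from the final linear layer
probs = softmax(logits)

print(probs, probs.sum())            # probabilities summing to 1.0
print(cross_entropy(probs, 0))       # small loss when the true class gets high probability
```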
(Scaled Dot-Product) Attention
The key component of the transformer model is the attention mechanism (hence attention is all you need). This mechanism allows the model to focus on different parts of the input sequence when generating the output sequence. The attention mechanism is used to calculate the attention scores between the query, key, and value vectors, and then use these scores to calculate the weighted sum of the value vectors.
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where:
- Q is the query matrix,
- K is the key matrix,
- V is the value matrix,
- d_k is the dimensionality of the key vectors,
- T denotes the transpose of a matrix.
The division by sqrt(d_k) scales the dot products, which helps to stabilize the gradients during training. The softmax function turns the scaled dot products into attention weights, which are then used to compute the weighted sum of the value vectors.
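Putting the formula together, here is a small NumPy sketch of scaled dot-product attention. The matrix shapes and random values are illustrative only.

```python
# A NumPy sketch of scaled dot-product attention for a sequence of
# 4 tokens with d_k = d_v = 8 (arbitrary illustrative sizes).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) scaled dot products
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # (4, 8): one context-aware vector per token
```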
Multi-head attention
The multi-head attention mechanism is a way to allow the model to focus on different parts of the input sequence at the same time. This is done by using multiple sets of query, key, and value vectors, and then concatenating the results of the attention mechanism for each set of vectors.
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
where:
- head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i),
- W^Q_i, W^K_i, W^V_i are the weight matrices for the i-th head, and W^O is the output weight matrix,
- h is the number of heads.
Multi-head attention enables massive parallelization of the attention mechanism (each head can be computed independently), which is a key factor in the speed of transformer models in GPU-enabled environments, and it allows each head to focus on different kinds of relationships within the input sequence.
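For completeness, here is a NumPy sketch of multi-head attention built on the single-head function above. The head count, dimensions, and random weight matrices are made-up illustrative values.

```python
# A NumPy sketch of multi-head attention: project, attend per head,
# concatenate the heads, then apply the output projection W_O.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V

def multi_head_attention(Q, K, V, weights_q, weights_k, weights_v, W_O):
    heads = [attention(Q @ Wq, K @ Wk, V @ Wv)
             for Wq, Wk, Wv in zip(weights_q, weights_k, weights_v)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))             # token representations
weights_q = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
weights_k = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
weights_v = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))  # output projection

print(multi_head_attention(X, X, X, weights_q, weights_k, weights_v, W_O).shape)  # (4, 16)
```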
Inference
Inference with transformer-based models can really happen in 2 ways:
1. Encoder — where the model is used to generate embeddings for the input sequence. This is the case when we want to use the model to obtain text embeddings for similarity search. The OG BERT model popularized this paradigm, but a good modern option is the gte-small model from thenlper: a compact transformer-based model that strikes a nice balance between speed (thanks to its relatively low parameter count and embedding dimensionality of 384) and quality (thanks to its training on a large corpus of text data).
2. Decoder — where the model is used to generate the output sequence given the input sequence. This is the case when we want to use the model to generate text, translate text, or perform other tasks that require generating a sequence of tokens. This class includes the GPT family of models.
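As a counterpart to the encoder sketch shown earlier, here is a small decoder-style example that generates a continuation with GPT-2 through the transformers text-generation pipeline. GPT-2 stands in for the GPT family here, and the prompt and generation settings are arbitrary.

```python
# A decoder-style sketch: generating a continuation of an input sequence
# with GPT-2 via Hugging Face's text-generation pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Similarity search is the task of"
result = generator(prompt, max_new_tokens=30, num_return_sequences=1)

print(result[0]["generated_text"])  # the prompt plus the generated continuation
```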
Computing Similarity
There are various methods for computing similarity between two text embeddings, but one of the most common is the cosine similarity.
The formula for cosine similarity between two vectors x and y is given by:
cosine(x, y) = (x • y) / (|x| |y|)
where:
- x • y denotes the dot product of vectors x and y,
- |x| and |y| denote the magnitude (or norm) of vectors x and y, respectively.
In the context of text embeddings, x and y represent the high-dimensional vectors obtained from embedding two pieces of text. In our case, these vectors would be 384-dimensional, as defined by the pre-trained model (thenlper/gte-small), and the similarity score varies between -1 and 1, where:
- 1 means the vectors are identical (in terms of direction),
- 0 means the vectors are orthogonal (independent or no similarity),
- -1 means the vectors are diametrically opposed (they point in exactly opposite directions).
In practice, for text similarity tasks, the focus is typically on the positive range [0, 1], with values closer to 1 indicating higher similarity between the texts represented by the vectors.
To simplify calculations, we can normalize the vectors to have a magnitude of 1 such that cosine(x, y) is equivalent to x • y.
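Here is a short NumPy sketch of cosine similarity, including the normalized-vector shortcut. The two random vectors stand in for real 384-dimensional embeddings.

```python
# A NumPy sketch of cosine similarity and the unit-norm shortcut.
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

rng = np.random.default_rng(0)
x = rng.normal(size=384)   # placeholder for a real text embedding
y = rng.normal(size=384)

print(cosine_similarity(x, y))

# After normalizing to unit length, the plain dot product gives the same score.
x_hat = x / np.linalg.norm(x)
y_hat = y / np.linalg.norm(y)
print(np.dot(x_hat, y_hat))
```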
Visualizing Embeddings
Recall that the embeddings come from the model's last hidden state, just before the final softmax layer that produces the probability distribution over the vocabulary.
Embeddings allow us to identify examples of text that are similar to a given query text without having to rely on exact string matching, fuzzy full-text search, term frequency, or other less efficient methods.
Some similar examples which all contain text about Boston ground transportation:
Notice that the above embeddings share a similar distribution (heatmap pattern) for the 3 inquiries about Boston ground transportation, while the below inquiries have a different pattern.
As we are dealing with normalized vectors, the cosine similarity is a measure of the angle between the vectors. This means that the cosine similarity is invariant to the magnitude of the vectors, and only depends on the direction of the vectors.
Of course, these embeddings are in much higher-dimensional spaces (384 dimensions in this case), so we can’t visualize them directly. However, we can use dimensionality reduction techniques (such as UMAP, t-SNE, or PCA) to project these high-dimensional vectors into 2D space, and then visualize them.
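As a sketch of that workflow, here is one way to project embeddings to 2D with PCA from scikit-learn and plot them; UMAP or t-SNE could be swapped in the same way. The random embeddings are placeholders for real model output.

```python
# A sketch of projecting high-dimensional embeddings into 2D for plotting.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 384))   # placeholder for real (n_texts, 384) embeddings

coords = PCA(n_components=2).fit_transform(embeddings)  # reduce 384 dims to 2

plt.scatter(coords[:, 0], coords[:, 1])
plt.title("2D projection of text embeddings")
plt.show()
```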
In this example, we can see that the 2D representations (coordinate locations) of the three inquiries about Boston ground transportation are close together, indicating that they are similar. Other topics, such as airlines and tickets, are further away from them but close to each other, indicating some semantic relationship encoded in the latent representations.
Conclusion
That’s all for now! Hopefully, this little primer shed some light on the inner workings of transformer networks and helped you peer into the black box. We went pretty deep in the weeds with this one, so if you stuck with me till the end, I appreciate it!
Please follow me for more on software engineering, data science, machine learning, and artificial intelligence.
If you enjoyed reading, please consider supporting my work.
I am also available for 1:1 mentoring & data science project review.
Thanks! 👋🏻
References
Papers
- Language Models are Few-Shot Learners — Brown et al. 2020
- Attention is All You Need — Vaswani et al. 2017
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin et al. 2018
- UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction — McInnes et al. 2018
- t-SNE Theory Paper — Cai, Ma 2021
- GPT-4 Technical Report — OpenAI 2023
Models
- thenlper/gte-small — Hugging Face
Articles
- Understanding Encoder And Decoder LLMs — Sebastian Raschka
- Text Embeddings Visually Explained — Cohere.ai
- Byte-Pair Encoding tokenization — Huggingface
- Classical ML Equations in LaTeX
- The Beginner’s Guide to Text Embeddings — Deepset.ai