From the course: Generative AI: Introduction to Large Language Models

The Transformer architecture

- [Instructor] You know that expression, when you have a hammer, everything looks like a nail? Well, in the world of natural language processing, it seems like we really have discovered a magical hammer for which everything is, in fact, a nail. This hammer is called a transformer. Transformers are essentially a new type of neural network architecture that uses a mechanism known as self-attention to capture the relationships between words. Almost every large language model is built using a transformer. Transformers were first introduced in a 2017 paper titled "Attention Is All You Need." The attention part of the paper title is very important. It is a key feature of the transformer and one of the most important contributions it makes to the field of natural language processing. Later in the course, I will discuss the importance of the attention mechanism and illustrate how self-attention works.
At a high level, the transformer architecture consists of an encoding component and a decoding component. I discuss the specifics of the encoding and decoding components in the next course video. The encoding component of a transformer accepts an input sequence and maps it into an abstract continuous representation that holds all the learned information about the input. The decoding component then takes that continuous representation and sequentially generates output while also being fed its previous outputs.
For example, let's consider a transformer that has been trained to translate between English and Spanish. We can start by providing the model with the English greeting hi, how are you? This entire input is passed all at once to the encoding component. Note that a transformer is a neural network, and neural networks work with numbers. So when we pass a sequence of words or tokens through the transformer, each word or token has to be represented as a numerical vector known as an embedding. For example, before the word hi can be passed to a transformer, it would have to be represented as an embedding that looks like this, and the same goes for how, are, and you.
The encoding component accepts all the vectors in the input sequence at the same time, so it has to somehow account for word order. To accomplish this, transformers add a vector to each input embedding before the embedding is passed to the encoding component, such that the transformer can determine the position of each word or the distance between words in the input sequence. This process is known as positional encoding. The folks who introduced the transformer came up with a clever trick to do positional encoding using the sine and cosine functions. I won't go into the mathematical details of the process, but the general idea is that for each position in the sequence, a positional vector is created whose even-numbered dimensions come from the sine function and whose odd-numbered dimensions come from the cosine function, and that vector is added to the input embedding. The sine and cosine functions were chosen in tandem because the encoding for one position can be expressed as a simple linear function of the encoding for another position, a regular pattern that the transformer can easily learn.
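To make the idea of embeddings a bit more concrete, here is a minimal sketch in Python of mapping words to numerical vectors. The vocabulary, the 4-dimensional embedding size, and the random values are all made up for illustration; a real transformer learns its embedding table during training and uses hundreds or thousands of dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and token-to-id lookup (illustrative only).
vocab = ["hi", "how", "are", "you"]
token_to_id = {token: i for i, token in enumerate(vocab)}

d_model = 4                                      # toy embedding size
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = ["hi", "how", "are", "you"]
token_ids = [token_to_id[token] for token in sentence]
embeddings = embedding_table[token_ids]          # one vector per input token

print(embeddings.shape)                          # (4, 4)
```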
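And here is a minimal sketch of the sinusoidal positional encoding described in "Attention Is All You Need": for each position, the even-numbered dimensions of the positional vector come from the sine function and the odd-numbered dimensions from the cosine function, and the resulting vector is added element-wise to that token's embedding. The function name and toy sizes below are my own choices.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position vectors."""
    positions = np.arange(seq_len)[:, np.newaxis]             # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                   # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                           # (seq_len, d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])    # odd dimensions use cosine
    return pe

# One positional vector per token in "hi, how are you?" (4 tokens, toy size 4),
# added element-wise to the input embeddings before they reach the encoder.
print(positional_encoding(seq_len=4, d_model=4))
```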
As the input embeddings with positional information flow into the encoding component, they are encoded and fed to the decoding component, which outputs a vector representation of its prediction. How do we turn that vector into a word? That's the job of the final two layers of the transformer, a linear layer followed by a softmax layer.
The linear layer is a neural network that projects the vector produced by the decoding component into a much larger vector called a logits vector. The size of the logits vector is determined by the number of unique words in the model's vocabulary. Let's assume that our model knows only 10 unique words. Each value in the logits vector corresponds to the score of one of those 10 words. The job of the softmax layer is to convert the logits scores into probabilities. The word with the highest probability in the model's vocabulary is chosen as the predicted output for this time step. In our example, we can imagine that the Spanish word hola will be predicted in the first time step. This output is then fed back to the decoding component to help inform, or provide context for, its prediction in the next time step. This is known as autoregression. In the next time step, based on the initial input as well as the prediction from the previous time step, the model would predict cómo. Both words, hola and cómo, are then fed back into the decoding component to help inform the next prediction, which is estás. In the next time step, all three words are fed back into the decoding component. This time, the decoder determines that the translation is complete and returns an end-of-sequence token. This concludes the language translation process.
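To illustrate those final two layers, here is a minimal sketch of projecting a decoder output vector into a logits vector over a 10-word vocabulary and converting the scores into probabilities with softmax. The vocabulary, weights, and decoder output below are random placeholders, not values from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the model knows only these 10 words (illustrative vocabulary).
vocab = ["hola", "cómo", "estás", "bien", "gracias",
         "adiós", "sí", "no", "buenos", "<eos>"]
d_model = 4

decoder_output = rng.normal(size=d_model)        # vector from the decoder
W = rng.normal(size=(d_model, len(vocab)))       # linear (projection) layer
logits = decoder_output @ W                      # one score per vocabulary word

probs = np.exp(logits - logits.max())            # softmax, numerically stable
probs /= probs.sum()

predicted_word = vocab[int(np.argmax(probs))]    # highest-probability word wins
print(predicted_word, round(float(probs.max()), 3))
```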
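And here is a minimal sketch of the autoregressive loop itself: each predicted word is fed back in for the next time step until an end-of-sequence token appears. The decode_step function is a hypothetical stand-in that simply replays the expected translation, so only the loop structure, not the model, is real.

```python
def decode_step(encoded_input, outputs_so_far):
    """Hypothetical decoder step: returns the next token given the encoder
    output and everything predicted so far (a real model would compute this)."""
    reference = ["hola", "cómo", "estás", "<eos>"]
    return reference[len(outputs_so_far)]

encoded_input = "continuous representation of 'hi, how are you?'"  # placeholder
outputs = []                      # previously predicted tokens

while True:
    next_token = decode_step(encoded_input, outputs)
    if next_token == "<eos>":     # decoder signals the translation is complete
        break
    outputs.append(next_token)    # fed back in at the next time step

print(" ".join(outputs))          # hola cómo estás
```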
