From the course: Generative AI: Introduction to Large Language Models
What is an encoder-decoder?
- [Instructor] Transformers, based on the 2017 paper "Attention Is All You Need," revolutionized the representation of words in natural language processing. Their ability to capture long-range dependencies, understand context, and learn hierarchical relationships within language data makes them incredibly useful for large language models. In the previous video, I mentioned that at a high level, the transformer architecture consists of an encoding component and a decoding component. The encoding component accepts an input sequence and maps it into an abstract continuous representation that holds all the learned information about the input, while the decoding component takes that continuous representation and sequentially generates output while being fed its own previous outputs. However, it's important to note that the encoding and decoding components of a transformer are actually a stack of encoders and decoders. As data flows out of one encoder, it flows into the next encoder in the stack. The same thing happens with the stack of decoders. By stacking multiple encoders and decoders this way, each encoder and decoder has the opportunity to focus on different parts of the input sequence and the previously generated outputs, thereby boosting the effectiveness of the transformer.

So what are encoders and decoders? In a language model, an encoder is a neural network that takes in a sequence of tokens, such as words or subwords, and encodes them into a fixed-size vector representation that captures the contextual information of the input. A decoder is also a neural network; it takes the representation generated by the encoder and uses it to generate output text. Decoders operate autoregressively, meaning that they generate one token at a time based on the previously generated text.

To understand how encoders and decoders are used in language models, let's walk through a toy example. Let's assume that we intend to build a model that predicts the most likely next word, given a sequence of input words. If our model only accepts three input words and is limited to eight words in its vocabulary, then the model could look something like this. This model accepts any of the eight possible words in the vocabulary as the first input, the second input, or the third input. It then predicts one of the eight possible words as the most probable next word, or output. So if we provide the model with the inputs "Once," "upon," and "a," we expect the model to generate the most probable next word, which is most likely "time."

As I mentioned, this is a toy example. It's obvious that a real-world model for this task would have to consider a lot more words than what we have here. The exact number of words in the English language is difficult to determine, as the language is constantly evolving. Some estimates put the number anywhere from around 170,000 to over a million words. Let's assume that the lower estimate is correct. This means that to build a similar model that can accept any word in the English language, the model would need 170,000 times three, or 510,000, input nodes and 170,000 output nodes: 680,000 nodes altogether. That is a lot of nodes, especially for a model that only accepts three input values. Such a model would take a lot of time and resources to train. This is where encoders and decoders can help. Consider the following sentences. Each one of them begins with "Once upon a time, in a kingdom far, far away, there lived a," followed by "king," "queen," "prince," or "princess."
A reasonable guess for the next word could be "castle," "kingdom," or "palace." Any of them would do. Why? Because there is an implied relationship between "king," "queen," "prince," and "princess." They are different words, but they all imply royalty, and as we know, royalty tends to live in a castle, a kingdom, or a palace. Why is this important? Well, when it comes to a language model, to minimize the model's complexity, it's beneficial to have words with almost the same meaning serve as close substitutes for each other.

Using a network that looks like this, an encoder can take the 170,000 words in the English language as input nodes and map them to a smaller set of nodes with the help of the decoder. This type of network is known as an autoencoder. The smaller set of nodes, or hidden nodes, is represented in what is known as a context vector. The values of the context vector are generated through a process known as self-supervision. During this process, the encoder takes in a word, encodes it, and passes it off to the decoder as input. The decoder then attempts to predict the same word as the output. This may seem a bit silly, and you may ask why we encode a word only to decode it again. The reason is that as an autoencoder learns to correctly predict the right word based on the input provided, it starts to make compromises and group words together, such that the context vector becomes a compressed representation of the semantic and contextual information of each input. For example, let's say that the words "king," "queen," "prince," and "princess" are represented by the following vectors. When fed to an autoencoder network, each input could be compressed into vectors that look like these. Notice that the vectors are almost the same, save for a slight difference in the second value. This suggests that all four words are similar and can serve as close substitutes for each other.

It's important to note that not all transformers make use of both an encoder and a decoder. Those that do are commonly used for sequence-to-sequence modeling tasks such as language translation, text summarization, and question-answering systems. Some popular encoder-decoder transformers include BART and T5. Encoder-only transformers are often used for natural language processing tasks such as sentence classification, masked language modeling, and named entity recognition. Popular encoder-only transformers include BERT, RoBERTa, and DistilBERT. Finally, decoder-only transformers, which include the GPT series, are commonly used for tasks such as text generation and causal language modeling.
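To connect the three transformer families just mentioned to something runnable, here is a minimal sketch, assuming the Hugging Face transformers library is installed. The specific model names (t5-small, bert-base-uncased, gpt2) are simply common, publicly available examples of each family, not the only options.

```python
# Minimal sketch of the three transformer families, assuming the Hugging Face
# "transformers" library is installed. Model names are illustrative examples.
from transformers import pipeline

# Encoder-decoder (sequence-to-sequence), e.g. T5: translation, summarization.
seq2seq = pipeline("translation_en_to_fr", model="t5-small")
print(seq2seq("Once upon a time, there lived a king."))

# Encoder-only, e.g. BERT: masked language modeling, classification.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Once upon a time, there lived a [MASK]."))

# Decoder-only, e.g. GPT-2: text generation (causal language modeling).
generator = pipeline("text-generation", model="gpt2")
print(generator("Once upon a time, in a kingdom far, far away,", max_new_tokens=20))
```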
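And to make the autoencoder idea from earlier in the lesson more concrete, here is a toy sketch, assuming PyTorch is installed. The eight-word vocabulary, the two-dimensional context vector, and the training setup are all made up for illustration; the point is only to show the encode-compress-decode loop and the self-supervised objective of predicting the input word back.

```python
# Toy autoencoder sketch (illustrative only), assuming PyTorch is installed.
import torch
import torch.nn as nn

vocab = ["once", "upon", "a", "time", "king", "queen", "prince", "princess"]
vocab_size = len(vocab)      # 8 input and output nodes in this toy example
context_dim = 2              # size of the compressed context vector

encoder = nn.Linear(vocab_size, context_dim)   # word -> context vector
decoder = nn.Linear(context_dim, vocab_size)   # context vector -> word scores

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Self-supervision: the input word is also the target the decoder must predict.
inputs = torch.eye(vocab_size)       # one one-hot vector per word
targets = torch.arange(vocab_size)   # each word should map back to itself

for step in range(500):
    optimizer.zero_grad()
    context = encoder(inputs)        # compress: 8 values -> 2 values
    logits = decoder(context)        # reconstruct: 2 values -> 8 word scores
    loss = loss_fn(logits, targets)
    loss.backward()
    optimizer.step()

# Inspect the learned context vectors for the "royalty" words.
with torch.no_grad():
    for word in ["king", "queen", "prince", "princess"]:
        print(word, encoder(inputs[vocab.index(word)]).tolist())
```

Keep in mind that this tiny reconstruction task alone won't necessarily place "king" and "queen" near each other; in practice, the grouping of related words comes from training on large amounts of text in which those words appear in similar contexts.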