From the course: Introduction to Large Language Models (LLMs) and Prompt Engineering by Pearson
Popular modern LLMs
So, let's turn to some popular modern LLMs. I've name-dropped them already, but let's dig a little bit deeper. The first one is called BERT, which stands for Bidirectional Encoder Representations from Transformers. Hopefully, with this brief intro, you now have a little more sense of what those words mean. The B, bidirectional, refers to the fact that BERT is an auto-encoding language model: it has access to context both before and after that blank spot, hence bidirectional. The E, encoder, comes from the fact that the transformer, being originally a sequence-to-sequence model, had two parts: the encoder, the part that took in text, and the decoder, the part that outputs text. BERT only uses the encoder, the part that takes in text, the auto-encoding part of the transformer; you can see where the terms come from now. R stands for representation, and it comes from the fact that the representations of text, those vectors, are being generated through attention mechanisms. And T is for transformers, not the movie, but the 2017 deep learning architecture that is now the parent model for all of the LLMs we'll be talking about throughout our time together. BERT was developed by Google in 2018, within a year of the original transformer paper coming out, and it is considered, well, it is, one of the first large language models out there. The original size had just over 100 million parameters, which is roughly where I put the floor for what counts as large, because BERT, in its own right, should be considered a large language model. Now, specifically, as I mentioned, BERT excels at natural language understanding, NLU. So if you're thinking about using a model for, let's say, classification or semantic search, both things we will be doing together, BERT is, frankly, one of the best open-source options you have. 
With any LLM, any LLM, you, my dear listener, my dear viewer, should be thinking about what went into training that model and whether that data was obtained ethically and fairly. So for the models I talk about throughout our time together, I will make a very conscious effort to explain and show you what data each one was trained on and what that actually means, technically speaking, for the task we are trying to accomplish. BERT, at least in its original form that we will be using throughout our time, and frankly for the later versions as well, was trained on two main corpora: English Wikipedia, which at the time was 2.5 billion words, and I have a link to it here if you want to download it yourself, it's a completely open dataset; and the BookCorpus, which is also available for free on Hugging Face and is about 800 million words. These two corpora together are what let BERT learn about language. And if you're wondering how: through language modeling. BERT would take a paragraph from Wikipedia or BookCorpus, randomly replace words with blank spaces, and test itself: can I fill in the blank? And it knows the answer. The answer was there; it just removed it first. This kind of self-supervised pre-training is what lets any LLM learn not just semantics, words, tokens, and meanings, but it also helps them, for the first time, see how information is written. English Wikipedia, for the most part, is pretty factual, so as it performs language modeling, BERT is also picking up on some factual information. Now, this is not going to matter too much for BERT, but it will start to matter severely, if I may be frank, with autoregressive models. When they start to speak back at us, we really start to see what they learned during pre-training. The next model we'll be using together is T5, or the Text-to-Text Transfer Transformer. Text-to-text refers to the fact that it uses the original format of a transformer: sequence-to-sequence. 
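That fill-in-the-blank setup is easy to sketch in plain code. The snippet below is a toy illustration of how masked language modeling manufactures its own training pairs from raw text; it is not BERT's actual preprocessing (which masks subword tokens with extra rules), just the core idea that the model hides words from itself and keeps the answer key.

```python
import random

def make_masked_example(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Turn a plain sentence into a (masked input, answer key) pair,
    the way masked language modeling sets up its fill-in-the-blank quiz."""
    rng = random.Random(seed)
    masked, answers = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            answers[i] = tok  # the model "knows" the answer: it removed it itself
        else:
            masked.append(tok)
    return masked, answers

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, answers = make_masked_example(tokens, mask_rate=0.3)
print(masked)   # some words replaced with [MASK]
print(answers)  # positions mapped back to the hidden words
```

Because the answers come from the text itself, no human labeling is needed, which is exactly what makes this pre-training self-supervised and lets it scale to billions of words.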
It has both the encoder and the decoder. This model was created only two years after BERT, in 2020. There has been a refresh of T5 in the last year, I should say, but it's not that old of a language model either. The fourth T, transfer, implies that it relies on something called transfer learning, a topic we'll be discussing in much more detail later on. But what it basically means is the ability of an LLM, or any model I should say, but in our case LLMs, to pre-train on data, say Wikipedia and BookCorpus, and transfer that learning to another task, whether it be classification, a chatbot, or writing emails back to whomever with personality. That transferring of knowledge from pre-training to fine-tuning is what the T in transfer refers to. And of course, we'd be remiss without our final T, transformer, a nice homage to the fact that all of these models are built from the transformer. Now, T5 is actually extremely interesting because it's one of the first models, LLMs I should say, to tout the ability to solve multiple natural language processing tasks out of the box. T5 in 2020, mostly among researchers I should say, is one of the first times that we're really seeing, especially in an open-source setting, an LLM doing stuff that humans want it to do: translation, correcting grammar, answering questions in some cases, without having to do much to the model itself after the researchers have released it into the wild. So people start to get a sense that, hey, these LLMs aren't just great at classification, you know, pretty classical NLP tasks, but they're also showing promise at performing human tasks, like, can you translate this from English to Turkish or vice versa? And part of the reason that's true is that when pre-training T5, the researchers at Google did two things. 
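The "text-to-text" framing is worth seeing concretely: every task, translation, summarization, grammar checking, gets cast as "string in, string out" by prepending a task prefix, so one sequence-to-sequence model can juggle them all. The prefixes below mirror the style used in the original T5 paper; the helper function and example sentences are my own illustration, not T5's actual code.

```python
# A minimal sketch of T5's text-to-text framing: every NLP task becomes
# "input string in, target string out", distinguished only by a prefix.

def to_text_to_text(task: str, text: str) -> str:
    """Prepend a task prefix so one seq-to-seq model can handle many tasks."""
    prefixes = {
        "translate_en_de": "translate English to German: ",
        "summarize": "summarize: ",
        "grammar": "cola sentence: ",  # CoLA = grammatical-acceptability task
    }
    return prefixes[task] + text

print(to_text_to_text("translate_en_de", "The house is wonderful."))
print(to_text_to_text("summarize", "T5 casts every task as text in, text out."))
```

The payoff of this uniform format is exactly the transfer learning described above: the model doesn't need a new output head per task, just a new prefix and some examples.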
One, they used a pretty extensive open internet dataset called Common Crawl, available at commoncrawl.org, which includes a very diverse set of text. If I were to nitpick a few things about BERT, one of them is that it only used English Wikipedia; T5 had access to, I shouldn't say an equal representation of languages, but a better representation of non-English languages. Two, the researchers at Google also took the time to teach T5 to perform a few tasks, for example translation, grammar correction, and summarization. So the researchers did take the time to teach T5 the basics of these tasks, but they found that it was able to take in a relatively small amount of, say, summarization data and transfer it to a larger use case, where it could summarize text that it had really never seen in its pre-training. So again, in 2020, a couple of years ago, this is one of the first times an LLM is really saying, okay, I've seen data and I've seen basic examples of solving a task; I think I've got this, I think I can do this for most other data you throw at me. And that was promising. And this is really where the game starts to change. Because GPT, or the Generative Pre-trained Transformer, was first developed in 2018, funny enough a couple of months before BERT, GPT-1 at least. There's that G, generative, coming from natural language generation, GPT being an autoregressive language model; P, pre-trained, having been pre-trained on a large amount of data; and of course, T for transformer. Now, GPT relies on the transformer's decoder, the part that spits out text, and excels at natural language generation, i.e. generative AI. These tasks include summarization and creative writing, you know, write me a tweet, write me a business plan, write me a travel itinerary for Paris or whatever. So GPT takes that idea of an LLM being able to perform multiple tasks and really takes it to the next level. 
And we'll see how they really took that to the next level in our coming time together. But I should note, again, that GPT refers to a family of models, starting in 2018 with about 100 million parameters; we are currently at GPT-4, which was released in March of 2023. It is larger, more capable in general, and it also has the promise of multimodality, a term that will come up a few times, referring to the fact that GPT-4, in theory, is capable of thinking about, reasoning through I should say, an input of both text and images. Now, in a later lesson, we're going to be making our own multimodal model from scratch that can take in an image and text and perform a generative task at the end. But this progression over five years shows a few step functions, one of which is actually from 2022. GPT-3.5, aka ChatGPT, released in 2022, included a new form of fine-tuning. Well, I shouldn't say new, but a now very popular way of fine-tuning LLMs called reinforcement learning from human feedback. And this was all done to perform alignment. We're going to talk about alignment more in our time together, but to set the scene, an alignment task refers to how a language model encodes and interprets an input from a human and actually responds in a way the human was expecting. That sounds vague, but when we talk to ChatGPT, or GPT-4, or Anthropic's Claude 2, or whatever, when we ask the LLM to solve a problem, we have a sense of what we're expecting it to say, right? Some structure: well, I asked it to translate, so I'm expecting some French at the end. But that thinking, aligning an LLM's output to what a human actually wants, is actually not that old. As an example, I have here GPT-3 before and after alignment, because GPT-3 was one of the first, if not the first, LLM to have this large-scale process of alignment done on it. GPT-3 came out originally in 2020, the same year as T5, for context. 
GPT-3 in 2020 was nothing more than an autoregressive language model that had seen a large amount of data and had 175 billion parameters. It was large; it was the largest, or one of the largest, LLMs at the time. But, and this model is still available if you want to try it in their playground, if you ask it a question like, is the Earth flat, you do get an answer, technically: yes. But then it also starts going off the rails even more. What is the fastest way to travel from east to west? The fastest way to travel from east to west is by going south to north. Are two east-west roads the same? Yes. What is it doing? Well, what it's doing is its job: autoregressive language modeling. It sees "is the earth flat" and does not see a question; it sees text waiting to be auto-completed. That's all it sees. And this is often the first step in creating something like ChatGPT: you do have to train it to complete a thought. Here is some text, what comes next? Don't think about what they're asking, just what do you think comes next? And according to GPT-3 in 2020, this is what came next. In 2022, OpenAI put out a new version of GPT-3, which they informally called InstructGPT, that had a new form of fine-tuning after the fact. So it was normal GPT-3 plus reinforcement learning, or alignment, I'll call it for now. And if you ask that model, is the Earth flat, it doesn't try to complete the thought anymore. It's now been fine-tuned to align itself, to expect a task and give an answer. So, is the Earth flat? No, the Earth is not flat, blah, blah, blah. So alignment is a pretty big step function in how we humans perceive the output of an LLM like GPT-3 or 4 or beyond. Now, GPT still has some pre-training. For example, GPT-2, which is open source and which we will be using as well, was pre-trained on about 40 gigabytes of text from the internet. 
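That "it only sees text waiting to be auto-completed" behavior can be demonstrated with something far smaller than GPT-3. The toy below is not a transformer at all, just bigram word counts fit on a tiny made-up corpus, but it continues a prompt the same way an unaligned autoregressive model does: by appending whatever statistically tends to come next, with no notion that it was asked a question.

```python
from collections import Counter, defaultdict

# Tiny hypothetical "internet" corpus, deliberately full of flat-earth text
corpus = (
    "the earth is flat the earth is flat according to some old maps "
    "is the earth flat yes the earth is flat"
).split()

# Count which word tends to follow which (a bigram model)
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def complete(prompt, n_words=4):
    """Greedily append the most common next word; questions are just text."""
    words = prompt.split()
    for _ in range(n_words):
        candidates = following.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(complete("is the earth flat"))  # it just continues its corpus statistics
```

Ask this model "is the earth flat" and it happily extends the sentence with more of what its corpus said, which is exactly the failure mode alignment fine-tuning was introduced to fix.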
And if you read their paper, which you are free to do, but I'll quote it right now, the way they got that text for GPT-2 is that they scraped all of the outbound links from reddit.com which received at least three karma, which doesn't really mean that much, and that resulted in 45 million links and the subsequent data they scraped from those websites, regardless of copyright or anything else. That data was used to train GPT-2. Now, this is going to come back to bite us in our code in later lessons, but I want to reiterate the fact that open-source models like GPT-2 still have a history of what they were trained on. And that history can be quite old, to put it mildly. Reddit data from 2019 and before is going to be filled with, let's say, not-so-factual and fun information. And we're going to see that rearing its ugly head in a later lesson when we attempt, and succeed, at instruction-aligning GPT-2, meaning we're going to train and force GPT-2 to answer my questions just like ChatGPT and GPT-4 can. Now, GPT-3 was pre-trained on 45 terabytes of text, which, I should be clear, included the same Reddit data, Common Crawl, which is what T5 was trained on, and more. So there was a pretty huge explosion in how much data was used to train GPT-3. I should also say that the data used to train GPT-4 is relatively unknown. Because OpenAI, being a for-profit company, did not release that information, and they don't have to, we don't actually know how GPT-4 was trained. There are also other models still coming out today, by the hour frankly. A more modern one is called Llama 2, which was released by Meta in 2023. LLaMA, which stands for Large Language Model Meta AI, is our first acronym that does not pay homage to the transformer. Llama 2 is one of the more capable open-source LLMs. It's actually a family of models; there are three of them, I should say, kind of, as we'll see. 
But they're trained similarly to GPT models, and they're open source, so we can take them and host them locally. And because their license allows for commercial use, we can also use them on custom infrastructure for our own tasks. This is actually taken from the Llama 2 website: it was trained on instruction data, supervised fine-tuning as they call it, and was aligned using RLHF, reinforcement learning from human feedback, a term we will come back to in much more detail, which they call human preferences, just like how modern GPT models are trained. So the family of 7 billion, 13 billion, and 70 billion parameter models has two versions. The pre-trained version is simply just language modeling, autoregressive language modeling, on 2 trillion tokens, with a context length, an input maximum size, of 4,096 tokens. That pre-trained model is then fine-tuned for chat use cases, which again is just a way to let you know that you can talk to it, and that it can also answer questions and reason through tasks well. They do this through both supervised fine-tuning, which we will see in our code, and through reinforcement learning from human preferences, and they have a lot of data examples for that. We're going to be doing our own version of Llama 2 with GPT-2, to see the differences between fine-tuning a 70 billion parameter model and a 100 million parameter model, and where those differences, and the cracks, really start to show. Now, like GPT-4, the paper for Llama 2 says it was trained on 2 trillion tokens of data. Compare that to BERT's 2.5 billion words from Wikipedia plus 800 million from BookCorpus; that's a large amount of data here. But the paper never specifies what data it was trained on, just that it was from the web and mostly in English. So again, even in 2023, we are still training LLMs on mostly English data, which is why a lot of non-English speakers don't get as much use out of these models as someone who speaks English might. 
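To make "supervised fine-tuning on instruction data" a bit more concrete: the data is essentially (instruction, response) pairs flattened into single training strings that the model learns to complete. The template below is a generic, hypothetical one for illustration; it is not Meta's actual Llama 2 chat format, which uses its own special tokens.

```python
# A generic sketch of instruction (supervised fine-tuning) data layout:
# each (instruction, response) pair becomes one training string.
# This template is hypothetical, not the real Llama 2 chat format.

examples = [
    {"instruction": "Is the Earth flat?",
     "response": "No, the Earth is not flat."},
    {"instruction": "Translate 'hello' to French.",
     "response": "Bonjour."},
]

TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def to_training_text(example: dict) -> str:
    """Flatten one instruction/response pair into the string the model sees."""
    return TEMPLATE.format(**example)

for ex in examples:
    print(to_training_text(ex))
    print("---")
```

When we instruction-align GPT-2 in a later lesson, we'll be building and training on data shaped much like this, just at a far smaller scale than Meta's.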
And it also speaks to, potentially, some of the rising legal controversies over the types of data that can be used to train LLMs. This is not an accusation of any kind, but when companies like Meta and OpenAI don't tell us what data something was trained on, you might get suspicious: well, why not? Is it proprietary, is it just secret sauce, or could there be something else? We have no idea, but anyone who wants to use an LLM commercially, myself or you, or even privately, should be aware of how these models were trained and whether they were trained ethically and fairly. Now, another thing that will come up is the size of LLMs: how big is too big? How big is not big enough? What size of LLM should I be using? Again, BERT has about 110 million parameters, which is considered large, but tiny compared to GPT-3 or beyond, which has at least 175 billion parameters. But size is not the only factor. In fact, in many cases, BERT, comparatively tiny, will achieve stronger results on encoding-type tasks, like classification or semantic retrieval, as we will see in our next lesson, than its massive closed-source OpenAI counterparts.