From the course: Introduction to Large Language Models (LLMs) and Prompt Engineering by Pearson

What are large language models?

To understand where we are today, let's start in 2001, when the first popularized deep learning for language modeling comes about. Around this time, researchers and practitioners alike are starting to recognize an emergence in language processing, with a combination of deep learning and Word2Vec at their disposal. This is also around the time we start to see people talking, again, not for the first time, but more and more, about bias and ethical fairness in language modeling as it relates to deep learning. The next step function comes pretty shortly after. Between 2014 and 2017, we start to see a rise in popularity of a specific kind of deep learning architecture as it relates to natural language processing. This is referred to as a sequence-to-sequence model, which is not necessarily new at this time. However, around 2017, in the latter half of that era, we start to see the use of a new mathematical mechanism called attention being applied to language models that are deep learning models, which are sequence-to-sequence. I know that was a bit of a mouthful, but the real takeaway here is that between 2014 and 2017, researchers and practitioners alike are starting to recognize the power of a certain mathematical mechanism, attention, especially as it relates to natural language processing. Which leads us to 2017, our major step function, and really what this whole video and everything I'll be talking about is based on: the introduction of a new deep learning architecture called the Transformer. The 2017 paper was actually called "Attention Is All You Need." So the title of the paper wasn't even Transformer; it was about the attention mechanism.
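The attention mechanism that gives the paper its name can be sketched in just a few lines. Below is a minimal, illustrative Python version of scaled dot-product attention for a single query; real Transformers use learned projection matrices across many heads, and the vectors here are hand-picked purely for demonstration:

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.
    Score each key against the query, softmax the scores, then mix the values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted average of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# The query matches the second key more closely, so the output leans
# toward the second value vector.
out = attention([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]])
print(out)
```

Because the query lines up with the second key, the output is weighted toward the second value vector. Stacking many of these weighted mixes, with learned projections, is essentially what a Transformer layer does.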
The paper postulates that a deep learning model may only require massive amounts, relatively speaking, of attention mechanisms, as opposed to former state-of-the-art models like RNNs and LSTMs, which did not use attention unless you specifically added it to enhance their power. So in 2017, we start to see this new architecture coming about called the Transformer. And the Transformer, shown in the image here, is a sequence-to-sequence model, which means it takes in a sequence of data and outputs a sequence of data. Now, I'm being pretty vague there, but in our terms, a sequence of data will almost always, almost always, be tokens, or subunits of words and phrases, for natural language processing purposes. So in our case, the Transformer, for the most part, will be taking in sequences of tokens and outputting sequences of tokens. Now, this Transformer model ends up being the parent of most large language models that we know today, like GPT, or BERT, or T5, and many more, which we will see throughout the course of our video series. I promised you a definition of language modeling, and I'm going to give you one. But first, an example. If I were to show you the phrase "If you don't ___ at the sign, you will get a ticket," your brain has probably already done the math, or the natural language processing, here. But if I ask you to fill in the blank, you might have a few options. Now, I'm going to guess that most of you probably thought "stop": if you don't stop at the sign, you will get a ticket. I'll also bet that some of you, if not the majority of you, thought to yourself, well, it could be "stop," it's probably "stop," but that's not the only word that would fit. I could say "yield," and it would technically still be grammatically and, more importantly, semantically correct. The sentence would still make sense if you said "yield." It's just not as common a term as stopping at a sign.
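The stop/yield intuition is exactly what a language model formalizes: given context, assign higher probability to likelier fill-ins. Here's a deliberately tiny toy sketch in Python. The corpus and the counting rule are invented for illustration; a real LLM scores every token in its vocabulary with a neural network rather than counting words:

```python
# Toy "language model": score candidate fill-ins by how often each word
# appears before "at" in a tiny, invented training corpus.
from collections import Counter

corpus = (
    "you must stop at the sign . "
    "stop at the stop sign . "
    "yield at the yield sign . "
    "drivers stop at the sign ."
).split()

# Count the words that immediately precede "at" in the corpus.
counts = Counter(
    corpus[i] for i in range(len(corpus) - 1) if corpus[i + 1] == "at"
)

def fill_blank(candidates):
    """Pick the candidate that occurred most often, i.e. the most probable fill-in."""
    return max(candidates, key=lambda w: counts[w])

print(fill_blank(["stop", "yield", "banana"]))  # -> stop
```

"stop" wins because it appears before "at" most often, while "yield" is possible but rarer, which mirrors the intuition above: both fit, but one is more probable.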
Auto-encoding models, by contrast, excel at processing entire sequences of tokens to perform a task on an entire sequence, as opposed to simply trying to speak back one word or one token at a time. As we will see many times, depending on the task we are trying to perform, we will either be turning to an autoregressive or an auto-encoding model. And I will be sure to point out along the way why I am choosing a particular type of language model for a particular task. So that's language modeling in a nutshell. But how large is a large language model, or an LLM, as I'll probably refer to them from now on? LLMs are language models, period, and then some: models with many parameters, or weights and biases, or, if you're familiar with deep learning at all, simply a large model size on disk and in memory. Now, there is no textbook definition for how large a language model has to be before we consider it large. But I'll throw a floor at you and say that at least 100 million parameters is what I'm going to consider a large language model. Now, this is actually quite puny compared to some of the models coming out today, which have billions of parameters. Some of them are rumored, rumored I should say, to have trillions of parameters. But as a floor, let's say that 100 million is the minimum number of parameters a model would have to have to be considered large. This would actually include the original versions of GPT-1 and the original versions of BERT, for example. Massively large language models, like ChatGPT or GPT-4 or Llama 2, have billions, or tens of billions, or hundreds of billions, or, again, rumored trillions of parameters. And these are pre-trained on huge datasets of knowledge. They basically read and perform language modeling on huge amounts of data. And we'll start to see the scale of those large amounts of data quite shortly. Now, LLMs are trained on this data in order to capture the complexities, diversities, and nuances of human language.
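To make that parameter floor concrete, parameter count translates directly into size on disk and in memory. A quick back-of-the-envelope sketch; the 7-billion figure below is just an illustrative model size, not a claim about any specific model:

```python
# Back-of-the-envelope model size from parameter count and precision.
# The parameter counts here are illustrative, not official figures for any model.
def model_size_gb(num_params, bytes_per_param=4):
    """4 bytes per parameter = 32-bit floats; 2 bytes = 16-bit floats."""
    return num_params * bytes_per_param / 1e9

print(model_size_gb(100_000_000))       # 0.4 GB at 32-bit: our "large" floor
print(model_size_gb(7_000_000_000, 2))  # 14.0 GB at 16-bit for a 7B-parameter model
```

So even the 100-million-parameter floor is a few hundred megabytes, and billion-parameter models quickly outgrow a single consumer GPU's memory.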
LLMs are known to perform a wide range of tasks, from classification to generation, if you're familiar with GPT models. And frankly, they do so with high accuracy, fluency, and, in some cases, style. To use these massively large LLMs, we generally have to rely, depending on the company, on their Playgrounds and their APIs. Playground is not a new word by any means, but it has changed in connotation these days, especially among developers of LLMs, to convey a graphical interface in which a user like myself, and like you, can interact with the LLM: chat with it, talk to it, do something with it in some simplified graphical interface. APIs are APIs. They're programmatic interfaces to the LLM that we will use in spades, frankly, throughout our time together, because there's going to be a fair amount of code happening here. You've probably seen Playgrounds before, but I'm going to show you a Playground that is still active today for GPT-3. Now, I should be clear: at the time of recording, GPT-3 is no longer considered the state-of-the-art GPT model from OpenAI. That being said, as we will see in a later lesson, it still has some pros to it, some intricacies around it, that are actually going to be beneficial to some applications. So to that end, this is what the GPT-3 Playground looks like. We have this large area in the middle where you're supposed to write your prompt, your input to the model. We'll talk much more about prompts shortly. And on the side of the Playground, we have inference parameters. These are levers, or sliders, or dropdowns that we can change in order to alter how the output comes back to us. And we are going to get much deeper into those parameters in later lessons.
But one of the big takeaways that I want everyone to walk away with is that how we use ChatGPT on OpenAI's website is similar to, but can be wildly different from, how we are allowed to use GPT-3.5, which is the technical name for the model behind ChatGPT, or GPT-4, or any other LLM, frankly, through their Playground and API. On the website, we don't have access to parameters like temperature (whatever that is; we'll talk about that later), or the maximum length, or top-p. All of these things don't exist on the graphical user interface that most people use, but are highly available to anyone who wants to use them through the Playground and API. You're probably familiar with more of a UI like this. And again, we will be using OpenAI for many of our examples, but I would be remiss not to mention other LLM providers like Anthropic or Cohere, which we will talk about later on as well. But you're probably familiar with more of an interface like this, where it's more chat-like, right? You say something, it says something. You say something back, it says something back. This conversational interface has become the new playground that many people are familiar with, because when they think about an LLM, they're usually thinking about it like a chatbot, which is correct. They are specifically trained to act like chatbots. And that's great, and frankly, we will use them as chatbots in our lessons together. But we are also going to be using LLMs as what they really are under the hood: reasoning machines. LLMs are extremely powerful at taking in an input or an instruction from a human like myself (we call these prompts) and outputting an extremely tailored and personalized response to that prompt. And we'll see in later lessons how nuanced, specific, and personalized they can get. We'll be outputting JSON, we'll be outputting Python code, we'll be outputting the same response in different languages, styles, and personalities.
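Two of the Playground parameters just mentioned, temperature and top-p, are easy to demystify with a little math. Here is a self-contained sketch; the logits are made-up scores standing in for what a real model would produce for each candidate token:

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Indices of the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept

logits = [2.0, 1.0, 0.5, 0.1]            # made-up scores for four candidate tokens
print(softmax(logits, temperature=0.5))  # sharper: most mass on the first token
print(softmax(logits, temperature=2.0))  # flatter: mass spread across tokens
print(top_p_filter(softmax(logits), p=0.8))  # -> [0, 1, 2]
```

Lower temperature sharpens the distribution toward the most likely token, higher temperature flattens it, and top-p then restricts sampling to the smallest set of tokens whose probabilities add up to p. Those are exactly the sliders the Playground exposes.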
So it's not just an assistant that you can talk to; under the hood, it is really a reasoning machine, and part of that reasoning can be used to have a chat. Now, as I've alluded to and will continue to be more explicit about, there are trade-offs between different LLMs. As I've mentioned, on the closed-source, for-profit side of LLMs, we will mostly be focusing on OpenAI, with references to other companies along the way. But a lot of the time, we're also going to be talking about open-source models. And that universe is much bigger than what is for-profit and closed-source, like GPT-4. While we think about all of these LLMs, though, we start to think about the trade-offs between them. Auto-encoding models like BERT, which is, by the way, completely free and open source, these models, and families of models, I should say, are extremely quick at encoding semantic meaning for understanding tasks, right, natural language understanding (NLU). But they cannot, at least off the shelf, generate free text. They are simply incapable of it. Autoregressive models are sometimes referred to as causal models, just so you can see that term and know what it means. But autoregressive models, like the GPT family, are much slower, relative to auto-encoding models at least, at processing text if you compare them apples to apples at the same parameter sizes. They're slow to process text, but they have the power to speak and generate free-text responses for generation tasks. This is, by the way, the source of the term generative AI, which I frankly will not be using too much because it's not as specific and technical as I would like. But it comes from the fact that most LLMs we consider generative are autoregressive, natural language generation (NLG) models like GPT. So this is where that term really comes from.
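One concrete way to see the auto-encoding versus autoregressive split is in the attention mask each family uses. A small illustrative sketch follows; this is simplified, since real implementations build these masks as tensors inside the model:

```python
def causal_mask(n):
    """Autoregressive (GPT-style): position i may only attend to positions <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Auto-encoding (BERT-style): every position attends to every position."""
    return [[1] * n for _ in range(n)]

# Each row of the causal mask reveals one more position, which is why
# generation can only look backward; BERT's mask is all ones, which is why
# it can encode a whole sequence at once but can't generate one token at a time.
for row in causal_mask(4):
    print(row)
```

The triangular causal mask is what forces GPT-style models to produce text one token at a time, while the all-ones mask lets BERT-style models see the full sequence and encode it quickly, exactly the trade-off described above.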
And there are also combination models like T5, which can both encode quickly and generate text, but they are generally harder to train and require more data because you are training them to do two things at once, which is why most companies focus on one or the other. OpenAI, at least at the time of recording, does not make auto-encoding text models. They don't exist; they don't make those, at least not in an open-source sense. And if they use them, they're used within a larger task, like generating images with DALL·E 2, which we won't be talking about too much, but part of that system is a text-encoding component. That component, though, belongs to a larger system and isn't used on its own unless a developer wants to pull it out and apply it to another task. So most people will pick auto-encoding or autoregressive, and fewer companies will tackle the combination. But if they nail that combination, then companies like Google, which makes T5, can really demonstrate larger model capabilities at a smaller size.
