From the course: Introduction to Large Language Models
Chinchilla
- [Instructor] Over the years, the trend has been to increase the model size. Although we won't look at any of these models in detail, I'll mention them briefly now because we'll be comparing them later. Megatron-Turing, released by a collaboration between Microsoft and Nvidia in January of 2022, had 530 billion parameters. The Google DeepMind team released details about Gopher, which had 280 billion parameters, and it was one of the best models out there at the time. You can see that the model sizes were getting very large, and this was because of the scaling laws. But what if the scaling laws didn't capture the entire picture?

The DeepMind team's hypothesis was that large language models were significantly undertrained: you could get much better performance with the same computational budget by training a smaller model for longer. Now, the way you test a hypothesis like that is to run a whole lot of experiments, and that's exactly what the Google DeepMind team did. They trained several hundred language models, varying their sizes from 70 million to over 16 billion parameters, and training them on different amounts of data, from 5 billion to 500 billion tokens. Based on their findings, they then created and trained Chinchilla, a 70 billion parameter model, on 1.4 trillion training tokens. One of the key benefits of Chinchilla was that it was a smaller model, which meant less compute was required for fine-tuning and inference. Remarkably, the 70 billion parameter Chinchilla outperforms Gopher, with its 280 billion parameters, GPT-3, with its 175 billion parameters, and Megatron-Turing, with its 530 billion parameters, on a large range of language tasks.

Now, let's think back to what we learned about scaling laws. One of the key insights from the original scaling laws paper was that if you're training a large language model and you get a tenfold increase in the computational budget, the majority of that should go towards increasing the size of the model, and a smaller proportion towards increasing the number of training tokens and training the model for longer. The DeepMind team behind the Chinchilla paper confirmed that it was important to scale both the model size and the training data, as the scaling laws suggested. But unlike the scaling laws, they found that the size of the model did not need to grow faster than the amount of training data. So for a tenfold increase in computational budget, the model size and the number of training tokens should be scaled in equal proportion.

Let's take a look at how they demonstrated this. Firstly, you'll notice that both the X-axis and the Y-axis are not linear: they're logarithmic, so each step along the X-axis multiplies the value by a hundred, and each step up the Y-axis multiplies it by ten. The reason for this is that the quantities being measured increase by enormous amounts, and you wouldn't be able to show the relationship easily on a graph otherwise. On the X-axis, we have FLOPs, or floating point operations, which are a measure of computation. So the further you go across on the X-axis, the more expensive training is, because you require more computational resources. On the Y-axis, we have the number of parameters for the models, so that's the size of the model, and as we go up, the models get significantly larger. On the graph you can see the GPT-3 model represented by the red star, the Gopher model represented by the yellow star, the Megatron-Turing model given by the purple star, and finally, the Chinchilla model represented by the green star.
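To make that equal-proportion rule concrete, here's a small Python sketch. It assumes the common rule of thumb that training compute is roughly 6 × parameters × tokens in FLOPs; that approximation isn't from the course, and the numbers are simply Chinchilla's own, so treat this as a rough illustration rather than the paper's actual method.

```python
import math

# Rough illustration of equal-proportion scaling. Assumption (not from the
# course): training compute is roughly C ≈ 6 * N * D FLOPs, where
# N = parameters and D = training tokens.

def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return 6 * n_params * n_tokens

# Chinchilla: 70 billion parameters trained on 1.4 trillion tokens.
chinchilla_flops = approx_train_flops(70e9, 1.4e12)
print(f"Chinchilla budget: ~{chinchilla_flops:.2e} FLOPs")

# A tenfold compute budget: scale parameters and tokens each by sqrt(10).
scale = math.sqrt(10)
new_params = 70e9 * scale
new_tokens = 1.4e12 * scale
ratio = approx_train_flops(new_params, new_tokens) / chinchilla_flops
print(f"~{new_params / 1e9:.0f}B params on ~{new_tokens / 1e12:.1f}T tokens "
      f"uses {ratio:.1f}x the compute")
```

Compare that with the original scaling-law recipe, which would put most of that tenfold increase into the parameter count instead.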
Chinchilla outperforms Gopher, GPT-3, and Megatron-Turing. This means it's better at a wide variety of tasks, like reading comprehension and question answering, including high-school-level questions on a variety of topics, from history to chemistry to astronomy. Now, what's interesting is that it was trained with around the same amount of compute as Gopher. I can confirm that because if I draw a vertical line straight down from the green Chinchilla star, Gopher also sits on that line, which means the two were trained with around the same amount of compute. So the takeaway is that Chinchilla is significantly smaller than the other models, as you can see on the Y-axis, yet it was trained with around the same compute and it outperforms them all.

Let's take a quick look at the data that the models were trained on. While most of the large language models have been trained on around 300 billion tokens, Chinchilla has been trained on 1.4 trillion tokens, which is almost five times as many tokens as the other large language models. This raises the question: are the massive language models being trained oversized? Let's use the Gopher model as our baseline. Remember that the Google DeepMind team released Gopher, which had 280 billion parameters. For their compute budget, the optimal model size would have been 67 billion parameters, trained on 1.5 trillion tokens. Gopher itself, at 280 billion parameters, was trained on only around 300 billion tokens. To train a model of that size optimally, the training budget would have needed to be 17.2 times larger, and it would have required 5.9 trillion training tokens. It doesn't mean that you can't train these large models; it's just that these language models have not been trained with enough data.

The DeepMind team designed an interesting experiment to compare their findings with the original scaling laws. Given a compute budget of a certain number of FLOPs, determine the number of parameters and the amount of training data required, first using the scaling laws prescribed by OpenAI and then using the ones determined by DeepMind, and see which approach results in the more performant model. The way you compare the two models is to give them a whole load of tests, from generating text to question answering to high school questions on a variety of subjects, and see which one gets the most correct answers. With the scaling laws from the OpenAI team, the FLOPs budget recommended a 4.68 billion parameter model; DeepMind's approach recommended a 2.8 billion parameter model. The results from using the Chinchilla paper are in blue, and the results from using the original scaling laws are in orange. The Y-axis is the training loss, so the lower the better, and the X-axis represents the number of training tokens. You can see that if we stopped at the number of training tokens recommended by the original scaling laws, the orange curve is lower than the blue one, so it would appear to be the better-performing model. However, because the Chinchilla paper recommends training the model on more data, the blue Chinchilla curve ends up with an overall lower training loss than the orange scaling-laws curve.
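Here's a quick back-of-the-envelope version of that Gopher calculation. It uses the roughly 20-tokens-per-parameter ratio implied by Chinchilla's own numbers (1.4 trillion tokens for 70 billion parameters) together with the same 6 × parameters × tokens approximation as before. Both are my assumptions, so the result lands near, but not exactly on, the paper's 17.2 times and 5.9 trillion token figures, which come from fitted scaling curves.

```python
# Back-of-the-envelope check of the Gopher example. Assumptions (not the
# paper's fitted curves): ~20 tokens per parameter, as implied by Chinchilla
# (1.4T tokens / 70B params), and compute ≈ 6 * params * tokens.

TOKENS_PER_PARAM = 1.4e12 / 70e9     # ≈ 20

gopher_params = 280e9
gopher_tokens = 300e9                # roughly what Gopher was trained on

needed_tokens = TOKENS_PER_PARAM * gopher_params
extra_compute = needed_tokens / gopher_tokens   # the 6 * params factor cancels

print(f"Tokens to train a 280B model compute-optimally: ~{needed_tokens / 1e12:.1f}T")
print(f"Extra compute needed versus the actual run: ~{extra_compute:.0f}x")
```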
Similarly, if we plot the training loss versus the number of training FLOPs, or floating point operations, in the diagram on the right, you can see that we get a lower loss using the Chinchilla method in blue versus the scaling laws in orange. So it turns out that you can end up with a more performant model by using a smaller model with more training data. Let's wrap up the section by adding Chinchilla to our list of models. I'm also going to add Megatron-Turing and Gopher, even though we've only looked at them briefly. Our biggest takeaway is that, up to this point, large language models have been significantly undertrained. And from the table, Chinchilla is trained on more than four times as much data as any other large language model. It's the smallest, but it also has the best-performing results.

Lessons from Chinchilla have helped me at work. On a project for a customer, we were looking to train a model from scratch on a European language. We looked at the size of the model and the size of the training data set we had, and quickly realized that we would not get a performant model because we didn't have enough training data. So instead of kicking off a training run, we focused our efforts on getting more training data.
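Here's a minimal sketch of that kind of sanity check: given the tokens you actually have, how big a model can you train compute-optimally? The ~20 tokens-per-parameter ratio is the same assumption as before, and the dataset and model sizes below are made-up placeholders, not figures from the customer project.

```python
# Hypothetical sanity check: is our dataset big enough for the model we want?
# Assumes the ~20 tokens-per-parameter ratio implied by Chinchilla; the token
# and parameter counts below are placeholders, not real project numbers.

TOKENS_PER_PARAM = 20

available_tokens = 30e9     # e.g. 30B tokens collected for the target language
target_params = 7e9         # e.g. a 7B-parameter model we'd like to train

max_params = available_tokens / TOKENS_PER_PARAM
print(f"Largest compute-optimal model for this data: ~{max_params / 1e9:.1f}B params")

if target_params > max_params:
    shortfall = target_params * TOKENS_PER_PARAM - available_tokens
    print(f"Not enough data for a {target_params / 1e9:.0f}B model: "
          f"need ~{shortfall / 1e9:.0f}B more tokens, or a smaller model.")
```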