From the course: Introduction to Large Language Models

Making large language models follow instructions

- [Instructor] We've seen the problems with just a base large language model: it simply doesn't follow our instruction to create a shopping list out of the box. So how do we go about creating a large language model that will follow the instructions we give it? In 2022, the OpenAI team released a paper called "Training Language Models to Follow Instructions with Human Feedback," and its approach is still the industry standard. There are two components to this training: supervised fine-tuning, and RLHF, or reinforcement learning from human feedback.

Let's head over to the paper and take a look at the supervised training in the diagram on the left. The OpenAI research team would create a prompt, for example, "Explain the moon landing to a six-year-old," and then a labeler, that is, a person skilled at working with text data, would write out what the model should produce as output. For example, they might include details like: it took place in 1969, the mission was Apollo 11, and two American astronauts, Neil Armstrong and Buzz Aldrin, became the first human beings to walk on the moon. The researchers collected tens of thousands of such prompts and then fine-tuned a model. What they mean by fine-tuning is that they would pass both the prompt and the expected output written by the labeler for each prompt to the large language model, and train on them. This means that after some training, for a given prompt, the large language model gets better at producing output that resembles what a labeler would have written. So that's the supervised fine-tuning part.

Let's move on to RLHF. Say there's another task, for example, "Summarize the following news article," where the news article is about research on parrots and the sounds they make. The researchers would get the fine-tuned model to generate five different summaries of the article. The labelers would then rate each of the five summaries on a scale of one to seven, using an interface like this. The researchers also wanted to make sure the models follow instructions and do not generate toxic content or fabricate information, so the labelers gave yes/no answers to questions that address this, over on the right: did the model fail to follow the task, did the output contain inappropriate content, and so on. They ended up with a ranking of the five different summaries that looked something like this. In this labeling interface, the bottom three boxes are the parrot-article summaries that have already been ranked, and the top two are the ones the labeler will work on next.

We're now in the middle of the diagram, at step two. The rankings of the summaries are used to train another model, called the reward model. The reward model is trained on these comparisons so that, given a prompt and a generated text, it outputs a single number that represents how strongly the labelers would prefer that output.

The final step, over on the right, was to use reinforcement learning to optimize the original language model using the reward model. If the model generates text that follows the task intent and is truthful and non-toxic, it's rewarded. That's the crux of reinforcement learning: the model is encouraged to generate text that would receive positive human feedback.
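To make the two training objectives a bit more concrete, here is a minimal PyTorch sketch. It is illustrative only, not the paper's actual code: the prompt, demonstration text, and scalar scores are made-up stand-ins, but the ranking loss shown is the standard pairwise comparison objective the paper describes for the reward model.

```python
# Illustrative sketch of the two objectives described above (not InstructGPT code).
import torch
import torch.nn.functional as F

# --- 1. Supervised fine-tuning: prompt + labeler-written demonstration ---
# Each training example pairs a prompt with the output a labeler wrote by hand.
sft_example = {
    "prompt": "Explain the moon landing to a six-year-old.",
    "demonstration": "In 1969, the Apollo 11 mission carried two American "
                     "astronauts, Neil Armstrong and Buzz Aldrin, to the moon...",
}
# Fine-tuning concatenates prompt and demonstration and minimizes the usual
# next-token cross-entropy loss, so the model learns to reproduce the
# labeler-style completion for that prompt.

# --- 2. Reward model: learn the labelers' ranking ---
# For two competing outputs where the labeler preferred one over the other,
# the reward model is trained so the preferred ("chosen") output scores higher.
def reward_ranking_loss(score_chosen: torch.Tensor,
                        score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example scalar scores the reward model might assign to two summaries.
loss = reward_ranking_loss(torch.tensor([1.8]), torch.tensor([0.3]))
print(loss)  # small when the preferred summary already scores higher
```

During the reinforcement learning step, the trained reward model's score is what serves as the reward signal for the language model.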
PPO, or proximal policy optimization, is the algorithm used to calculate the loss that updates the original language model. So let's take our earlier example of a shopping list and see the difference with a model that has been trained with supervised fine-tuning and RLHF. I'll use text-davinci-003 as an example, so let me go ahead and clear the prompt, type "Write a shopping list," and see the output we get. And this is excellent, exactly what we would want: we've got bread, milk, eggs, cheese, ground beef, cereal, and so on. Alright, so we've looked at what supervised fine-tuning and RLHF do. This combination of fine-tuning and reinforcement learning from human feedback has produced large language models that are much better at following instructions. Go ahead and head over to the OpenAI Playground, give a model a couple of tasks, and see how it does.
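If you'd rather try this from code than from the Playground, here is a minimal sketch using the OpenAI Python SDK. Note that text-davinci-003, the model used in the video, has since been retired from the API, so this assumes a current instruction-tuned chat model; "gpt-4o-mini" is just an example name, and you should swap in whichever model your account can access.

```python
# Minimal sketch: asking an instruction-tuned OpenAI model for a shopping list.
# Assumes the openai package is installed and the OPENAI_API_KEY environment
# variable is set. The model name below is an example, not a recommendation.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a shopping list"}],
)

print(response.choices[0].message.content)
# An instruction-tuned model returns an actual list (bread, milk, eggs, ...),
# whereas the base model we saw earlier simply continued the text of the prompt.
```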