From the course: Complete Guide to Evaluating Large Language Models (LLMs)
Metrics for fine-tuning success
- At this point, you're probably sick of the app reviews dataset, but I'm going to ask you to hang in there for a little bit longer. We've already talked about how loss and accuracy are ways of evaluating fine-tuning, particularly while fine-tuning is actually happening. And that's still true: we want loss to go down, and we want metrics like accuracy to go up. I also hope I've beaten into your head the idea that at the end of the day, it's the testing set metrics that matter the most: the test set accuracy, the test set recall, precision, and all that. However, fine-tuning is a pretty vague term that opens up a lot of possibilities. We're going to start by talking about how you evaluate whether your fine-tuning is working as optimally as possible. And to understand that, we'll have to look at some different techniques for performing fine-tuning. One of the most useful techniques in fine-tuning, I would argue, is called dynamic padding. The idea is that when you fine-tune a model, you're passing in different batches of data at any given moment, and the model expects that the individual data points within a batch are all the same length. There are two ways to solve this. You can either pad everything to the exact same length at the very beginning, or you can do it on a per-batch basis, meaning for every batch you're about to pass into the model, you pad the sequences to match each other, even though different batches may end up with different lengths. Dynamic padding, which is the latter, padding at batch time right before you pass the data into the model, will often reduce both the memory and the time it takes to do the fine-tuning.
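To make the idea concrete, here is a minimal sketch of dynamic padding using plain Python lists of token IDs. The pad token ID and the sequences are made up for illustration; in practice a library collator (such as Hugging Face's `DataCollatorWithPadding`) would do this for you.

```python
# Minimal sketch of dynamic padding: each batch is padded only to the
# length of its own longest sequence, not to a global maximum length.

PAD_ID = 0  # hypothetical padding token id

def pad_batch(batch):
    """Pad every sequence in one batch to that batch's max length."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

# Two batches whose longest sequences differ in length.
batch_a = [[5, 9], [7, 2, 4]]          # longest sequence: 3 tokens
batch_b = [[1], [3, 8, 6, 2], [4, 4]]  # longest sequence: 4 tokens

padded_a = pad_batch(batch_a)
padded_b = pad_batch(batch_b)

# With static padding, both batches would be padded to the global max (4).
# With dynamic padding, batch_a only needs length 3 -- less wasted compute.
print(padded_a)  # [[5, 9, 0], [7, 2, 4]]
print(padded_b)  # [[1, 0, 0, 0], [3, 8, 6, 2], [4, 4, 0, 0]]
```

Note that the two padded batches have different sequence lengths (3 and 4), which is exactly the property that saves memory and time compared to padding everything to one global maximum.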
Now, a case study from my book will pit different fine-tuning techniques against each other to show the difference in the fine-tuning process. One thing to note going in: in this case there are six different experiments, and all six experiments yielded the same model performance at the end. So effectively, they all produced the same model. What we're evaluating is the fine-tuning process itself, and I do this through four different techniques: changing the batch size, which is how many data points go into the model at any given moment; gradient accumulation, which basically asks, "How many batches should I wait before I actually update the model's parameters?"; mixed precision, which says that if I lower the precision of our gradients, say 16-bit instead of 32-bit, I should save some memory; and dynamic padding, the thing I just mentioned. Now, I could talk about fine-tuning techniques for a long time, but I won't. What I will tell you is that by changing which techniques you use, you can increase or decrease the amount of time and memory it takes to fine-tune. For example, on the left is vanilla training, meaning a batch size of one, which is kind of ludicrous, but just to show the example here: one data point at a time and nothing else different. For this example, it took 80 seconds to run the fine-tuning. Clearly it was a very toy example, but the point still stands: 80 seconds to run, and roughly 25% of the GPU memory was used. If I change the batch size to four, the second set of bars, and do nothing else, look what happens: memory jumps to 40%, but time drops to 40 seconds.
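Gradient accumulation is the least self-explanatory of these techniques, so here is a toy sketch of it on a one-parameter model. The learning rate, data, and `accum_steps` value are all illustrative; the point is only the control flow, where gradients from several batches are accumulated and the parameter is updated once per group of batches.

```python
# Sketch of gradient accumulation on a toy 1-parameter model y = w * x:
# gradients from several small batches are averaged, and the parameter
# is updated only once every `accum_steps` batches. All numbers are
# illustrative.

def gradient_of_batch(w, batch):
    """Gradient of mean squared error for y = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
lr = 0.1
accum_steps = 4      # wait 4 batches before each parameter update
accumulated = 0.0
updates = 0

batches = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)], [(1.0, 2.0)]] * 2

for step, batch in enumerate(batches, start=1):
    # Scale each batch gradient so the accumulated sum is an average.
    accumulated += gradient_of_batch(w, batch) / accum_steps
    if step % accum_steps == 0:
        w -= lr * accumulated   # one "real" optimizer step
        accumulated = 0.0
        updates += 1

print(updates)  # 2 parameter updates for 8 batches
```

The effect is the same as training with a 4x larger batch, but only one small batch ever has to fit in memory at a time, which is why it's a memory-saving technique.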
So okay, changing one little fine-tuning setting, which again yields the same model in the end, so accuracy and all those metrics are the same. But if I have limited GPU memory, I need to keep this in consideration. And if I want to be able to run, in this example at least, twice as many experiments, I might increase the batch size to four, halving the time it takes to do fine-tuning, but then I want my memory usage to come back down. Different ways of bringing the memory down would be things like mixed precision, dynamic padding, and gradient accumulation. If you look all the way at the end, if I employ all of these fine-tuning techniques, I get four times faster fine-tuning, from 80 seconds down to 20, at roughly the same GPU memory. So if I'm okay with using that much memory but I just want my fine-tuning to be faster, I need to employ a mix of fine-tuning techniques. It's not just going to be one of these techniques; often it has to be all of them to get the final benefit you're looking for. I might increase the batch size, which increases memory usage, but then turn on gradient accumulation, mixed precision, and dynamic padding, which save on memory. So I come out at a net zero on memory, but I've quadrupled the speed of my fine-tuning. Now we're not just evaluating a model on a task; we're also evaluating our own experimentation process. There are also other ways of altering our fine-tuning to get better or worse results. One example is freezing. Freezing is actively deciding not to update certain parameters during fine-tuning in an attempt to save memory and time, generally at the cost of performance. So the more you freeze, the worse your model will tend to become in the end, but you can drastically decrease the time it takes to do that fine-tuning.
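Here is a small sketch of what freezing amounts to, using a made-up dictionary of layers with hypothetical parameter counts. In a real framework this is usually a flag per parameter (e.g. setting `requires_grad = False` in PyTorch); the idea is the same: frozen layers are simply skipped during updates.

```python
# Sketch of freezing: each layer gets a `trainable` flag, and the
# optimizer only touches layers that are not frozen. Layer names and
# parameter counts are made up for illustration.

layers = {
    "embedding":  {"num_params": 1_000_000, "trainable": True},
    "encoder_1":  {"num_params": 500_000,   "trainable": True},
    "encoder_2":  {"num_params": 500_000,   "trainable": True},
    "classifier": {"num_params": 10_000,    "trainable": True},
}

def freeze_all_but(model, keep):
    """Freeze every layer except the ones named in `keep`."""
    for name, layer in model.items():
        layer["trainable"] = name in keep

def trainable_params(model):
    return sum(l["num_params"] for l in model.values() if l["trainable"])

total = trainable_params(layers)            # all 2,010,000 params updated
freeze_all_but(layers, keep={"classifier"}) # freeze everything but the head
remaining = trainable_params(layers)        # only 10,000 still updated

print(total, remaining)
```

Going from updating roughly two million parameters to ten thousand is where the time and memory savings come from, and also why performance tends to drop: most of the model can no longer adapt to your task.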
So again, if it comes down to needing to run 20 experiments without much time, freezing would be one way to trade some performance for the benefit of running more experiments. It's always a push and pull when it comes to fine-tuning techniques like this. Then there's the act of using fine-tuning to create more efficient models. One of those tactics is called model distillation. Before LLMs even existed, model distillation was a method in machine learning where you would train a smaller, more efficient model, which we often refer to as the student model, to reproduce the behavior of a larger, more complex model, often known as the teacher model. The idea is that you first train a large model, and then you distill that large model's learning into a smaller version. You can read more about distillation in my book; I have here the two types of model distillation, task-agnostic and task-specific. I will instead turn immediately to an example of doing distillation. In the book I use a dataset called GoEmotions, which involves classifying 58,000 Reddit comments into 27 emotion categories, with the most common label being neutral. Then we have approval, annoyance, gratitude, disapproval, and so on. So the idea is classification: given a piece of text, which emotions, and it could be plural, are being exhibited in this piece of text? A surprisingly difficult task for a lot of LLMs, for what it's worth. Now, again, you can look at the full code on how to do all of this. Frankly, it would take me roughly an hour to walk through the code itself, but I can show you the final results. I have three models here.
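The heart of distillation is the loss that pushes the student toward the teacher's outputs. Here is a pure-Python sketch of the classic soft-target formulation, where the teacher's logits are softened with a temperature and the student is penalized for diverging from them. The logits and temperature here are made-up illustrative numbers, not the book's actual setup.

```python
import math

# Sketch of the core of distillation: the student is trained against the
# teacher's softened probability distribution (soft targets), using a
# temperature T > 1 to expose more of the teacher's relative preferences.
# All logits below are made up for illustration.

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between teacher soft targets and student predictions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

teacher = [4.0, 1.0, 0.5]        # teacher is confident in class 0
good_student = [3.5, 1.2, 0.4]   # roughly mimics the teacher
bad_student = [0.5, 3.0, 2.0]    # disagrees with the teacher

# A student that mimics the teacher gets a lower loss.
print(distillation_loss(teacher, good_student) <
      distillation_loss(teacher, bad_student))  # True
```

In task-specific distillation this loss is typically combined with the ordinary supervised loss on the task labels, which is part of why it tends to outperform task-agnostic distillation.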
The teacher model, the task-agnostic model, and the task-specific model. Spoiler alert: task-specific distillation tends to be much more performant than task-agnostic distillation, because in task-specific distillation you are updating the student with task-relevant inputs. But the point here is that our student model, our smaller model, is exhibiting even better performance than our teacher. We're using accuracy to test whether the teacher is better or worse than the student, and accuracy is telling us the student has surpassed the teacher. Great. But beyond just accuracy, which by now is the metric of choice for classification for most people, we can also see that the student model, which had a better accuracy, also has better memory consumption and better latency. The task-specific student model consumes roughly six times less memory than the teacher model. It also has much better latency, as the bottom graph is showing: it is roughly six to seven times faster than the teacher model, again, because it's smaller. So our fine-tuning can not just yield better models in the sense of task performance; it should also hopefully yield more efficient models. Even if the task-specific student model were slightly worse than the teacher, as is relatively common in distillation, it might still be worth it to have a model that is six times smaller, more efficient, and faster, depending on where this model is going. If it's going in a browser or on a phone, you might need to use distillation to get a smaller model that might not be as good as the bigger one but is much more efficient. So fine-tuning techniques and tactics don't just yield hopefully better task performance; they should also work to yield more efficiency in our systems.
That includes both the actual act of fine-tuning, through things like dynamic padding, batch size, and gradient accumulation, and also using fine-tuning to yield a net-new model that is supposed to be just as performant, but consumes a lot less memory and is a lot faster, simply by the act of fine-tuning itself.