From the course: Advanced Quantization Techniques for Large Language Models

Quantizing your first Transformer model

In this demo, we're going to quantize our first transformer model and see concretely what changes when we move from a standard FP16 model to an 8-bit and a 4-bit quantized version. We'll work with GPT-2, using Hugging Face Transformers and the bitsandbytes library to handle quantization under the hood. The goal here isn't just to flip a few flags, but to build intuition: What do we gain in memory? What do we lose in quality? And how does latency change when we quantize? All right, so let's start at the top of the notebook. First, I have a check here for a GPU. If torch.cuda.is_available() returns False, we'll raise an error and remind you to switch to a Colab GPU runtime. You can do so by simply clicking here and then switching to a GPU runtime by clicking on "Change runtime type." For now, I'm connected to a GPU. Next, I install a bunch of dependencies, including transformers, accelerate, and bitsandbytes, and then define a few helper functions. The most important pieces are the model ID, which is GPT-2, and the device, which is CUDA, meaning we'll run things on the GPU. The memory helpers, get_process_memory_mb and get_gpu_memory_mb, use psutil and PyTorch's CUDA API under the hood to report CPU and GPU memory in megabytes. This is going to be really useful for seeing what utilization looks like. I also have describe_memory, a tiny wrapper that prints GPU and CPU utilization with a label, so it's clear what's going on. Next, I have compute_perplexity. This function is essentially a quality metric. We pass in a model, the tokenizer for that model, and some text. The function puts the model in evaluation mode, uses the input tokens as labels for causal language modeling, grabs the loss, and exponentiates it to get perplexity. Lower is better, meaning the model assigns higher probability to the text, so it understands it better.
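The perplexity calculation boils down to exponentiating the average per-token loss. Here's a minimal, framework-free sketch of that math (the helper name is hypothetical; the notebook's version gets the loss directly from the model):

```python
import math

def perplexity_from_log_probs(token_log_probs):
    # Cross-entropy loss = negative mean log-probability per token
    loss = -sum(token_log_probs) / len(token_log_probs)
    # Perplexity = exp(loss); lower means the model assigns
    # higher probability to the text
    return math.exp(loss)

# A model that assigns probability 0.5 to every token has perplexity ~2
print(perplexity_from_log_probs([math.log(0.5)] * 4))  # ≈ 2.0
```

The notebook computes the loss with the model itself, but the exponentiation step is exactly this.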
Next, I have the time_to_generate function, a helper that measures generation latency. Once you run this cell, you'll have all of the functions you need to compare different quantization configurations in a fair, measurable manner. First, let's take a look at the model in FP16 mode. To do so, I'll clear any leftover GPU memory so that we start fresh, then load the model with AutoModelForCausalLM.from_pretrained, specifying the data type as FP16. Then I'll use my data-type and memory-print helpers to print the memory footprint. So let's take a look once we run this; I already have the output here. As you can see, we're using a bunch of CPU memory and a bunch of GPU memory, the data type is float16, and the approximate model footprint is 249 megabytes. Next, we again go through the process of clearing GPU memory, and finally we load the 8-bit quantized model. At this point, we've removed the previous model from memory, so we'll be printing out fresh stats. To load an 8-bit model, you create a BitsAndBytesConfig and set load_in_8bit to True, and it will load the model in 8-bit for us. There are also a few other parameters you can specify: the llm_int8_threshold, for instance, controls how outliers, extreme values in the activations, are handled. Once we have this config, we can call from_pretrained and pass in the quantization config. This is an important step: you can specify a custom config this way using bitsandbytes. Once we run this, which I've already done here, you can see the CPU and GPU memory utilization. The approximate 8-bit footprint is 168 megabytes, compared to 249 megabytes earlier for the 16-bit model, which is a good reduction. Next, we'll repeat this process for the 4-bit model.
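Pieced together, the 8-bit loading step looks roughly like this. This is a sketch assuming the standard Transformers quantization API (it needs a GPU with bitsandbytes installed to actually run); the threshold value shown is the library's documented default, not something specific to this notebook:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "gpt2"

# 8-bit quantization config; llm_int8_threshold controls when
# outlier activation columns are kept in higher precision
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # the bitsandbytes default
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_8bit,  # the important step
    device_map="auto",  # place the weights on the GPU
)
```

The FP16 baseline is the same call with torch_dtype=torch.float16 instead of a quantization_config.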
In this case, I again specify a BitsAndBytesConfig, set load_in_4bit to True, and set the compute data type to FP16 for speed. The weights themselves, however, are quantized down to NF4 (normal float 4), a data type designed for normally distributed weights, which is typically the case for GPT-2-like models. Once we do this, we can again pass the quantization config to AutoModelForCausalLM and load the model in. So let's take a look at the memory here. The 4-bit footprint is approximately 127 megabytes. This is pretty close to, but still lower than, the 8-bit usage, which is interesting. Now, if you remember, we had a perplexity calculation. We can compare quality by using a short sample text about quantization and calling our compute_perplexity helper on each model. So I have some sample text here, which you're free to change, that describes what quantization does. Then we call compute_perplexity with the model, tokenizer, and sample text as inputs. When we do this, we see different perplexity numbers. FP16, which is our baseline, has the lowest perplexity of all the configurations. An interesting thing you'll notice is that while the bit width halves at each step, the increase in perplexity doesn't follow the same neat pattern. 8-bit has higher perplexity than 16-bit, which is understandable, and 4-bit has even higher perplexity than 8-bit, meaning we have the most information loss with 4-bit quantization. Finally, I have a few generations here. We can use our time_to_generate function to compare how long it takes each model to generate output. I have a starter prompt here, which you can again change, and when we run this, we see the generations happen. For each model, we see the output and the time it took to generate it.
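As a sanity check on these footprints, here's a rough back-of-envelope estimate, assuming GPT-2 small's parameter counts (~124.4M total) and that bitsandbytes leaves the embedding layers in FP16 while quantizing only the linear layers. The small gaps versus the measured numbers come from quantization constants and other overhead:

```python
# GPT-2 small parameter counts: token + position embeddings vs. the rest
embed_params = 50257 * 768 + 1024 * 768    # wte + wpe, assumed kept in FP16
other_params = 124_439_808 - embed_params  # linear layers, quantized

def footprint_mb(bytes_per_quantized_param):
    # Embeddings stay at 2 bytes (FP16); the rest shrink with quantization
    total_bytes = embed_params * 2 + other_params * bytes_per_quantized_param
    return total_bytes / 1e6

print(round(footprint_mb(2), 1))    # FP16: everything at 2 bytes -> ~248.9 MB
print(round(footprint_mb(1), 1))    # 8-bit -> ~163.8 MB (measured: 168)
print(round(footprint_mb(0.5), 1))  # 4-bit -> ~121.3 MB (measured: 127)
```

This also explains why 4-bit isn't half the size of 8-bit: the FP16 embeddings are a fixed cost that quantization doesn't touch.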
As you can see, FP16, which is our baseline, actually generated in a time quite close to the 4-bit model. FP16 is still the fastest when it comes to generation time, and 8-bit is actually quite slow; there are a number of reasons for this, which we won't get into now. However, keep in mind that generation time is only one metric. As we saw earlier, the model footprint shrank quite a bit while perplexity stayed close to the baseline. Taken together, these trade-offs make quantization quite useful for running your models on edge devices and other resource-constrained hardware.
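The latency comparison relies on a timing helper. Here's a minimal sketch of such a helper using only the standard library; the name time_to_generate mirrors the notebook's, but this version just times an arbitrary callable rather than wrapping model.generate directly:

```python
import time

def time_to_generate(generate_fn):
    # Run the callable once and report (result, elapsed seconds)
    start = time.perf_counter()
    result = generate_fn()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example: time a stand-in for a model.generate call
output, seconds = time_to_generate(lambda: "generated text")
print(f"Generated in {seconds:.4f}s: {output}")
```

For a fair comparison on GPU you would also call torch.cuda.synchronize() before reading the clock, since CUDA kernels launch asynchronously.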
