Quantization improvements for k_quants #2707
Conversation
* Q3_K_S: use Q5_K for the first 2 layers of attention.wv and feed_forward.w2
* Q4_K_S: use Q6_K for the first 2 layers of attention.wv and feed_forward.w2
* Q2_K and Q3_K_M: use Q5_K instead of Q4_K for the first 2 layers of attention.wv and feed_forward.w2

This leads to a slight model size increase as follows:
Q2_K : 2.684G vs 2.670G
Q3_K_S: 2.775G vs 2.745G
Q3_K_M: 3.071G vs 3.057G
Q4_K_S: 3.592G vs 3.563G

LLaMA-2 PPL for context 512 changes as follows:
Q2_K : 6.6691 vs 6.8201
Q3_K_S: 6.2129 vs 6.2584
Q3_K_M: 6.0387 vs 6.1371
Q4_K_S: 5.9138 vs 6.0041

There are improvements for LLaMA-1 as well, but they are way smaller than the above.
For the same model size as the previous commit, we get PPL = 5.9069 vs 5.9138.
With it, we get PPL = 5.8828 for L2-7B Q4_K_S.
Smaller model, lower perplexity.
7B: file size = 2.632G, PPL = 6.3772 vs original 2.670G, PPL = 6.8201
13B: file size = 5.056G, PPL = 5.4577 vs original 5.130G, PPL = 5.7178
It is mostly Q3_K except for tok_embeddings, attention.wq, attention.wk, which are Q2_K.
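To make the mix described in these commits concrete, here is a minimal sketch of the kind of per-tensor type selection involved. This is not the actual llama.cpp code; `pick_type` and `KQuant` are made-up names used purely for illustration.

```cpp
// Hypothetical sketch only - not the actual llama.cpp quantization code.
// It illustrates the mix described in the commits above: the first 2 layers
// of attention.wv and feed_forward.w2 get a higher-precision k-quant.
#include <string>

enum class KQuant { Q2_K, Q3_K, Q4_K, Q5_K, Q6_K }; // made-up stand-in for ggml_type

// name:  tensor name, e.g. "layers.0.attention.wv.weight"
// layer: layer index parsed from the name
// base:  default type for the chosen file type (e.g. Q4_K for Q4_K_S)
static KQuant pick_type(const std::string & name, int layer, KQuant base) {
    const bool is_wv = name.find("attention.wv")    != std::string::npos;
    const bool is_w2 = name.find("feed_forward.w2") != std::string::npos;
    if ((is_wv || is_w2) && layer < 2) {
        // Q4_K_S bumps these tensors to Q6_K; Q2_K, Q3_K_S and Q3_K_M use Q5_K
        return base == KQuant::Q4_K ? KQuant::Q6_K : KQuant::Q5_K;
    }
    return base;
}
```

The real selection logic also special-cases tensors such as output.weight and the token embeddings, as the PR description further down explains.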
Would it be possible to add some kind of meta information to the gguf to be able to determine if it was generated using the improvements of this PR? Maybe something generic like a date/build number, or more specific like k-quants-v1.1 or something (whatever makes sense, but gguf now has easy extensibility).
I loled
Currently main will print the number of tensors of each quantization format.
Yes, but one might decide to change the quantization strategy, so even though all tensors are quantized with the same type, the result is still different. For instance, in this PR I have changed As it stands, when running with this PR,
#2710 adds the Feel free to extend the meta info further with version/date/commit/etc. As long as the added KV info is optional, we can extend it any way we like.
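For anyone who wants to check such a KV in a model file themselves, a rough sketch along these lines should work, assuming the `gguf_*` C API that ggml exposes (`gguf_init_from_file`, `gguf_find_key`, `gguf_get_val_u32`). The key name `general.quantization_version` is only an example here, not necessarily what #2710 writes.

```cpp
// Sketch: read an optional KV pair from a GGUF file using ggml's gguf API.
// Assumes the gguf_* functions declared in ggml.h around the time of this PR.
#include <cstdio>
#include "ggml.h"

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { fprintf(stderr, "failed to open %s\n", argv[1]); return 1; }

    // example key only - any optional KV can be probed the same way
    const int kid = gguf_find_key(ctx, "general.quantization_version");
    if (kid >= 0) {
        printf("quantization version: %u\n", gguf_get_val_u32(ctx, kid));
    } else {
        printf("no quantization version KV present (older file)\n");
    }

    gguf_free(ctx);
    return 0;
}
```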
@TheBloke - in case you didn't see this. Might be a reason to hold off on conversion for a bit if you haven't started yet.
Any benchmarks on models larger than 7B?
Looks fantastic! 6.0x for Q3 is amazing
Yes, I did comparisons for both 13B LLaMAs. But development was done on a branch that did not have the GGUF changes. When I was ready to submit the PR, I rebased on master, which brought in the GGUF changes, which changed the perplexity results. The change is actually quite dramatic for LLaMA-v2-13B:
Massive improvement with Q2_K it looks like
The part of the graphs that surprises me the most is that q4_1 and q5_1 have higher perplexity than their q4_0/q5_0 counterparts on LLaMA-v2. @TheBloke It makes me wonder whether you should even bother providing q4_1/q5_1 quantizations for LLaMA-v2 models, since they are bigger, slower, and lower quality. Maybe you could at least make a note on the READMEs that they are probably not useful.
The values in the help of the quantization tool were not updated. @ikawrakow
q4.x and q5.x should be banned already, as the k-quant models are just better in everything ...
I ran some performance tests. The most noticeable change is Q2_K, which is now 40% slower.
That seems surprising since this is a backward compatible change. You should be able to quantize with this version and then test with a version from before the pull was committed - if you do that, do you still see a large performance difference?
I cannot confirm a change in performance for
Hey guys, a couple of quick questions: When I run Is it correct that Q6_K has better perplexity than Q8_0? In which case there'd be no reason to include Q8_0 any more? Also I assume it must be a measurement error that Q6_K has better perplexity than FP16? :) Like one figure is from before GGUF and one after or something? Would that also affect the Q6_K vs Q8_0 figures? There used to be some text information displayed when
I suggest all
A PPL difference of +/- 0.001 is within the statistical noise for the number of tokens in Wikitext. In the case of LLaMA-v1-7B it happens that
The +/- ppl statistic may be confusing for normal users to understand. Printing the real ppl may be better?
As a user, what could you do with the raw ppl number except for subtracting it from some other value (like unquantized) to get a relative value? |
At least print the real PPL value of the unquantized F32.
OK thanks for the explanations! What is the feeling regarding Q6_K vs Q8_0? Is there enough of a statistically significant difference between Q8_0 and Q6_K to make it worthwhile including Q8_0 still? For example, do you have a Q8_0 figure for the Llama V2 7B case you mentioned?
Seems pretty reasonable, though I think it's still kind of hard for the user to do anything with. I was actually the one that added the additional information to the quantize tool and my first pass included a lot more stuff. Some of the stuff from this post: #406 (comment) (note, the values are outdated) One metric I think is actually pretty useful is % PPL increase relative to going from a 13B to 7B model. I think users that have messed with LLMs a bit will have some conception of the difference between a 13B and 7B model, so saying "this increases perplexity 50% as much as going from 13B to 7B" means more than
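As a sketch of how that "relative to the 13B -> 7B gap" metric could be computed (hypothetical helper; only the 7B numbers come from this PR, and the 13B reference PPL is a placeholder to be replaced with a measured value):

```cpp
// Hypothetical sketch of the "% of the 13B -> 7B PPL gap" metric described above.
#include <cstdio>

int main() {
    const double ppl_7b_f16  = 5.7963; // LLaMA-v2-7B fp16, from this PR's description
    const double ppl_7b_q    = 5.9069; // e.g. the improved Q4_K_S number quoted above
    const double ppl_13b_f16 = 5.10;   // placeholder - substitute a measured 13B fp16 PPL

    const double quant_loss = ppl_7b_q   - ppl_7b_f16;  // PPL added by quantization
    const double size_gap   = ppl_7b_f16 - ppl_13b_f16; // PPL gap between 7B and 13B

    printf("quantization costs %.0f%% of the 13B -> 7B gap\n", 100.0 * quant_loss / size_gap);
    return 0;
}
```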
The main use case I can think of is people who want to keep a high quality version of the model to requantize from, but don't want to keep the full 16-bit model around. I.e., when using the quantize tool with
Q8_0 could potentially be significantly faster than Q6_K if properly optimized, I'd think (especially if we did INT8 activations instead of converting to FP16). But I might be mistaken.
Yes. While experimenting with the k_quants refinement (PR #2707), at some point I tried using
Theoretically a new Q8_something could be added that does this extra work and is always better than
OK thanks very much! I will keep making Q8_0s then. I'm definitely dropping Q4_0, Q4_1, Q5_0 and Q5_1.
Theoretically, yes. In practice, it is not so easy to make sure that it always beats (or is at least the same) as
Isn't that only the case for consumer hardware? I'd expect tensor core INT8 inference to be significantly faster on A100 than the current setup with quantized mulmat.
On my P40, Q5_0 is about 9% faster at token generation than Q5_K_S for a negligible difference in perplexity and file size on LLaMA-v2-7b. Could you keep that one at least?
Consumer GPUs support INT4 and INT8 inference on tensor cores as well.
OK, I have most of the LLaMA-v2-70B results now. Did not (yet) do As a table:
Based on the 13B results, I guess we can expect the difference between the previous version and this pull to be very small, so not really worth comparing?
The PPL for LLaMA v2 70B F16 is Here is the full Metal run. Not sure why the estimated time is so off (~4 hours). It took just 1.2 hours.
Anything like a random CPU usage spike or swapping while the first block is running will throw off the whole time estimation calculation since it's only based on the first block time. I've seen it happen from time to time. |
I'm observing this on a regular basis. The very first time you load a model, the time estimate is off by a sizable margin. If you stop the process after getting the time estimate, on next run with the same model the time estimate is fairly reliable. |
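As a back-of-envelope illustration of that effect, under the assumption (from the comments above) that the ETA is just the first-chunk time scaled by the number of chunks; the per-chunk times below are invented to roughly reproduce the ~4 h estimate vs ~1.2 h actual mentioned earlier:

```cpp
// Back-of-envelope: an ETA extrapolated from the first chunk only.
// If the first chunk is slow (cold cache, swapping, a CPU spike),
// the whole estimate is scaled up by the same factor.
#include <cstdio>

int main() {
    const int    n_chunks      = 655;  // wikitext-2 test at context 512
    const double t_first_chunk = 22.0; // seconds - illustrative, inflated by a cold start
    const double t_steady      = 6.6;  // seconds per chunk once warmed up - illustrative

    printf("estimated: %.1f h, actual (roughly): %.1f h\n",
           t_first_chunk * n_chunks / 3600.0,
           t_steady      * n_chunks / 3600.0);
    return 0;
}
```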
Sorry, I missed this. I have a Tesla P40 24GB. I was comparing commit bac6699 ("PR") with commit 519c981 ("before"). I used ehartford/dolphin-llama2-7b because that's what I had on hand at the time. I compiled with I then wrote down the "eval time" t/s. I re-quantized and re-tested just to make sure, and for Q2_K I get 55.67 t/s before this PR and 33.32 t/s after - a difference of less than 0.2% from the previously provided numbers. Johannes has a few P40s, so he should be able to reproduce my results.



This PR improves k_quants perplexity scores by tweaking the quantization approach and quantization mixes. It is fully backward compatible (but obviously one needs to re-quantize the models to take advantage of these improvements).
The most significant gains are for `LLAMA_FTYPE_MOSTLY_Q2_K`, where perplexity is reduced by a significant margin while slightly reducing the model size (e.g., from 2.67 GiB to 2.63 GiB for 7B). See graphs below.

Significant improvements are also observed for `LLAMA_FTYPE_MOSTLY_Q3_K_M` and `LLAMA_FTYPE_MOSTLY_Q4_K_S` for LLaMA-v2-7B. This comes at the expense of a slightly increased model size (e.g., at 7B, 3.59 GiB vs 3.56 GiB for `Q4_K_S` and 3.07 GiB vs 3.06 GiB for `Q3_K_M`).

Other quantization types / models are slightly better for LLaMA-v2 (but the change is much smaller compared to those mentioned above), or basically the same for LLaMA-v1.

Note on `LLAMA_FTYPE_MOSTLY_Q2_K`: strictly speaking, this is now mostly a `Q3_K` quantization. All tensors are quantized using `Q3_K`, except for attention `K` and `Q`, which are `Q2_K`, and `output.weight`, which is `Q6_K` as usual. I considered naming it `LLAMA_FTYPE_MOSTLY_Q3_K_XS` or similar, but given that this model is smaller and better than the previous `LLAMA_FTYPE_MOSTLY_Q2_K` (so the existing `Q2_K` model would have been useless in comparison), I decided that it is simpler to just re-use the `LLAMA_FTYPE_MOSTLY_Q2_K` designation for this new quantization mix.

The following graph shows perplexity vs model size for the LLaMA-v2-7B model and a context length of 512. Black dots/lines are for current master (i.e., after the merge of the `GGUF` related changes). Red dots/lines depict the results of this PR. Results for `Q4_0`, `Q4_1`, `Q5_0` and `Q5_1` on current master are shown in blue for comparison. The perplexity of the `fp16` model is 5.7963. The new `Q6_K` quantization arrives at 5.8067 (so, 0.18% higher) compared to 5.8118 (0.27% higher) on master.

The following graph is the same as the above, but with a smaller plot range to better appreciate the perplexity differences in the 4-6 bit quantization range.
Similar to the above graphs, but for the LLaMA-v1-7B model.
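For reference, the "percent higher than fp16" figures quoted above are just the relative PPL difference; a quick check with the numbers from the description:

```cpp
// Quick check of the "percent higher than fp16" figures quoted above.
#include <cstdio>

int main() {
    const double ppl_fp16       = 5.7963;
    const double ppl_q6k_pr     = 5.8067; // Q6_K with this PR
    const double ppl_q6k_master = 5.8118; // Q6_K on master

    printf("PR:     +%.2f%%\n", 100.0 * (ppl_q6k_pr     - ppl_fp16) / ppl_fp16); // ~0.18%
    printf("master: +%.2f%%\n", 100.0 * (ppl_q6k_master - ppl_fp16) / ppl_fp16); // ~0.27%
    return 0;
}
```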