Refactor ggml.c for future tensor types #1001
Conversation
I think this is a good change, but I am concerned that we won't be able to change QK without breaking backwards compatibility, and if we ever want to support a different value it will require much deeper changes anyway. I think this could be very easily solved with templates, but with C I am afraid that we may have to choose between converting everything into macro hell or accepting the runtime cost of a dynamic QK value. Or just never change QK at all.
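To make that trade-off concrete, here is a minimal C sketch with hypothetical helper names (not ggml.c code): a block size baked in at compile time via a macro, versus one passed in at runtime.

```c
/* Hypothetical sketch of the trade-off described above; these helpers are
 * not part of ggml.c. */

/* Compile-time QK: the loop bound is a constant the compiler can unroll and
 * vectorize, but every new block size needs another instantiation. */
#define DEFINE_VEC_DOT(QK)                                              \
    static float vec_dot_qk##QK(const float * x, const float * y) {    \
        float sum = 0.0f;                                               \
        for (int i = 0; i < (QK); i++) {                                \
            sum += x[i] * y[i];                                         \
        }                                                               \
        return sum;                                                     \
    }

DEFINE_VEC_DOT(32)
DEFINE_VEC_DOT(128)

/* Runtime QK: one function covers every block size, at the cost of a dynamic
 * loop bound that the compiler cannot fully specialize. */
static float vec_dot_dyn(const float * x, const float * y, int qk) {
    float sum = 0.0f;
    for (int i = 0; i < qk; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```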
I don't think the idea was to change QK.
Right, I was thinking of @qwopqwop200's implementation that showed some benefits from using a group size of 128 (if I understood that correctly). But as you say, that is a completely different use case, my bad.
ggerganov left a comment:
Wondering if QK4_0, QK4_1 and QK8_0 wouldn't be better - this way I can search for 4_1 for example and get all related functions, usages, etc.
Sure, let's do that. That way we can also support >10 variants without confusion ;-) (at least in the numbering, the size of ggml.c would be another matter)
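For illustration only, the naming scheme under discussion would look roughly like this; the 32-element block size and the simplified struct fields are assumptions based on the current ggml.c, not part of this thread.

```c
#include <stdint.h>

// Per-type block-size constants under the proposed naming (values assumed).
#define QK4_0 32
#define QK4_1 32
#define QK8_0 32

// Searching for "4_1" then finds the constant, the block struct and every
// function that uses them, e.g. (fields simplified):
typedef struct {
    float   d;              // scale
    float   m;              // min
    uint8_t qs[QK4_1 / 2];  // 4-bit quants, two per byte
} block_q4_1;
```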
Short: fixes the bug ggml\src\ggml-cuda\cpy.cu:614: ggml_cuda_cpy_fn: unsupported type combination (q6_0 to f16), encountered when trying to use DeepSeek V2 Lite with a quantized K cache. Note: I compile my IK_Llama with GGML_CUDA_F16.

To fix this, I added a cpy_blck_q_f16 function, devised by comparing cpy_blck_q8_0_f32 and cpy_blck_q8_0_f16 and transposing the difference onto the other legacy quants on the basis of the cpy_blck_q_f32 function. A "rule of three" of sorts.

Perplexity tests and inference now work consistently with -ctk q4_0, q4_1, q5_0 and q5_1 in that scenario, with the expected values and behavior. The exception is Q6_0, which sees its perplexity multiplied by 100 (I suspect the CUDA dequantize_q6_0 is incompatible with this PR for some reason, but that's beyond what I can fix). -ctk iq4_nl, which doesn't yet have a dequantize_iq4_nl function, is not usable this way for now.
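As a rough sketch of the helper being described, assuming the templated style of the existing cpy_blck_q_f32 in ggml-cuda/cpy.cu and the dequantize_kernel_t callbacks it relies on (signatures not verified against the actual patch):

```cuda
// Hedged sketch only; relies on types (dfloat2, dequantize_kernel_t, half)
// already declared in the ggml-cuda headers, and is not the literal change.
template <dequantize_kernel_t dequant, int qk>
static __device__ void cpy_blck_q_f16(const char * cxi, char * cdsti) {
    half * cdsth = (half *) cdsti;

#pragma unroll
    for (int j = 0; j < qk/2; j++) {
        dfloat2 dq;                 // two dequantized values per call
        dequant(cxi, 0, j, dq);     // e.g. dequantize_q4_0, dequantize_q5_1, ...
        cdsth[j]        = dq.x;     // first half of the block
        cdsth[j + qk/2] = dq.y;     // second half, offset by qk/2 as in cpy_blck_q_f32
    }
}
```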
This makes two somewhat tedious changes to ggml.c that should help us try out other tensor types, as has been discussed recently in various issues/PRs:
- A distinct QK for each type, in the expectation that there may be different block sizes (e.g. Q4 with FP16 + 16 quants, see #995 "Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors", or Q2/Q3)
- A default case for all the compute functions

The argument against the latter by @ggerganov was #951 (comment):
My arguments for the change are:

- an unsupported type now reaches an explicit default case and fails loudly with abort().

I have not updated tests/test-quantize.c with the per-type QK, as that may be removed in pending #953.
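A minimal sketch of what the default case looks like in a compute dispatch, assuming a simplified type enum and a hypothetical function name (the real ggml.c code differs):

```c
#include <stdio.h>
#include <stdlib.h>

// Simplified stand-in for ggml's type enum, for illustration only.
enum example_type { TYPE_F32, TYPE_F16, TYPE_Q4_0, TYPE_Q4_1 };

// Hypothetical compute function: any type without an implementation hits the
// default branch and aborts loudly, instead of silently doing nothing.
static void compute_forward_example(enum example_type type) {
    switch (type) {
        case TYPE_F32:
            /* float path */
            break;
        case TYPE_Q4_0:
            /* Q4_0 path */
            break;
        case TYPE_Q4_1:
            /* Q4_1 path */
            break;
        default:
            fprintf(stderr, "%s: unsupported type %d\n", __func__, (int) type);
            abort();
    }
}
```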