Refactor ggml.c for future tensor types #1001
Conversation
I think this is a good change, but I am concerned that we won't be able to change QK without breaking backwards compatibility, and if we ever want to support a different value it will require much deeper changes anyway. I think this could be very easily solved with templates, but with C I am afraid that we may have to choose between converting everything into macro hell or accepting the runtime cost of a dynamic QK value. Or just never change QK at all.
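To make that trade-off concrete, here is a minimal C sketch with hypothetical helper names (not ggml.c code): a block size baked in at compile time via a macro, versus one passed in at runtime.

```c
/* Hypothetical sketch of the trade-off described above; these helpers are
 * not part of ggml.c. */

/* Compile-time QK: the loop bound is a constant the compiler can unroll and
 * vectorize, but every new block size needs another instantiation. */
#define DEFINE_VEC_DOT(QK)                                              \
    static float vec_dot_qk##QK(const float * x, const float * y) {    \
        float sum = 0.0f;                                               \
        for (int i = 0; i < (QK); i++) {                                \
            sum += x[i] * y[i];                                         \
        }                                                               \
        return sum;                                                     \
    }

DEFINE_VEC_DOT(32)
DEFINE_VEC_DOT(128)

/* Runtime QK: one function covers every block size, at the cost of a dynamic
 * loop bound that the compiler cannot fully specialize. */
static float vec_dot_dyn(const float * x, const float * y, int qk) {
    float sum = 0.0f;
    for (int i = 0; i < qk; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```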
I don't think the idea was to change QK.
Right, I was thinking of @qwopqwop200's implementation that showed some benefits from using a group size of 128 (if I understood that correctly). But as you say, that is a completely different use case, my bad.
ggerganov left a comment:
Wondering if QK4_0, QK4_1 and QK8_0 wouldn't be better - this way I can search for 4_1 for example and get all related functions, usages, etc.
Sure, let's do that. That way we can also support >10 variants without confusion ;-) (at least in the numbering, the size of ggml.c would be another matter)
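For illustration only, the naming scheme under discussion would look roughly like this; the 32-element block size and the simplified struct fields are assumptions based on the current ggml.c, not part of this thread.

```c
#include <stdint.h>

// Per-type block-size constants under the proposed naming (values assumed).
#define QK4_0 32
#define QK4_1 32
#define QK8_0 32

// Searching for "4_1" then finds the constant, the block struct and every
// function that uses them, e.g. (fields simplified):
typedef struct {
    float   d;              // scale
    float   m;              // min
    uint8_t qs[QK4_1 / 2];  // 4-bit quants, two per byte
} block_q4_1;
```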
Short: fixes the bug ggml\src\ggml-cuda\cpy.cu:614: ggml_cuda_cpy_fn: unsupported type combination (q6_0 to f16), encountered when trying to use DeepSeek V2 Lite with a quantized K cache. Note: I compile my IK_Llama with GGML_CUDA_F16.

To fix this, I added a cpy_blck_q_f16 function, devised by comparing cpy_blck_q8_0_f32 and cpy_blck_q8_0_f16 and transposing the difference onto the other legacy quants on the basis of the cpy_blck_q_f32 function. A "rule of three" of sorts.

Perplexity tests and inference now work consistently with -ctk q4_0, q4_1, q5_0 and q5_1 in that scenario, with the expected values and behavior. The exception is Q6_0, which sees its perplexity multiplied by 100 (I suspect the CUDA dequantize_q6_0 is incompatible with this PR for some reason, but that's beyond what I can fix). -ctk iq4_nl, which doesn't yet have a dequantize_iq4_nl function, is not usable this way for now.
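As a rough sketch of the helper being described, assuming the templated style of the existing cpy_blck_q_f32 in ggml-cuda/cpy.cu and the dequantize_kernel_t callbacks it relies on (signatures not verified against the actual patch):

```cuda
// Hedged sketch only; relies on types (dfloat2, dequantize_kernel_t, half)
// already declared in the ggml-cuda headers, and is not the literal change.
template <dequantize_kernel_t dequant, int qk>
static __device__ void cpy_blck_q_f16(const char * cxi, char * cdsti) {
    half * cdsth = (half *) cdsti;

#pragma unroll
    for (int j = 0; j < qk/2; j++) {
        dfloat2 dq;                 // two dequantized values per call
        dequant(cxi, 0, j, dq);     // e.g. dequantize_q4_0, dequantize_q5_1, ...
        cdsth[j]        = dq.x;     // first half of the block
        cdsth[j + qk/2] = dq.y;     // second half, offset by qk/2 as in cpy_blck_q_f32
    }
}
```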
This makes two somewhat tedious changes to ggml.c that should help us try out other tensor types, as has been discussed recently in various issues/PRs:
- A distinct QK for each type, in the expectation that there may be different block sizes (e.g. Q4 with FP16 + 16 quants, see #995 "Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors", or Q2/Q3)
- A default case for all the compute functions

The argument against the latter by @ggerganov was #951 (comment):
My arguments for the change are:

- an unsupported type now reaches an explicit default case and fails loudly with abort().

I have not updated tests/test-quantize.c with the per-type QK, as that may be removed in pending #953.
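A minimal sketch of what the default case looks like in a compute dispatch, assuming a simplified type enum and a hypothetical function name (the real ggml.c code differs):

```c
#include <stdio.h>
#include <stdlib.h>

// Simplified stand-in for ggml's type enum, for illustration only.
enum example_type { TYPE_F32, TYPE_F16, TYPE_Q4_0, TYPE_Q4_1 };

// Hypothetical compute function: any type without an implementation hits the
// default branch and aborts loudly, instead of silently doing nothing.
static void compute_forward_example(enum example_type type) {
    switch (type) {
        case TYPE_F32:
            /* float path */
            break;
        case TYPE_Q4_0:
            /* Q4_0 path */
            break;
        case TYPE_Q4_1:
            /* Q4_1 path */
            break;
        default:
            fprintf(stderr, "%s: unsupported type %d\n", __func__, (int) type);
            abort();
    }
}
```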