
Conversation

@sw (Contributor) commented Apr 15, 2023

This makes two somewhat tedious changes to ggml.c that should help us try out other tensor types, as has been discussed recently in various issues/PRs.

@ggerganov's argument against the latter, in #951 (comment), was:

It's easier to search for all the places that a type is used.
It does get annoying, so maybe we should reconsider. Overall, there is a lot of room for simplification and reduction of code duplication in ggml.c

My arguments for the change are:

  • shortens the file by >200 lines
  • makes diffs less noisy when you add a new type
  • the assert now also catches invalid values caused by memory corruption or other bugs, instead of the function silently exiting
  • "all the places that a type is used" -> it's not a "usage" in any practical sense as it just causes an abort().

I have not updated tests/test-quantize.c with QK, as that may be removed in pending #953.
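
As a rough illustration of the kind of refactor involved (my own sketch, not the actual diff): per-type switch statements with an aborting or silently-returning default case give way to per-type constants in a lookup table guarded by an assert. The function and table names below (ggml_blck_size, GGML_BLCK_SIZE) are used as examples of the affected code, not a verbatim excerpt.

```c
// Before (illustrative): a per-type switch with a default case, so an unknown
// or corrupted type value either aborts here or silently falls through.
//
//     switch (type) {
//         case GGML_TYPE_Q4_0: return QK;
//         case GGML_TYPE_Q4_1: return QK;
//         // ...
//         default: return 0;
//     }
//
// After (illustrative): per-type block size constants in a lookup table,
// guarded by an assert that also catches out-of-range values.

static const int GGML_BLCK_SIZE[GGML_TYPE_COUNT] = {
    [GGML_TYPE_F32]  = 1,
    [GGML_TYPE_F16]  = 1,
    [GGML_TYPE_Q4_0] = QK4_0,
    [GGML_TYPE_Q4_1] = QK4_1,
    // ...
};

int ggml_blck_size(enum ggml_type type) {
    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
    return GGML_BLCK_SIZE[type];
}
```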

@sw requested a review from ggerganov on April 15, 2023 at 15:43
@slaren (Member) commented Apr 15, 2023

I think this is a good change, but I am concerned that we won't be able to change QK without breaking backwards compatibility, and if we ever want to support a different value it will require much deeper changes anyway. I think this could be solved very easily with templates, but with C I am afraid that we may have to choose between converting everything into macro hell and accepting the runtime cost of a dynamic QK value. Or just never change QK at all.

@sw (Contributor, Author) commented Apr 15, 2023

I don't think the idea was to change QK for an existing type, but rather to add e.g. Q4_2, which would have its own QK42 != 32.

@slaren (Member) commented Apr 15, 2023

Right, I was thinking of @qwopqwop200's implementation, which showed some benefits from using a group size of 128 (if I understood that correctly). But as you say, that is a completely different use case; my bad.

@ggerganov (Member) left a comment

Wondering if QK4_0, QK4_1 and QK8_0 wouldn't be better - this way I can search for 4_1 for example and get all related functions, usages, etc.

@sw (Contributor, Author) commented Apr 15, 2023

Wondering if QK4_0, QK4_1 and QK8_0 wouldn't be better

Sure, let's do that. That way we can also support >10 variants without confusion ;-) (at least in the numbering, the size of ggml.c would be another matter)
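
For reference, the agreed naming would look something like the following minimal sketch; the value 32 is assumed here as the usual block size for these types, check ggml.c for the actual definitions.

```c
// Per-type block size constants, named QK<bits>_<variant>; values illustrative.
#define QK4_0 32
#define QK4_1 32
#define QK8_0 32
```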

@sw merged commit 0ad9646 into ggml-org:master on Apr 15, 2023
@sw deleted the ggml-refactor branch on April 15, 2023 at 16:25
SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
Short fix for the bug: ggml\src\ggml-cuda\cpy.cu:614: ggml_cuda_cpy_fn: unsupported type combination (q6_0 to f16), encountered when trying to use deepseek lite v2 with a quantized K cache. Note: I compile my IK_Llama with GGML_CUDA_F16.

To fix this, I added a cpy_blck_q_f16 function, devised by comparing cpy_blck_q8_0_f32 with cpy_blck_q8_0_f16 and transposing that difference onto the other legacy quants, using the cpy_blck_q_f32 function as the base. A "rule of three" of sorts.

Perplexity tests and inference now work consistently with -ctk q4_0, q4_1, q5_0 and q5_1 in that scenario, with the expected values and behavior.

The exception is Q6_0, which sees its perplexity multiplied by 100. (I suspect the CUDA dequantize_q6_0 is incompatible with this PR for some reason, but that's beyond what I can fix.)

-ctk iq4_nl, which doesn't yet have a dequantize_iq4_nl function, is not usable that way for now.
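
A minimal sketch of what such a quant-to-f16 block copy could look like, assuming helper names and a block layout along the lines of ggml's CUDA code; this is an illustration of the approach described above, not the actual ik_llama.cpp code, and the dequantize callback type and index layout are assumptions.

```cpp
#include <cuda_fp16.h>

// Assumed: a pair of dequantized values produced per callback invocation.
typedef float2 dfloat2;

// Assumed signature of a per-block dequantize callback (e.g. for q4_0).
typedef void (*dequantize_kernel_t)(const void * vx, int ib, int iqs, dfloat2 & v);

// Copy one quantized block to f16: same structure as the quant->f32 variant,
// with the float stores replaced by f16 stores.
template <dequantize_kernel_t dequant, int qk>
static __device__ void cpy_blck_q_f16(const char * cxi, char * cdsti) {
    half * dsti = (half *) cdsti;

    for (int j = 0; j < qk/2; j++) {
        dfloat2 dq;
        dequant(cxi, 0, j, dq);               // dequantize two values of the block
        dsti[j + 0]    = __float2half(dq.x);  // store as f16 instead of f32
        dsti[j + qk/2] = __float2half(dq.y);  // second-half offset layout assumed
    }
}
```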