
Conversation

@Pringled
Member

This PR adds the `vocabulary_quantization` and `embedding_dtype` used for quantization to the config, which makes it easier to see the precision of the model and (optionally) the number of vocabulary clusters directly from the config.

E.g.:

{
    "model_type": "model2vec",
    "architectures": [
        "StaticModel"
    ],
    "tokenizer_name": "baai/bge-base-en-v1.5",
    "apply_pca": 256,
    "apply_zipf": true,
    "hidden_dim": 256,
    "seq_length": 1000000,
    "normalize": true,
    "vocabulary_quantization": 128,
    "embedding_dtype": "float16"
}
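
For illustration, a minimal sketch of reading those fields from a saved model directory (the directory name is hypothetical; with this change the precision is visible without loading the safetensors):

    import json
    from pathlib import Path

    # Hypothetical path to a saved Model2Vec model directory.
    model_dir = Path("my-quantized-model")

    # The saved config now exposes the quantization settings directly.
    config = json.loads((model_dir / "config.json").read_text())
    print(config.get("embedding_dtype"))          # e.g. "float16"
    print(config.get("vocabulary_quantization"))  # e.g. 128, if vocabulary-quantized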
@Pringled requested a review from stephantul on September 11, 2025 10:47.
@codecov
codecov bot commented Sep 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines    Coverage Δ
model2vec/hf_utils.py       80.34% <100.00%> (+0.87%) ⬆️
model2vec/model.py          95.45% <100.00%> (+0.16%) ⬆️
@stephantul
Contributor

The model can also be vocabulary-quantized during distillation. You actually don't need to add it to the config to know that the model is vocabulary-quantized.

The recipe is as follows: if the number of tokens is not equal to the number of embeddings, then the vocabulary quantization value is equal to the number of embeddings. So you can skip storing it until the model is saved and calculate it on the fly. This is why I'm hesitant to add it to the config: the stored value can only go out of date. Same with the embedding dtype, I guess? It can only become outdated.

Maybe just make both properties of the static model?
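
A minimal sketch of that recipe expressed as properties (hypothetical class and attribute names; the real StaticModel API may differ):

    import numpy as np

    class StaticModel:
        def __init__(self, embedding: np.ndarray, tokenizer) -> None:
            self.embedding = embedding  # (n_embeddings, dim) weight matrix
            self.tokenizer = tokenizer  # a tokenizers.Tokenizer

        @property
        def vocabulary_quantization(self) -> int | None:
            # The recipe above: if the token count differs from the number of
            # embeddings, the embedding count is the number of clusters.
            n_tokens = self.tokenizer.get_vocab_size()
            n_embeddings = self.embedding.shape[0]
            return n_embeddings if n_tokens != n_embeddings else None

        @property
        def embedding_dtype(self) -> str:
            # Read the dtype straight from the weights, so it can't go stale.
            return str(self.embedding.dtype)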

@Pringled
Member Author

Right, if you load a model with quantization, the stored value is outdated, true. Still, I think it's useful to know the dtype of a stored model: I was writing some tests with a couple of quantized models, and to find out their precision I had to actually load the safetensors. IMO it should be visible at least for saved models. What if we keep it in the config, also make it a property of the static model that we update after loading, and then save it directly from the static model? That way it should never be outdated.

@stephantul
Contributor

Sure, sounds good. I don't think you should ever read the config value at run time; just write it when you save.
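
A sketch of that write-on-save flow, reusing the properties above (hypothetical helper; per the Codecov report, the actual changes live in model2vec/hf_utils.py and model2vec/model.py):

    import json
    from pathlib import Path

    def save_config(model: "StaticModel", out_dir: Path, config: dict) -> None:
        # Compute the values from the live model at save time, never at load
        # time, so the stored config can't drift out of date.
        config["embedding_dtype"] = model.embedding_dtype
        if model.vocabulary_quantization is not None:
            config["vocabulary_quantization"] = model.vocabulary_quantization
        (out_dir / "config.json").write_text(json.dumps(config, indent=4))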

@Pringled merged commit 66c30a5 into main on Sep 29, 2025 (6 checks passed).
@Pringled deleted the add-quantization-type-to-config branch on September 29, 2025 19:29.