Conversation

@stephantul
Contributor

This PR adds vocabulary quantization.

The quantization itself is handled by scikit-learn, so it has been added as a dependency for quantization. There's a helper function to quantize already existing models, which can be imported directly:

from model2vec import quantize_model

model = quantize_model(model, vocabulary_quantization=1024)

The quantization itself necessitated a lot of internal changes. Most prominently, the embeddings and token weights are now decoupled: every token still has a unique weight, but shares its embedding with the other tokens in its cluster.
Old models can still be created and loaded, so there are no breaking changes. Quantized models can't be loaded with the old model2vec, however.
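The decoupling described above can be sketched as follows. This is an illustrative NumPy mock-up, not the PR's actual code: the names `codebook`, `assignments`, and `weights` are hypothetical stand-ins for the shared cluster embeddings, the token-to-cluster mapping, and the per-token weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_clusters, dim = 10_000, 256, 64

# Shared embeddings: one vector per cluster, not one per token.
codebook = rng.normal(size=(n_clusters, dim)).astype(np.float32)
# Each token is assigned to a cluster...
assignments = rng.integers(0, n_clusters, size=vocab_size)
# ...but keeps its own unique scalar weight.
weights = rng.random(vocab_size).astype(np.float32)

def embed(token_ids: np.ndarray) -> np.ndarray:
    # A token's vector is its cluster's shared embedding, scaled by its own weight.
    return codebook[assignments[token_ids]] * weights[token_ids, None]

vecs = embed(np.array([1, 2, 3]))
```

Storage drops from `vocab_size × dim` floats to `n_clusters × dim` floats plus one weight and one cluster index per token.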

@stephantul stephantul changed the title Vocquant Aug 16, 2025
@codecov

codecov bot commented Aug 16, 2025

Codecov Report

❌ Patch coverage is 92.00000% with 10 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
model2vec/distill/distillation.py | 72.72% | 3 Missing ⚠️
model2vec/model.py | 92.30% | 3 Missing ⚠️
model2vec/train/base.py | 85.71% | 2 Missing ⚠️
model2vec/vocabulary_quantization.py | 90.47% | 2 Missing ⚠️

Files with missing lines | Coverage Δ
model2vec/__init__.py | 100.00% <100.00%> (ø)
model2vec/distill/inference.py | 97.50% <100.00%> (+0.03%) ⬆️
model2vec/hf_utils.py | 79.46% <100.00%> (+3.22%) ⬆️
model2vec/inference/model.py | 92.66% <100.00%> (+0.27%) ⬆️
model2vec/quantization.py | 96.77% <100.00%> (+0.22%) ⬆️
model2vec/tokenizer/normalizer.py | 95.23% <100.00%> (ø)
model2vec/train/classifier.py | 97.56% <100.00%> (ø)
model2vec/train/base.py | 97.89% <85.71%> (-2.11%) ⬇️
model2vec/vocabulary_quantization.py | 90.47% <90.47%> (ø)
model2vec/distill/distillation.py | 86.20% <72.72%> (-2.41%) ⬇️
... and 1 more
@stephantul stephantul requested a review from Pringled September 7, 2025 18:35
@stephantul
Contributor Author

This new version stores all the relevant information (i.e., mappings, weights etc.) in the safetensors file.

Member

@Pringled Pringled left a comment


LGTM, tested it on a few models and it works well I think. Only thing missing is some documentation; can probably add a small snippet in https://github.com/MinishLab/model2vec/blob/main/docs/usage.md?

from sklearn.cluster import KMeans

# Store the original dtype to restore it later
orig_dtype = embeddings.dtype

kmeans = KMeans(n_clusters=n_clusters, random_state=42, init="random")
Member


I think this is quite slow if the vocab is very large. Is MiniBatchKMeans an option (or an optional arg) that could help?
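For reference, the swap the reviewer suggests would look roughly like this. This is a standalone sketch of scikit-learn's MiniBatchKMeans, not code from the PR; the data and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for a large embedding matrix (tokens x dims).
embeddings = np.random.default_rng(42).normal(size=(5000, 32)).astype(np.float32)

# MiniBatchKMeans fits on small random batches instead of the full matrix,
# trading a little cluster quality for a large speedup on big vocabularies.
kmeans = MiniBatchKMeans(n_clusters=64, random_state=42, batch_size=1024)
labels = kmeans.fit_predict(embeddings)
```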

Contributor Author


Maybe, we could consider that for a follow-up if you think that's interesting

@stephantul stephantul merged commit 7bf0bf0 into main Sep 9, 2025
5 of 6 checks passed
@stephantul stephantul deleted the vocquant branch September 9, 2025 11:20
