feat: add vocabulary quantization #271
Conversation
This new version stores all the relevant information (i.e., mappings, weights, etc.) in the safetensors file.
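As an illustration of what storing everything in a single safetensors file could look like, here is a minimal sketch. The tensor names, shapes, and dtypes below are assumptions for illustration only; the actual keys written by model2vec may differ.

```python
import numpy as np
from safetensors.numpy import save_file, load_file

# Hypothetical sizes, for illustration only.
vocab_size, n_clusters, dim = 30_000, 256, 64

tensors = {
    # One shared embedding vector per cluster.
    "embeddings": np.random.rand(n_clusters, dim).astype(np.float32),
    # Mapping from each token id to its cluster id.
    "token_to_cluster": np.random.randint(0, n_clusters, size=vocab_size),
    # One unique weight per token.
    "weights": np.random.rand(vocab_size).astype(np.float32),
}
save_file(tensors, "model.safetensors")

# All three arrays round-trip through the single file.
restored = load_file("model.safetensors")
assert np.array_equal(restored["token_to_cluster"], tensors["token_to_cluster"])
```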
Pringled left a comment:
LGTM, I tested it on a few models and it works well. The only thing missing is some documentation; we could probably add a small snippet in https://github.com/MinishLab/model2vec/blob/main/docs/usage.md?
```python
# Store the original dtype to restore it later
orig_dtype = embeddings.dtype
# ...
kmeans = KMeans(n_clusters=n_clusters, random_state=42, init="random")
```
I think this is quite slow if the vocab is very large. Is MiniBatchKMeans an option (or an optional arg) that could help?
Maybe; we could consider that for a follow-up if you think that's interesting (a rough sketch of what it could look like is below).
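For reference, a minimal sketch of what such an optional argument could look like. The `fit_clusters` helper and the `use_minibatch` flag are hypothetical, not part of this PR; only the scikit-learn classes are real.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans

def fit_clusters(embeddings, n_clusters, use_minibatch=False):
    """Cluster token embeddings, optionally with the faster mini-batch variant."""
    if use_minibatch:
        # MiniBatchKMeans updates centroids on small batches, which is much
        # faster on large vocabularies at the cost of slightly worse clusters.
        kmeans = MiniBatchKMeans(
            n_clusters=n_clusters, random_state=42, init="random", batch_size=1024
        )
    else:
        kmeans = KMeans(n_clusters=n_clusters, random_state=42, init="random")
    return kmeans.fit(embeddings)
```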
This PR adds vocabulary quantization.
The quantization itself is handled by scikit-learn, so that has been added as a dependency to `quantization`. There's a helper function to quantize already existing models, which can be imported directly; the decoupled representation it produces is sketched below.

The quantization itself necessitated a lot of internal changes. Most prominently, the embeddings and token weights are now decoupled: every token still has a unique weight, but shares its embedding with the other tokens in the same cluster.
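A minimal sketch of that decoupled representation, under the assumption that a token's vector is its cluster's shared embedding scaled by the token's own weight. The array names and the `token_vector` function are illustrative; the PR's actual internals may differ.

```python
import numpy as np

def token_vector(
    token_id: int,
    cluster_embeddings: np.ndarray,  # (n_clusters, dim): one shared row per cluster
    token_to_cluster: np.ndarray,    # (vocab_size,): token id -> cluster id
    token_weights: np.ndarray,       # (vocab_size,): one unique weight per token
) -> np.ndarray:
    """Reconstruct a token's vector from the quantized representation."""
    shared = cluster_embeddings[token_to_cluster[token_id]]
    return token_weights[token_id] * shared
```

Tokens in the same cluster thus share a single stored embedding row, while each keeps its own scalar weight, which is what makes the vocabulary quantization lossy on directions but exact on per-token magnitudes.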
Old models can still be created and loaded, so there are no breaking changes. However, quantized models can't be loaded by older versions of model2vec.