@stephantul stephantul commented May 4, 2025

Adds support for SuperBPE-like tokens.

To test:

from model2vec.distill import distill

model_name = "baai/bge-base-en-v1.5"

model = distill(model_name, pca_dims=256, quantize_to="int8", vocabulary=["chat-gpt", "room for the moon"])

[model.tokens[x] for x in model.tokenize(["chat-gpt room for the moon, is great!"])[0]]
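
For readers unfamiliar with the idea, here is a minimal standalone sketch of what a "superbpe-like" token means here: a vocabulary entry that spans whitespace (such as "room for the moon") is emitted as a single token instead of being split word by word. This is not the code from this PR. The function name `merge_multiword_tokens` and the greedy longest-match strategy are assumptions made for illustration only.

```python
def merge_multiword_tokens(words, multiword_vocab):
    """Greedily merge runs of words that match a multiword vocabulary entry.

    Hypothetical helper for illustration; not part of model2vec.
    """
    # Store each vocabulary entry as a tuple of its whitespace-separated words.
    entries = {tuple(entry.split()) for entry in multiword_vocab}
    max_len = max((len(entry) for entry in entries), default=1)

    merged = []
    i = 0
    while i < len(words):
        # Try the longest possible span first, down to spans of two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            span = tuple(words[i : i + n])
            if span in entries:
                merged.append(" ".join(span))
                i += n
                break
        else:
            # No multiword entry starts here, so keep the single word.
            merged.append(words[i])
            i += 1
    return merged


words = "chat-gpt room for the moon is great".split()
tokens = merge_multiword_tokens(words, ["chat-gpt", "room for the moon"])
```

In this sketch, `tokens` keeps "room for the moon" as one item while the remaining words stay separate. The actual tokenizer in the PR may match and normalize entries differently.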

codecov bot commented May 4, 2025

Codecov Report

Attention: Patch coverage is 92.28792% with 30 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| model2vec/tokenizer/tokenizer.py | 92.50% | 12 Missing ⚠️ |
| model2vec/tokenizer/pretokenizer.py | 75.75% | 8 Missing ⚠️ |
| model2vec/distill/distillation.py | 72.72% | 6 Missing ⚠️ |
| tests/test_distillation.py | 75.00% | 2 Missing ⚠️ |
| tests/conftest.py | 87.50% | 1 Missing ⚠️ |
| tests/test_tokenizer.py | 98.63% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| model2vec/distill/inference.py | 97.46% <100.00%> (+0.24%) ⬆️ |
| model2vec/distill/utils.py | 100.00% <ø> (ø) |
| model2vec/model.py | 94.19% <100.00%> (+0.03%) ⬆️ |
| model2vec/tokenizer/__init__.py | 100.00% <100.00%> (ø) |
| model2vec/tokenizer/datamodels.py | 100.00% <100.00%> (ø) |
| model2vec/tokenizer/model.py | 100.00% <100.00%> (ø) |
| model2vec/tokenizer/normalizer.py | 100.00% <100.00%> (ø) |
| tests/conftest.py | 98.50% <87.50%> (-1.50%) ⬇️ |
| tests/test_tokenizer.py | 98.63% <98.63%> (ø) |
| tests/test_distillation.py | 89.02% <75.00%> (-0.14%) ⬇️ |

... and 3 more
@stephantul stephantul changed the title Add superbpe May 22, 2025
@stephantul stephantul marked this pull request as ready for review May 22, 2025 10:51
@stephantul stephantul requested a review from Pringled May 22, 2025 10:51

@Pringled Pringled left a comment

Verrrrry nice! LGTM, I tested it a bit more and I think everything works.

import os
import re
from typing import Literal, Union
from typing import cast

😮

@stephantul stephantul merged commit 80338f2 into main May 26, 2025
5 of 6 checks passed
@stephantul stephantul deleted the add-superbpe branch May 26, 2025 12:14