Conversation

@stephantul (Contributor)

Fixes #209 (again 😓)

This PR adds a Token class to keep track of whether a token was originally a subword or an added token. If it was a subword token, we should not pretokenize it. Previously, we relied on whether a token was in the original vocabulary, but this turned out not to work for unigram tokenizers with a metaspace pretokenizer when the string was also a subword token.
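A minimal sketch of the idea, with illustrative field names (the actual attributes in model2vec/distill/tokenizer.py may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str         # surface form under which the token is stored
    is_subword: bool  # True if the token originated from the tokenizer's own
                      # subword vocabulary; set once at creation, never inferred
                      # from vocabulary membership later
```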

For "ELLE", for example:

  1. Check if "ELLE" is in vocab, we pretokenize it to "_ELLE". This is fine
  2. Then, when adding it, we need to check whether it is a subword. But "ELLE" is also a subword.
  3. So we don't pretokenize it before adding it, causing us to add two tokens with the same surface form.
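With the explicit flag, the subword check no longer depends on a vocabulary lookup at add time. A hedged sketch of the corrected flow, reusing the Token sketch above (`pretokenize` is a stand-in for the metaspace pretokenization step, not the literal model2vec API):

```python
from typing import Callable

def vocab_form(token: Token, pretokenize: Callable[[str], str]) -> str:
    """Return the surface form under which a token is added to the new vocabulary."""
    if token.is_subword:
        # A genuine subword keeps its form verbatim: it is already stored
        # the way the tokenizer produced it.
        return token.form
    # An added token is pretokenized exactly once, here.
    return pretokenize(token.form)

# "ELLE" added by a user is NOT a subword, even though the string "ELLE"
# also happens to exist as a subword in the unigram vocabulary:
added = Token(form="ELLE", is_subword=False)
assert vocab_form(added, lambda s: "▁" + s) == "▁ELLE"  # no duplicate "ELLE"
```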
stephantul requested a review from Pringled on April 24, 2025, 18:19.
codecov bot commented Apr 24, 2025

Codecov Report

Attention: Patch coverage is 84.21053% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| model2vec/distill/tokenizer.py | 62.50% | 3 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| model2vec/distill/distillation.py | 94.24% <100.00%> (ø) |
| model2vec/distill/inference.py | 96.05% <100.00%> (+0.05%) ⬆️ |
| model2vec/distill/utils.py | 100.00% <100.00%> (ø) |
| model2vec/distill/tokenizer.py | 52.00% <62.50%> (+0.64%) ⬆️ |
Pringled (Member) left a comment:
LGTM

stephantul merged commit 39f02f6 into main on Apr 25, 2025.
5 of 6 checks passed
stephantul deleted the fix-unigram-vocab branch on April 25, 2025, 06:27.