Conversation

@stephantul (Contributor)

Fixes #209 (again 😓)

This PR adds a Token class to keep track of whether a token was originally a subword or an added token. If it was a subword token, we should not pretokenize it. Previously, we relied on whether a token was in the original vocabulary, but this turned out not to work for unigram tokenizers with a metaspace pretokenizer when the string was also a subword token.
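A minimal sketch of the idea, with illustrative field names (the actual attributes in model2vec/distill/tokenizer.py may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str         # surface form under which the token is stored
    is_subword: bool  # True if the token originated from the tokenizer's own
                      # subword vocabulary; set once at creation, never inferred
                      # from vocabulary membership later
```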

For "ELLE", for example:

  1. Check if "ELLE" is in vocab, we pretokenize it to "_ELLE". This is fine
  2. Then, when adding it, we need to check whether it is a subword. But "ELLE" is also a subword.
  3. So we don't pretokenize it before adding it, causing us to add two tokens with the same surface form.
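With the explicit flag, the subword check no longer depends on a vocabulary lookup at add time. A hedged sketch of the corrected flow, reusing the Token sketch above (`pretokenize` is a stand-in for the metaspace pretokenization step, not the literal model2vec API):

```python
from typing import Callable

def vocab_form(token: Token, pretokenize: Callable[[str], str]) -> str:
    """Return the surface form under which a token is added to the new vocabulary."""
    if token.is_subword:
        # A genuine subword keeps its form verbatim: it is already stored
        # the way the tokenizer produced it.
        return token.form
    # An added token is pretokenized exactly once, here.
    return pretokenize(token.form)

# "ELLE" added by a user is NOT a subword, even though the string "ELLE"
# also happens to exist as a subword in the unigram vocabulary:
added = Token(form="ELLE", is_subword=False)
assert vocab_form(added, lambda s: "▁" + s) == "▁ELLE"  # no duplicate "ELLE"
```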
stephantul requested a review from Pringled on April 24, 2025, 18:19.
codecov bot commented Apr 24, 2025

Codecov Report

Attention: Patch coverage is 84.21053% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| model2vec/distill/tokenizer.py | 62.50% | 3 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| model2vec/distill/distillation.py | 94.24% <100.00%> (ø) |
| model2vec/distill/inference.py | 96.05% <100.00%> (+0.05%) ⬆️ |
| model2vec/distill/utils.py | 100.00% <100.00%> (ø) |
| model2vec/distill/tokenizer.py | 52.00% <62.50%> (+0.64%) ⬆️ |
Pringled (Member) left a comment:
LGTM

stephantul merged commit 39f02f6 into main on Apr 25, 2025.
5 of 6 checks passed
stephantul deleted the fix-unigram-vocab branch on April 25, 2025, 06:27.