Contributor

@stephantul stephantul commented Mar 26, 2025

This PR completely rewrites the distill backend.

  • There's no longer a difference between subword and vocabulary inference
  • BPE and unigram tokenizer token removal/addition is supported
  • All tokenizers now take into account normalization when adding tokens
  • Tokens that are multi-word units are no longer allowed
  • Tokenizers are slightly smaller because unneeded special tokens are removed
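
To make the normalization and multi-word rules in the list above concrete, here is a minimal sketch of how candidate tokens might be filtered before they are added to a vocabulary. It is not the code from this PR: `normalize` and `clean_vocabulary` are hypothetical names, the normalizer (NFKC plus lowercasing) is an assumed stand-in for whatever normalizer the tokenizer actually defines, and "multi-word" is approximated by whitespace splitting.

```python
import unicodedata


def normalize(token: str) -> str:
    """Stand-in normalizer (assumption): NFKC, lowercasing, and trimming."""
    return unicodedata.normalize("NFKC", token).lower().strip()


def clean_vocabulary(candidates: list[str], existing: set[str]) -> list[str]:
    """Return the normalized candidates that could be added, in input order."""
    seen = set(existing)
    kept = []
    for candidate in candidates:
        token = normalize(candidate)
        # Skip empty strings and multi-word units
        # (anything that is not exactly one whitespace-separated piece).
        if len(token.split()) != 1:
            continue
        # Skip tokens that collide with an existing or already-added
        # token once normalization has been applied.
        if token in seen:
            continue
        seen.add(token)
        kept.append(token)
    return kept


# "\ufb01sh" starts with the "fi" ligature, which NFKC expands to "fi".
print(clean_vocabulary(["Hello", "hello", "New York", "\ufb01sh", "  "], {"the"}))
# → ['hello', 'fish']
```

Normalizing before the duplicate check is what keeps `Hello` and `hello` from both being added when the tokenizer would lowercase its input anyway.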

codecov bot commented Mar 26, 2025

Codecov Report

Attention: Patch coverage is 76.30058% with 41 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| model2vec/distill/tokenizer.py | 47.05% | 36 Missing ⚠️ |
| model2vec/distill/distillation.py | 91.66% | 3 Missing ⚠️ |
| model2vec/distill/inference.py | 95.55% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| model2vec/distill/utils.py | 100.00% <100.00%> (ø) |
| tests/conftest.py | 100.00% <100.00%> (ø) |
| tests/test_distillation.py | 89.88% <100.00%> (-0.12%) ⬇️ |
| model2vec/distill/inference.py | 95.58% <95.55%> (+1.74%) ⬆️ |
| model2vec/distill/distillation.py | 94.73% <91.66%> (-1.10%) ⬇️ |
| model2vec/distill/tokenizer.py | 51.35% <47.05%> (-24.52%) ⬇️ |
Member

@Pringled Pringled left a comment

BIG CLEAN 🧹 few small concerns

@stephantul stephantul requested a review from Pringled March 30, 2025 12:11
@stephantul stephantul requested a review from Pringled March 31, 2025 11:47
@stephantul stephantul merged commit 844c3fa into main Apr 9, 2025
5 of 6 checks passed
@stephantul stephantul deleted the rewrite_backend branch April 9, 2025 14:44