@stephantul
Contributor

Some tokenizers use <|endoftext|> or another token as the PAD token, which breaks an assumption in StaticModelForClassification.

Fixes #223
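
For illustration only: the description above does not spell out which assumption breaks. One plausible reading is that code looks for a dedicated padding token such as `[PAD]`. The sketch below uses toy vocabularies and hypothetical helpers (`resolve_pad_id`, `attention_mask`), not model2vec's API, to show why reading the padding token from the tokenizer's own configuration is more robust than hard-coding its name.

```python
from typing import Dict, List, Optional


def resolve_pad_id(vocab: Dict[str, int], pad_token: Optional[str]) -> int:
    """Look up the padding id from the tokenizer's declared pad token.

    Hypothetical helper: `vocab` maps token strings to ids and `pad_token`
    is whatever the tokenizer itself reports as its padding token.
    """
    if pad_token is None:
        raise ValueError("tokenizer declares no padding token")
    if pad_token not in vocab:
        raise KeyError(f"padding token {pad_token!r} is not in the vocabulary")
    return vocab[pad_token]


def attention_mask(token_ids: List[int], pad_id: int) -> List[bool]:
    """Mark every position that is not padding."""
    return [token_id != pad_id for token_id in token_ids]


# Toy vocabularies (assumed for this example, not taken from real models).
bert_like = {"[PAD]": 0, "[UNK]": 1, "hello": 2}
gpt_like = {"hello": 0, "world": 1, "<|endoftext|>": 2}

# A hard-coded "[PAD]" lookup works for the first vocabulary only.
assert "[PAD]" in bert_like
assert "[PAD]" not in gpt_like

# Asking the tokenizer which token pads works for both.
pad_id = resolve_pad_id(gpt_like, "<|endoftext|>")
mask = attention_mask([0, 1, 2, 2], pad_id)
```

Here the second vocabulary reuses `<|endoftext|>` for padding, so the two trailing positions are masked out even though no `[PAD]` entry exists.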

@stephantul stephantul requested a review from Pringled April 25, 2025 12:29
codecov bot commented Apr 25, 2025

Codecov Report

Attention: Patch coverage is 67.21311% with 20 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| model2vec/distill/tokenizer.py | 55.81% | 19 Missing ⚠️ |
| model2vec/distill/distillation.py | 66.66% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| model2vec/distill/inference.py | 97.22% <100.00%> (+1.16%) ⬆️ |
| model2vec/distill/utils.py | 100.00% <100.00%> (ø) |
| tests/test_distillation.py | 89.15% <100.00%> (-0.74%) ⬇️ |
| model2vec/distill/distillation.py | 93.52% <66.66%> (-0.72%) ⬇️ |
| model2vec/distill/tokenizer.py | 55.31% <55.81%> (+3.31%) ⬆️ |
@stephantul stephantul requested a review from Pringled April 27, 2025 05:44
Member

@Pringled Pringled left a comment

LLLGTM

@stephantul stephantul merged commit 0a07cd4 into main Apr 27, 2025
5 of 6 checks passed
@stephantul stephantul deleted the fix-special-tokens-tokenizer-issue branch April 27, 2025 07:21