fix: issues with unk and pad #225

stephantul · 2025-04-25T12:29:43Z

Some tokenizers use <|endoftext|> or another token as the PAD token, which breaks an assumption in StaticModelForClassification.

Fixes #223
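
To illustrate the failure mode described above, here is a minimal sketch; it is not the code from this PR. The two vocabularies and the helper `resolve_pad_id` are hypothetical. The exact assumption made by StaticModelForClassification is not shown in this thread, so the sketch assumes it amounts to expecting a pad token with a conventional name such as `[PAD]`.

```python
from typing import Dict, Optional

# Hypothetical vocabularies, for illustration only.
bert_like_vocab: Dict[str, int] = {"[PAD]": 0, "[UNK]": 1, "hello": 2}
gpt_like_vocab: Dict[str, int] = {"hello": 0, "world": 1, "<|endoftext|>": 2}


def resolve_pad_id(
    vocab: Dict[str, int],
    declared_pad_token: Optional[str],
    conventional_name: str = "[PAD]",
) -> Optional[int]:
    """Look up the pad id from the token the tokenizer declares as PAD.

    Falls back to the conventional name only when no pad token is declared
    or the declared token is missing from the vocabulary.
    Returns None when no pad token can be found at all.
    """
    if declared_pad_token is not None and declared_pad_token in vocab:
        return vocab[declared_pad_token]
    return vocab.get(conventional_name)


# Assuming a fixed name works for the BERT-style vocabulary...
hardcoded_bert = bert_like_vocab.get("[PAD]")
# ...but finds nothing when <|endoftext|> doubles as the PAD token.
hardcoded_gpt = gpt_like_vocab.get("[PAD]")

# Reading the declared pad token handles both cases.
resolved_bert = resolve_pad_id(bert_like_vocab, "[PAD]")
resolved_gpt = resolve_pad_id(gpt_like_vocab, "<|endoftext|>")
```

The design choice in the sketch is to trust whatever token the tokenizer declares as PAD and to treat the fixed name only as a fallback, so that a missing pad token surfaces as `None` instead of a silently wrong index.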

codecov · 2025-04-25T13:19:28Z

Codecov Report

Attention: Patch coverage is 67.21311% with 20 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
model2vec/distill/tokenizer.py	55.81%	19 Missing ⚠️
model2vec/distill/distillation.py	66.66%	1 Missing ⚠️

Files with missing lines	Coverage Δ
model2vec/distill/inference.py	97.22% <100.00%> (+1.16%)	⬆️
model2vec/distill/utils.py	100.00% <100.00%> (ø)
tests/test_distillation.py	89.15% <100.00%> (-0.74%)	⬇️
model2vec/distill/distillation.py	93.52% <66.66%> (-0.72%)	⬇️
model2vec/distill/tokenizer.py	55.31% <55.81%> (+3.31%)	⬆️

model2vec/distill/tokenizer.py

Pringled

LLLGTM

fix: issues with unk and pad

ad17768

stephantul requested a review from Pringled April 25, 2025 12:29

Pringled approved these changes Apr 25, 2025

fix tests

f02624d

stephantul added 2 commits April 25, 2025 15:23

upper case deprecated

1d2e815

Clearify code

5833c1c

Pringled reviewed Apr 26, 2025

model2vec/distill/tokenizer.py (outdated, resolved)

Pringled reviewed Apr 26, 2025

model2vec/distill/tokenizer.py (resolved)

fix: separate tokenizers

e025b9f

stephantul requested a review from Pringled April 27, 2025 05:44

Pringled approved these changes Apr 27, 2025

stephantul merged commit 0a07cd4 into main Apr 27, 2025
5 of 6 checks passed

stephantul deleted the fix-special-tokens-tokenizer-issue branch April 27, 2025 07:21
