
[BUG] Paddle_OCR_(1B)_Vision.ipynb UnslothVisionDataCollator: Incorrect masking boundary causes fine-tuned models to generate premature EOS (~90% empty outputs) #149

@ljcamargo

Description


Summary

The UnslothVisionDataCollator with train_on_responses_only=True cannot cleanly mask the prompt/response boundary because the tokenizer merges the separating space with the following character (this does not seem to happen with other scripts, but it does with Latin). As a result, the unmasked trainable sequence always includes an unwanted leading space, which causes fine-tuned models to emit the end-of-sequence token as their very first token most of the time, effectively producing empty inferences in over 90% of cases. The fine-tuned model does achieve correct inference when min_new_tokens=2 is set and the temperature is raised above 1, but this is a less-than-optimal workaround, as the first character is usually hallucinated and has to be removed.

Environment

  • As set up in the notebook

The Problem

When using:

custom_collator = UnslothVisionDataCollator(
    model=model,
    processor=processor,
    ignore_index=-100,
    max_seq_length=2048,
    train_on_responses_only=True,
    instruction_part="User: ",
    response_part="\nAssistant:",
    pad_to_multiple_of=8,
)

The collator successfully masks the prompt tokens [..., 92267, 93963] (corresponding to "Assistant:"), but the very next token contains both the space separator AND the first character of the response merged together as a single token. Setting response_part="\nAssistant: " (with a trailing space) does not work either, because the mask then fails entirely.

Digging into the issue, I found the problem is due to the model's tokenizer, which makes it impossible to mask precisely after "Assistant:". This may not show up often with the math formulas in the dataset used in the notebook, nor with other scripts such as Chinese, but it will fail on virtually any other fine-tuning over Latin-script text.

How Subword Tokenizers Merge Spaces

The tokenizer of this model does the following:

# The actual training data tokenizes like this:
"...Assistant: Zuh tēe..." → [..., 92267, 93963, 1276, 3269, ...]
#                                  'Assistant'  ':'    ' Z'   'uh'
#                                                        ↑
#                                              Token 1276 = space + 'Z' as ONE token

The fundamental issue: Token 1276 represents ' Z' as an indivisible unit. You cannot mask the space part while leaving the 'Z' part unmasked—tokens are atomic.

This means:

  • ✅ We can mask everything up to : (token 93963)
  • ❌ The next token 1276 = ' Z' must be either fully masked or fully unmasked
  • ❌ The space is part of the response token, so the model trains on it
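A quick way to confirm this merging behaviour is to tokenize a small snippet directly. The following is only a sketch: `tok` is assumed to be the notebook processor's tokenizer, and the exact ids and token strings will depend on the actual Paddle OCR tokenizer.

# Inspect how the boundary tokenizes (illustrative; ids differ per tokenizer)
tok = processor.tokenizer

ids = tok("\nAssistant: Zuh tēe", add_special_tokens=False)["input_ids"]
for i in ids:
    print(i, repr(tok.decode([i])))
# Expected on this tokenizer: ':' is its own token, followed by a single id
# that decodes to ' Z', i.e. the space and the 'Z' are fused into one token.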

Visualization of the Masking Boundary

Using a diagnostic script to inspect what the collator produces:

Index    Token ID     Label        Masked?    Decoded Token        
--------------------------------------------------------------------------------
207      92267        -100         YES ❌      'Assistant'            ← Token 92267
208      93963        -100         YES ❌      ':'                    ← Token 93963
209      1276         1276         NO ✅       ' Z'                   ← TRANSITION
210      3269         3269         NO ✅       'uh'


MASKED SECTION (prompt - should NOT be trained on):
'<|begin_of_sentence|>User: <|IMAGE_START|>...<|IMAGE_END|>OCR:\nAssistant:'

UNMASKED SECTION (response - SHOULD be trained on):
' Zuh tēe incā nación jāá nduú jínī...'
↑ Note: Starts with a space because it's merged with the first character

The problem: The response section being trained on starts with ' Z' (space + Z), when it should start with just 'Z'. The space should belong to the prompt formatting, not the response content.
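For reference, the diagnostic output above can be reproduced with a few lines. This is a sketch that assumes `custom_collator`, `processor`, and a formatted `dataset` as defined in the notebook, and that the collator returns standard input_ids/labels tensors.

# Inspect the prompt/response masking boundary produced by the collator
batch = custom_collator([dataset[0]])            # one formatted example
input_ids = batch["input_ids"][0].tolist()
labels = batch["labels"][0].tolist()

print(f"{'Index':<8}{'Token ID':<12}{'Label':<12}{'Masked?':<10}Decoded")
for i, (tid, lab) in enumerate(zip(input_ids, labels)):
    masked = "YES" if lab == -100 else "NO"
    print(f"{i:<8}{tid:<12}{lab:<12}{masked:<10}{processor.tokenizer.decode([tid])!r}")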

Besides, as far as I have researched, this tokenizer does not provide any special token that ensures a clean boundary. In further tests, discussed below, I added a new special token <|boundary|> to the tokenizer, but this changes the provided tokenizer, which may be undesirable.

Impact on Fine-tuned Model

After fine-tuning with this masking issue, the model exhibits these problems during inference:

  1. ~90% empty outputs: When using standard generation parameters like temperature=1.5, min_p=0.1 (as suggested in Unsloth's example notebooks), the model generates the EOS token immediately in ~90% of cases, producing empty results.

  2. Requires workarounds: The issue can be mitigated by:

    • Adding min_new_tokens=2 to force generation
    • Raising the temperature even further.
    • Of course, these workarounds are not proper solutions.

Workarounds

Workaround 1: Add a non-merging separator token

# Modify training data to include a separator that doesn't merge
formatted_text = formatted_text.replace("Assistant: ", "Assistant: <|BOUNDARY|>")

# Update collator
custom_collator = UnslothVisionDataCollator(
    response_part="\nAssistant: <|BOUNDARY|>",  # Now masks cleanly
    ...
)

This works because special tokens like <|BOUNDARY|> are always separate tokens and never merge with adjacent text. It may not be convenient if changing the tokenizer is undesired; in fact, in my preliminary tests, the modified tokenizer could not be instantiated once pushed back to HF and downloaded again, possibly because of another bug in the model code.
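For completeness, one way to register the token is via the standard Hugging Face tokenizer/model API; this is only a sketch under the assumption that the Unsloth-wrapped model accepts these calls as usual.

# Register <|BOUNDARY|> as an additional special token so it never merges
num_added = processor.tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|BOUNDARY|>"]}
)
if num_added > 0:
    # Grow the embedding matrix to cover the new token id
    model.resize_token_embeddings(len(processor.tokenizer))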

Workaround 2: Accept training on the leading space
Use response_part="\nAssistant:" (without a trailing space) and accept that the first response token includes a space. Then use min_new_tokens=2 during inference to avoid premature EOS generation, with the side effect of a hallucinated first character that has to be removed. This is less than optimal because, in the ~10% of cases where the prediction is otherwise accurate, the first character is lost.
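Roughly, the inference side of this workaround looks like the sketch below; `inputs` stands for the processed image + prompt from the notebook, and slicing off the prompt tokens before decoding is assumed.

# Workaround 2 at inference time: force past the premature EOS, then drop
# the (usually hallucinated) first character of the decoded response.
generated = model.generate(
    **inputs,
    min_new_tokens=2,
    temperature=1.5,
    min_p=0.1,
)
new_tokens = generated[0][inputs["input_ids"].shape[1]:]
text = processor.tokenizer.decode(new_tokens, skip_special_tokens=True)
text = text[1:]   # discard the hallucinated leading character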

Proposed Solutions

Solution: Document the UnslothVisionDataCollator limitation clearly by explaining the space-merging behavior of subword tokenizers, and/or escalate by opening an issue with the maintainers of this collator to look for ways to achieve exact masking, such as splitting the pre-tokenized string into prompt/response parts and tokenizing them separately (see the sketch below). Currently this is hard to achieve in the notebook without heavy collator refactoring.
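As an illustration of what "splitting then tokenizing" would mean inside the collator, here is a conceptual sketch. It is not the current UnslothVisionDataCollator API; `prompt_text` and `response_text` are hypothetical names for the already-formatted halves of one example.

# Tokenize prompt and response separately so the boundary falls exactly on a
# token edge, then build labels from the two pieces.
tok = processor.tokenizer
prompt_ids = tok(prompt_text, add_special_tokens=False)["input_ids"]
response_ids = tok(response_text, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids   # train only on the response
# Caveat: tokenizing the halves separately can differ slightly from tokenizing
# the concatenated string, which is exactly the trade-off being discussed.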

Solution: Standardize the process of adding a new <|BOUNDARY|> special token to the tokenizer in the notebook, so that it works for most fine-tuning cases "out of the box".

Expected Behavior

Ideally, the notebook should:

  1. Warn the user about the issue
  2. Add a new boundary token to the tokenizer and keep the new tokenizer usable after it is published
  3. Document this limitation prominently.

or propose changes to UnslothVisionDataCollator
