fix(trainer): Correct loss scaling for incomplete gradient accumulation steps #39659
Conversation
qgallouedec left a comment
LGTM!
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
|
Hi @qgallouedec, thanks for the feedback! I've just pushed the requested changes. Could you please take another look when you have a chance? |
|
Looks good, let's see if the CI is happy |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
@qgallouedec All checks have passed, thanks for the review. 😊 |
|
Note for myself: this will silently break gradient accumulation in GRPOTrainer, because that trainer oversamples from the dataloader. I'll have to find a way to solve that. |
SunMarc left a comment
Thanks!
|
Hi @hutaiHang, I have a question. What if the last batch only has 1 sample? That single sample would then contribute the entire gradient of that step, so the model would update in a direction that is not representative (we want the batch size to be large so that each update generalizes well). That could make training unstable. I think we should keep the loss divided by `gradient_accumulation_steps`, e.g.:

```python
# trainer.py
if self.use_apex:
    with amp.scale_loss(loss, self.optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    # Finally we need to normalize the loss for reporting if GA loss bug is not fixed during compute loss
    if not self.model_accepts_loss_kwargs and self.compute_loss_func is None:
        loss = loss / self.args.gradient_accumulation_steps

    # Turning off loss scaling w.r.t. gradient accumulation when DeepSpeed is enabled
    # https://github.com/huggingface/transformers/pull/35808
    if self.accelerator.distributed_type == DistributedType.DEEPSPEED:
        kwargs["scale_wrt_gas"] = False

    self.accelerator.backward(loss, **kwargs)

rescale_loss = loss.detach() * self.args.gradient_accumulation_steps / self.current_gradient_accumulation_steps
return rescale_loss
```
|
|
Hi @kaln27, thank you for bringing up this excellent point about training dynamics! It's a very important consideration. You are absolutely right to be concerned about potential instability if an optimizer step is based on a very small final batch (e.g., a single sample); the gradient from such a small batch can indeed be noisy. However, I believe the solution proposed in this PR is the correct way to fix the underlying mathematical bug in the `Trainer`'s loss scaling.

1. Gradient magnitude correctness (the goal of this PR): The purpose of gradient accumulation is to simulate a larger batch size, so the final gradient update should have a magnitude equal to the average of the gradients from the micro-batches that were actually processed. This PR ensures that mathematical correctness: if we were to always divide by `args.gradient_accumulation_steps`, the last, incomplete cycle would produce an update that is scaled down too much.

2. Training stability (the concern you raised): Whether one wants to perform an update based on a noisy, small batch at all is a matter of training strategy. The standard way to handle that is to drop the incomplete final batch, e.g. via the `dataloader_drop_last` option in `TrainingArguments`.

Conclusion: this PR is focused on making the loss scaling for gradient accumulation mathematically correct. I hope this clarifies my approach. Let me know what you think! |
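To make the magnitude argument above concrete, here is a minimal numeric sketch (hypothetical values, not code from the PR or the library):

```python
# Hypothetical numeric sketch: effect of the divisor when the last
# accumulation cycle of an epoch is incomplete.
micro_batch_losses = [2.0, 3.0]      # only 2 micro-batches remain in the epoch
gradient_accumulation_steps = 4      # configured value

# Old behaviour: always divide by the configured number of steps.
old_scaled = sum(l / gradient_accumulation_steps for l in micro_batch_losses)

# Behaviour after this PR: divide by the number of micro-batches actually processed.
actual_steps = len(micro_batch_losses)
new_scaled = sum(l / actual_steps for l in micro_batch_losses)

print(old_scaled)  # 1.25 -> last update scaled down by an extra factor of 2/4
print(new_scaled)  # 2.5  -> true average loss of the processed micro-batches
```

Dividing by the fixed configured value shrinks the final update by the ratio of processed micro-batches to `gradient_accumulation_steps`; dividing by the actual count recovers the intended average.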
fix(trainer): Correct loss scaling for incomplete gradient accumulation steps (huggingface#39659)

* Fix issue [huggingface#38837]: wrong loss scaled in last step of epoch
* chore: trigger CI
* Update src/transformers/trainer.py
  Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update src/transformers/modeling_flash_attention_utils.py
  Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: taihang <taihang@U-2RHYVWX7-2207.local>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
What does this PR do?
This PR addresses an issue where the loss scaling during gradient accumulation is incorrect for the final optimizer step of an epoch if the total number of batches is not perfectly divisible by `gradient_accumulation_steps`.

Currently, the loss for each micro-batch is always divided by the configured `args.gradient_accumulation_steps`. This leads to the accumulated loss for the final, incomplete cycle being scaled down too much, resulting in an improperly small gradient update for that step.

This fix resolves the issue by dynamically tracking the number of micro-batches processed in each accumulation cycle and using this actual count for loss scaling.
The changes are as follows:
- In `_inner_training_loop`, a new instance variable `self.cur_gradient_accumulation_steps` is introduced. It is updated at the start of each optimizer step with the actual number of batches being processed (i.e., `len(batch_samples)`).
- In the `training_step` method, the loss scaling logic now uses this dynamic `self.cur_gradient_accumulation_steps` value instead of the fixed `self.args.gradient_accumulation_steps`.

This ensures that the loss is correctly averaged over the number of batches that actually contributed to the gradient accumulation, regardless of whether the cycle was complete or not. This change has no new dependencies.
Fixes #38837
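For illustration, here is a minimal, self-contained sketch of the mechanism described above, using a hypothetical `MiniTrainer` rather than the real `Trainer` class; only the two hooks touched by this PR are modeled, and the numbers are made up:

```python
# Hypothetical MiniTrainer sketch (not the real Trainer): the loop records the
# true size of each accumulation cycle and training_step divides by that value.

class MiniTrainer:
    def __init__(self, gradient_accumulation_steps: int):
        self.gradient_accumulation_steps = gradient_accumulation_steps
        self.cur_gradient_accumulation_steps = gradient_accumulation_steps

    def training_step(self, loss: float) -> float:
        # Scale by the actual number of micro-batches in the current cycle,
        # not by the configured gradient_accumulation_steps.
        return loss / self.cur_gradient_accumulation_steps

    def inner_training_loop(self, micro_batch_losses: list) -> list:
        accumulated = []
        ga = self.gradient_accumulation_steps
        for start in range(0, len(micro_batch_losses), ga):
            batch_samples = micro_batch_losses[start:start + ga]
            # Mirrors the PR: updated at the start of each optimizer step
            # with the number of micro-batches actually being processed.
            self.cur_gradient_accumulation_steps = len(batch_samples)
            accumulated.append(sum(self.training_step(l) for l in batch_samples))
        return accumulated


trainer = MiniTrainer(gradient_accumulation_steps=4)
# 6 micro-batch losses with GA=4: the second cycle only has 2 micro-batches.
print(trainer.inner_training_loop([1.0, 2.0, 3.0, 4.0, 5.0, 7.0]))
# [2.5, 6.0] -> each entry is the mean loss of its own cycle
```

For a complete cycle, `len(batch_samples)` equals the configured `gradient_accumulation_steps`, so the behaviour is unchanged; only the final, incomplete cycle of an epoch is scaled differently.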
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. (This PR is for issue #38837: Loss is incorrectly scaled in Trainer during the last step with gradient accumulation when the final batch is smaller than accumulation steps.)
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?