fix(trainer): Correct loss scaling for incomplete gradient accumulation steps #39659
Conversation
qgallouedec left a comment
LGTM!
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
|
Hi @qgallouedec, thanks for the feedback! I've just pushed the requested changes. Could you please take another look when you have a chance? |
|
Looks good, let's see if the CI is happy |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
@qgallouedec All checks have passed, thanks for the review. 😊 |
|
Note for myself: this will silently break gradient accumulation in GRPOTrainer, because that trainer oversamples from the dataloader. I'll have to find a way to solve that. |
SunMarc left a comment
Thanks!
|
Hi @hutaiHang, I have a question. What if the last batch only has 1 sample? That single sample would then contribute the entire gradient of that step, so the model would update in a direction that is not representative (we want the batch size to be large so that each update generalizes well). That could make training unstable. I think we should keep the loss divided by `gradient_accumulation_steps`, e.g.:

```python
# trainer.py
if self.use_apex:
    with amp.scale_loss(loss, self.optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    # Finally we need to normalize the loss for reporting if GA loss bug is not fixed during compute loss
    if not self.model_accepts_loss_kwargs and self.compute_loss_func is None:
        loss = loss / self.args.gradient_accumulation_steps

    # Turning off loss scaling w.r.t. gradient accumulation when DeepSpeed is enabled
    # https://github.com/huggingface/transformers/pull/35808
    if self.accelerator.distributed_type == DistributedType.DEEPSPEED:
        kwargs["scale_wrt_gas"] = False

    self.accelerator.backward(loss, **kwargs)

rescale_loss = loss.detach() * self.args.gradient_accumulation_steps / self.current_gradient_accumulation_steps
return rescale_loss
```
|
|
Hi @kaln27, thank you for bringing up this excellent point about training dynamics! It's a very important consideration. You are absolutely right to be concerned about potential instability if an optimizer step is based on a very small final batch (e.g., a single sample); the gradient from such a small batch can indeed be noisy. However, I believe the solution proposed in this PR is the correct way to fix the underlying mathematical bug in the `Trainer`'s loss scaling.

1. Gradient magnitude correctness (the goal of this PR): The purpose of gradient accumulation is to simulate a larger batch size, so the final gradient update should have a magnitude equal to the average of the gradients from the micro-batches that were actually processed. This PR ensures that mathematical correctness: if we were to always divide by `args.gradient_accumulation_steps`, the last, incomplete cycle would produce an update that is scaled down too much.

2. Training stability (the concern you raised): Whether one wants to perform an update based on a noisy, small batch at all is a matter of training strategy. The standard way to handle that is to drop the incomplete final batch, e.g. via the `dataloader_drop_last` option in `TrainingArguments`.

Conclusion: this PR is focused on making the loss scaling for gradient accumulation mathematically correct. I hope this clarifies my approach. Let me know what you think! |
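To make the magnitude argument above concrete, here is a minimal numeric sketch (hypothetical values, not code from the PR or the library):

```python
# Hypothetical numeric sketch: effect of the divisor when the last
# accumulation cycle of an epoch is incomplete.
micro_batch_losses = [2.0, 3.0]      # only 2 micro-batches remain in the epoch
gradient_accumulation_steps = 4      # configured value

# Old behaviour: always divide by the configured number of steps.
old_scaled = sum(l / gradient_accumulation_steps for l in micro_batch_losses)

# Behaviour after this PR: divide by the number of micro-batches actually processed.
actual_steps = len(micro_batch_losses)
new_scaled = sum(l / actual_steps for l in micro_batch_losses)

print(old_scaled)  # 1.25 -> last update scaled down by an extra factor of 2/4
print(new_scaled)  # 2.5  -> true average loss of the processed micro-batches
```

Dividing by the fixed configured value shrinks the final update by the ratio of processed micro-batches to `gradient_accumulation_steps`; dividing by the actual count recovers the intended average.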
fix(trainer): Correct loss scaling for incomplete gradient accumulation steps (huggingface#39659)

* Fix issue [huggingface#38837]: wrong loss scaled in last step of epoch
* chore: trigger CI
* Update src/transformers/trainer.py
  Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update src/transformers/modeling_flash_attention_utils.py
  Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: taihang <taihang@U-2RHYVWX7-2207.local>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
What does this PR do?
This PR addresses an issue where the loss scaling during gradient accumulation is incorrect for the final optimizer step of an epoch if the total number of batches is not perfectly divisible by `gradient_accumulation_steps`.

Currently, the loss for each micro-batch is always divided by the configured `args.gradient_accumulation_steps`. This leads to the accumulated loss for the final, incomplete cycle being scaled down too much, resulting in an improperly small gradient update for that step.

This fix resolves the issue by dynamically tracking the number of micro-batches processed in each accumulation cycle and using this actual count for loss scaling.
The changes are as follows:
- In `_inner_training_loop`, a new instance variable `self.cur_gradient_accumulation_steps` is introduced. It is updated at the start of each optimizer step with the actual number of batches being processed (i.e., `len(batch_samples)`).
- In the `training_step` method, the loss scaling logic now uses this dynamic `self.cur_gradient_accumulation_steps` value instead of the fixed `self.args.gradient_accumulation_steps`.

This ensures that the loss is correctly averaged over the number of batches that actually contributed to the gradient accumulation, regardless of whether the cycle was complete or not. This change has no new dependencies.
Fixes #38837
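For illustration, here is a minimal, self-contained sketch of the mechanism described above, using a hypothetical `MiniTrainer` rather than the real `Trainer` class; only the two hooks touched by this PR are modeled, and the numbers are made up:

```python
# Hypothetical MiniTrainer sketch (not the real Trainer): the loop records the
# true size of each accumulation cycle and training_step divides by that value.

class MiniTrainer:
    def __init__(self, gradient_accumulation_steps: int):
        self.gradient_accumulation_steps = gradient_accumulation_steps
        self.cur_gradient_accumulation_steps = gradient_accumulation_steps

    def training_step(self, loss: float) -> float:
        # Scale by the actual number of micro-batches in the current cycle,
        # not by the configured gradient_accumulation_steps.
        return loss / self.cur_gradient_accumulation_steps

    def inner_training_loop(self, micro_batch_losses: list) -> list:
        accumulated = []
        ga = self.gradient_accumulation_steps
        for start in range(0, len(micro_batch_losses), ga):
            batch_samples = micro_batch_losses[start:start + ga]
            # Mirrors the PR: updated at the start of each optimizer step
            # with the number of micro-batches actually being processed.
            self.cur_gradient_accumulation_steps = len(batch_samples)
            accumulated.append(sum(self.training_step(l) for l in batch_samples))
        return accumulated


trainer = MiniTrainer(gradient_accumulation_steps=4)
# 6 micro-batch losses with GA=4: the second cycle only has 2 micro-batches.
print(trainer.inner_training_loop([1.0, 2.0, 3.0, 4.0, 5.0, 7.0]))
# [2.5, 6.0] -> each entry is the mean loss of its own cycle
```

For a complete cycle, `len(batch_samples)` equals the configured `gradient_accumulation_steps`, so the behaviour is unchanged; only the final, incomplete cycle of an epoch is scaled differently.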
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. (This PR is for issue #38837: Loss is incorrectly scaled in Trainer during the last step with gradient accumulation when the final batch is smaller than accumulation steps.)
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?