
Conversation

@nijkah (Contributor) commented Oct 20, 2025

What this does

This PR adds an opt-in auto-scaling feature for distributed training with Accelerate. When enabled via --auto_scale=true and running with multiple processes (GPUs), LeRobot:

  • Multiplies the learning rate by world size (linear LR scaling).
  • Divides the number of training steps by world size (ceil-div).

This keeps total sample count roughly comparable and preserves training dynamics while improving throughput.
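As a concrete sketch of the rule (the helper name and signature below are illustrative, not LeRobot's actual API):

```python
import math

def auto_scale(base_lr: float, total_steps: int, world_size: int) -> tuple[float, int]:
    """Linear LR scaling paired with a ceil-div reduction of the step count.

    Hypothetical helper mirroring the PR's rule; not the actual LeRobot code.
    """
    scaled_lr = base_lr * world_size                    # linear LR scaling
    scaled_steps = math.ceil(total_steps / world_size)  # ceil-div: never drop samples
    return scaled_lr, scaled_steps

# With 2 GPUs the LR doubles and the steps halve, so the total number of
# samples processed stays roughly the same.
print(auto_scale(1e-4, 100_000, world_size=2))  # (0.0002, 50000)
```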

How it was tested

Examples:

accelerate launch \
  --multi_gpu \
  --num_processes=2 \
  $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --policy.repo_id=${HF_USER}/my_trained_policy \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu \
  --wandb.enable=true \
  --auto_scale=true

Tested

  • 2× A6000 GPU training

Summary of changes

  • Add `auto_scale` flag to TrainPipelineConfig
  • Scale the optimizer LR by world size and divide total steps accordingly when using Accelerate with multiple GPUs
  • Apply scaling before optimizer/scheduler creation and before logging the config
  • Update the multi-GPU docs with `--auto_scale=true` usage and explanation
  • Add a multi-GPU test for the auto-scale behavior
  • Use `--auto_scale=true` consistently in examples and text
  • Add a note on checkpoint/eval cadence: optionally scale `save_freq`/`eval_freq` by world size, flagged as pending a maintainer decision
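The "apply scaling before optimizer/scheduler creation" point could look roughly like this in the training entrypoint (the config fields and function names here are assumptions for illustration, not LeRobot's internals):

```python
import math
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Illustrative stand-in for TrainPipelineConfig; field names are assumptions.
    auto_scale: bool = False
    lr: float = 1e-4
    steps: int = 100_000

def apply_auto_scale(cfg: TrainConfig, world_size: int) -> TrainConfig:
    """Adjusts the config in place before the optimizer/scheduler are built
    and before the config is logged, so both see the scaled values."""
    if cfg.auto_scale and world_size > 1:   # opt-in, multi-process runs only
        cfg.lr *= world_size                # linear LR scaling
        cfg.steps = math.ceil(cfg.steps / world_size)
    return cfg

cfg = apply_auto_scale(TrainConfig(auto_scale=True), world_size=2)
print(cfg.lr, cfg.steps)  # 0.0002 50000
```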
Copilot AI review requested due to automatic review settings October 20, 2025 05:32
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR introduces an auto-scaling feature for multi-GPU training that automatically adjusts learning rates and training steps when using multiple processes. When enabled via --auto_scale=true, the system multiplies the learning rate by the number of GPUs and divides training steps proportionally to maintain consistent total sample processing and training dynamics.

Key changes:

  • Added auto_scale boolean flag to training configuration with logic to scale LR and steps based on world size
  • Created comprehensive test suite to verify auto-scaling behavior with multi-GPU setups
  • Updated documentation to explain the new auto-scaling feature and its usage

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Files reviewed

  • src/lerobot/configs/train.py: adds the `auto_scale` configuration field with documentation
  • src/lerobot/scripts/lerobot_train.py: implements the auto-scaling logic for learning rates and training steps
  • tests/training/test_auto_scale.py: adds a test case for auto-scaling with 2 GPUs
  • docs/source/multi_gpu_training.mdx: describes the auto-scaling feature and its usage
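The contents of tests/training/test_auto_scale.py are not shown in this thread, but a test of the behavior would plausibly assert something like the following (the `scale` helper is a stand-in for the logic in lerobot_train.py, not the repository's actual code):

```python
import math

def scale(lr: float, steps: int, world_size: int) -> tuple[float, int]:
    # Stand-in for the scaling applied in the training script.
    return lr * world_size, math.ceil(steps / world_size)

def test_auto_scale_two_gpus():
    lr, steps = scale(1e-4, 100_000, world_size=2)
    assert lr == 2e-4
    assert steps == 50_000

def test_ceil_div_rounds_up():
    # An odd step count must round up so no planned samples are skipped.
    _, steps = scale(1e-4, 100_001, world_size=2)
    assert steps == 50_001

test_auto_scale_two_gpus()
test_ceil_div_rounds_up()
```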


nijkah and others added 3 commits October 20, 2025 14:37
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Hakjin Lee <nijkah@gmail.com>
@nijkah nijkah changed the title [Feature] Auto-scaling for Multi-GPU Training Oct 20, 2025
@imstevenpmwork imstevenpmwork requested a review from pkooij October 20, 2025 08:15
@imstevenpmwork imstevenpmwork added enhancement Suggestions for new features or improvements performance Issues aimed at improving speed or resource usage labels Oct 20, 2025
@pkooij (Member) commented Oct 20, 2025

Hi, thank you for this PR, very nice addition. I am very busy this week, but I will take a look ASAP next week!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
