
Conversation

@nijkah (Contributor) commented Oct 20, 2025

What this does

This PR adds an opt-in auto-scaling feature for distributed training with Accelerate. When enabled via --auto_scale=true and running with multiple processes (GPUs), LeRobot:

  • Multiplies the learning rate by world size (linear LR scaling).
  • Divides the number of training steps by world size (ceil-div).

This keeps total sample count roughly comparable and preserves training dynamics while improving throughput.
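As a concrete sketch of the rule (the helper name and signature below are illustrative, not LeRobot's actual API):

```python
import math

def auto_scale(base_lr: float, total_steps: int, world_size: int) -> tuple[float, int]:
    """Linear LR scaling paired with a ceil-div reduction of the step count.

    Hypothetical helper mirroring the PR's rule; not the actual LeRobot code.
    """
    scaled_lr = base_lr * world_size                    # linear LR scaling
    scaled_steps = math.ceil(total_steps / world_size)  # ceil-div: never drop samples
    return scaled_lr, scaled_steps

# With 2 GPUs the LR doubles and the steps halve, so the total number of
# samples processed stays roughly the same.
print(auto_scale(1e-4, 100_000, world_size=2))  # (0.0002, 50000)
```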

How it was tested

Examples:

accelerate launch \
  --multi_gpu \
  --num_processes=2 \
  $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --policy.repo_id=${HF_USER}/my_trained_policy \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu \
  --wandb.enable=true \
  --auto_scale=true

Tested

  • 2× A6000 GPU training

Summary of changes

  • Add `auto_scale` flag to TrainPipelineConfig
  • Scale the optimizer LR by world size and divide total steps accordingly when using Accelerate with multiple GPUs
  • Apply scaling before optimizer/scheduler creation and before logging the config
  • Update the multi-GPU docs with `--auto_scale=true` usage and explanation
  • Add a multi-GPU test for the auto-scale behavior
  • Use `--auto_scale=true` consistently in examples and text
  • Add a note on checkpoint/eval cadence: optionally scale `save_freq`/`eval_freq` by world size, flagged as pending a maintainer decision
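The "apply scaling before optimizer/scheduler creation" point could look roughly like this in the training entrypoint (the config fields and function names here are assumptions for illustration, not LeRobot's internals):

```python
import math
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Illustrative stand-in for TrainPipelineConfig; field names are assumptions.
    auto_scale: bool = False
    lr: float = 1e-4
    steps: int = 100_000

def apply_auto_scale(cfg: TrainConfig, world_size: int) -> TrainConfig:
    """Adjusts the config in place before the optimizer/scheduler are built
    and before the config is logged, so both see the scaled values."""
    if cfg.auto_scale and world_size > 1:   # opt-in, multi-process runs only
        cfg.lr *= world_size                # linear LR scaling
        cfg.steps = math.ceil(cfg.steps / world_size)
    return cfg

cfg = apply_auto_scale(TrainConfig(auto_scale=True), world_size=2)
print(cfg.lr, cfg.steps)  # 0.0002 50000
```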
Copilot AI review requested due to automatic review settings October 20, 2025 05:32
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR introduces an auto-scaling feature for multi-GPU training that automatically adjusts learning rates and training steps when using multiple processes. When enabled via --auto_scale=true, the system multiplies the learning rate by the number of GPUs and divides training steps proportionally to maintain consistent total sample processing and training dynamics.

Key changes:

  • Added auto_scale boolean flag to training configuration with logic to scale LR and steps based on world size
  • Created comprehensive test suite to verify auto-scaling behavior with multi-GPU setups
  • Updated documentation to explain the new auto-scaling feature and its usage

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Files reviewed

  • src/lerobot/configs/train.py: adds the `auto_scale` configuration field with documentation
  • src/lerobot/scripts/lerobot_train.py: implements the auto-scaling logic for learning rates and training steps
  • tests/training/test_auto_scale.py: adds a test case for auto-scaling with 2 GPUs
  • docs/source/multi_gpu_training.mdx: describes the auto-scaling feature and its usage
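The contents of tests/training/test_auto_scale.py are not shown in this thread, but a test of the behavior would plausibly assert something like the following (the `scale` helper is a stand-in for the logic in lerobot_train.py, not the repository's actual code):

```python
import math

def scale(lr: float, steps: int, world_size: int) -> tuple[float, int]:
    # Stand-in for the scaling applied in the training script.
    return lr * world_size, math.ceil(steps / world_size)

def test_auto_scale_two_gpus():
    lr, steps = scale(1e-4, 100_000, world_size=2)
    assert lr == 2e-4
    assert steps == 50_000

def test_ceil_div_rounds_up():
    # An odd step count must round up so no planned samples are skipped.
    _, steps = scale(1e-4, 100_001, world_size=2)
    assert steps == 50_001

test_auto_scale_two_gpus()
test_ceil_div_rounds_up()
```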


nijkah and others added 3 commits October 20, 2025 14:37
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Hakjin Lee <nijkah@gmail.com>
@nijkah nijkah changed the title [Feature] Auto-scaling for Multi-GPU Training Oct 20, 2025
@imstevenpmwork imstevenpmwork requested a review from pkooij October 20, 2025 08:15
@imstevenpmwork imstevenpmwork added enhancement Suggestions for new features or improvements performance Issues aimed at improving speed or resource usage labels Oct 20, 2025
@pkooij (Member) commented Oct 20, 2025

Hi, thank you for this PR, very nice addition. I am very busy this week, but I will take a look ASAP next week!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
