Multi-GPU Fine-tuning with Unsloth
Learn how to fine-tune LLMs across multiple GPUs using parallelism with Unsloth.
See our new Distributed Data Parallel (DDP) multi-GPU Guide here.
Loading with pipeline parallelism / model splitting
```python
from unsloth import FastLanguageModel

# device_map = "balanced" splits the model's layers across all visible GPUs,
# so a large model can be loaded even when it does not fit on a single GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct",
    load_in_4bit = True,   # 4-bit quantization to reduce VRAM usage
    device_map = "balanced",
)
```
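To build intuition for what `device_map = "balanced"` does, here is a simplified sketch of a balanced split: assign each transformer layer to a GPU so every GPU holds roughly the same number of layers. This is an illustration only, not Accelerate's actual allocator, which also weighs per-module memory sizes; the function name and layer-naming scheme are assumptions for the example.

```python
def balanced_device_map(num_layers: int, num_gpus: int) -> dict:
    """Toy illustration of a 'balanced' split: spread num_layers
    transformer layers evenly across num_gpus devices. The real
    Accelerate allocator also accounts for per-module memory."""
    device_map = {}
    per_gpu = num_layers / num_gpus
    for layer in range(num_layers):
        # Layer names follow the common Hugging Face "model.layers.N" pattern.
        device_map[f"model.layers.{layer}"] = int(layer / per_gpu)
    return device_map

# Example: an 80-layer model (like Llama 70B) split across 4 GPUs
dmap = balanced_device_map(80, 4)
print(dmap["model.layers.0"], dmap["model.layers.79"])  # layers land on GPU 0 and GPU 3
```

With this split, each of the 4 GPUs holds 20 consecutive layers; during a forward pass, activations flow from one GPU to the next in pipeline fashion.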