Description
Issue Title:
[Bug] Single-Node Multi-GPU Training Fails with NCCL Watchdog Timeout despite minimal batch/accumulation (A800 NVLink)
Issue Body:
🐛 Bug Description
I am encountering a persistent NCCL Watchdog Collective Timeout error when training Qwen3-Next-80B-A3B-Instruct with LoRA on a single-node 4x A800 SXM4 server.
Despite aggressively reducing per_device_batch_size to 1 or 2 and gradient_accumulation_steps to 1, the training process hangs at the first few steps (Step 0 or Step 1) and eventually crashes after the default 10-minute timeout (600000ms), throwing Signal 6 (SIGABRT).
Increasing NCCL_TIMEOUT to 3600 or 7200 seconds appears to be ineffective or bypassed: in some runs the watchdog still reports a timeout at exactly 600000 ms, and in others the process simply hangs indefinitely.
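For context, the 600000 ms figure in the log matches PyTorch's default 10-minute NCCL watchdog timeout, which is fixed when the process group is created; a shell-level NCCL_TIMEOUT export is not necessarily read at that point, which would explain why it looks bypassed. Below is a minimal sketch of raising the limit at the API level (illustrative only; in LLaMA-Factory/DeepSpeed the process group is created internally, so the value has to be threaded through their configuration rather than called directly):

```python
# Sketch: the watchdog limit is the timeout passed when the NCCL process
# group is initialized, not an NCCL setting read at collective time.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # raises the default 10-minute watchdog limit
)
```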
📜 Error Log
[rank1]:[E1227 23:09:04.570] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56212, OpType=ALLREDUCE, NumelIn=386531328, NumelOut=386531328, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank1]:[E1227 23:09:04.571] [PG ID 1 PG GUID 1 Rank 1] failure detected by watchdog at work sequence id: 56212 PG status: last enqueued work: 56212, last completed work: 56211
[rank2]:[E1227 23:09:04.716] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56212, OpType=ALLREDUCE, NumelIn=385990656, NumelOut=385990656, Timeout(ms)=600000) ...
...
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout...
⚙️ Environment
- Hardware: 4x NVIDIA A800-SXM4-80GB (NVLink enabled)
- OS: Linux (Ubuntu 20.04/22.04)
- CUDA: 12.4
- PyTorch Version: (check with torch.__version__, e.g., 2.4.0+cu121; see the snippet after this list)
- DeepSpeed Version: 0.16.9
- LLaMA-Factory Version: (Latest or commit ID)
- Driver Version: 550.144.03
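The placeholder entries above can be filled in by running something like the following in the training environment:

```python
# Quick dump of the versions left as placeholders in the environment list above.
import torch
import deepspeed

print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("DeepSpeed:", deepspeed.__version__)
print("GPUs:", [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
```

Recent LLaMA-Factory releases also provide an llamafactory-cli env subcommand that prints a similar summary.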
📋 Reproduction Steps
- Launch Command:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_P2P_DISABLE=0 # Tried 0 and 1
export NCCL_IB_DISABLE=0 # Tried 0 and 1
llamafactory-cli webui
- Training Config (WebUI):
- Model: Qwen3-30B-A3B-Instruct (loaded in bf16)
- Method: LoRA (Rank 32 or 128, trainable params ~100M - 400M)
- Stage: DeepSpeed ZeRO-2 (an equivalent config is sketched after this list)
- Per-device batch size: 2
- Gradient accumulation: 8 (also tried 1, still crashes)
- Precision: bf16
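For reference, the ZeRO-2 stage selected above corresponds to a DeepSpeed configuration roughly like the sketch below; this is an assumption about what the WebUI generates (the keys are standard DeepSpeed options and the "auto" values are resolved by the trainer), not a dump of the actual file:

```python
# Hypothetical reconstruction of the ZeRO-2 config passed to DeepSpeed;
# "auto" values are filled in by the HF/LLaMA-Factory trainer at runtime.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,         # overlap the gradient all-reduce with backward
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8,    # bucket size (in elements) for the all-reduce
    },
}

with open("ds_z2_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The ~386M-element ALLREDUCE operations in the log are consistent with this kind of bucketed gradient reduction under ZeRO-2.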
🕵️ Troubleshooting Tried
- Reduced Load: Lowered batch size to 1 and gradient accumulation to 1. Error persists.
- Reduced Communication: Lowered LoRA Rank to 32. Error persists.
- Environment Variables: Set NCCL_TIMEOUT=7200 and NCCL_ASYNC_ERROR_HANDLING=1 (see the diagnostic sketch after this list). The process still hangs or crashes.
- Data: Tried with a single small dataset (1k samples). Error persists.
- Memory: GPU memory is sufficient (~60GB/80GB used).
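Since the variables above did not change the behaviour, one way to get more signal is to enable NCCL and watchdog diagnostics before launching. The variable names below use the PyTorch 2.x spellings and are a suggestion, not something LLaMA-Factory sets itself; they can equally be exported in the shell before llamafactory-cli webui:

```python
# Sketch: set diagnostic variables before any CUDA/NCCL initialization,
# e.g. in a small wrapper that then invokes the LLaMA-Factory entry point.
import os

os.environ["NCCL_DEBUG"] = "INFO"              # per-rank NCCL log lines
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"  # focus on init and collectives
os.environ["TORCH_NCCL_DESYNC_DEBUG"] = "1"    # report which rank fell behind on timeout
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"
```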
❓ Expected Behavior
Training should proceed past Step 0/Step 1 and log loss metrics, utilizing the NVLink interconnect for efficient communication.
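To separate an interconnect-level problem from a framework-level one, a bare all-reduce of roughly the same size as the one that times out can be run outside the trainer. The script below is a sketch (the file name and element count are illustrative) and is launched with torchrun --nproc_per_node=4:

```python
# Standalone smoke test (sketch, not part of LLaMA-Factory): launch with
#   torchrun --nproc_per_node=4 nccl_smoke_test.py
# to check whether a plain NCCL all-reduce of a similar size completes
# over NVLink outside the trainer.
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))
    # Roughly the element count of the all-reduce that times out (~386M, in bf16).
    x = torch.ones(386_531_328, dtype=torch.bfloat16, device=f"cuda:{local_rank}")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        print("all_reduce completed, value per element:", x[0].item())
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```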
Reproduction
Put your message here.
Others
No response