
[Bug] Single-Node Multi-GPU Training Fails with NCCL Watchdog Timeout despite minimal batch/accumulation (A800 NVLink) #9681

@noob000007

Description


🐛 Bug Description

I am encountering a persistent NCCL Watchdog Collective Timeout error when training Qwen3-Next-80B-A3B-Instruct with LoRA on a single-node 4x A800 SXM4 server.

Despite aggressively reducing per_device_batch_size to 1 or 2 and gradient_accumulation_steps to 1, the training process hangs within the first few steps (Step 0 or Step 1) and eventually crashes after the default 10-minute timeout (600000 ms), terminating with Signal 6 (SIGABRT).

Increasing NCCL_TIMEOUT to 3600 or 7200 seconds appears to be ineffective or bypassed: in some runs the watchdog still reports a timeout at exactly 600000 ms, and in others the process simply hangs indefinitely.
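For reference, the 600000 ms in the log matches PyTorch's default process-group timeout for the NCCL backend (10 minutes), which is fixed when the process group is initialized; NCCL_TIMEOUT does not appear to be a variable that PyTorch or NCCL consume directly, which would explain why raising it changes nothing. A minimal sketch of where the timeout actually lives, assuming a plain torchrun launch rather than the LLaMA-Factory/WebUI code path (with the HF Trainer the equivalent knob should be TrainingArguments(ddp_timeout=...), in seconds):

# Sketch only, not the LLaMA-Factory code path: the watchdog limit is the
# timeout handed to init_process_group, so it has to be raised there.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

def init_nccl_with_long_timeout(seconds: int = 7200) -> None:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=seconds))

Note that a longer timeout only delays the abort; the underlying collective is still stalling, so this is a diagnostic aid rather than a fix.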

📜 Error Log

[rank1]:[E1227 23:09:04.570] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56212, OpType=ALLREDUCE, NumelIn=386531328, NumelOut=386531328, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank1]:[E1227 23:09:04.571] [PG ID 1 PG GUID 1 Rank 1] failure detected by watchdog at work sequence id: 56212 PG status: last enqueued work: 56212, last completed work: 56211
[rank2]:[E1227 23:09:04.716] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56212, OpType=ALLREDUCE, NumelIn=385990656, NumelOut=385990656, Timeout(ms)=600000) ...
...
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout...

⚙️ Environment

  • Hardware: 4x NVIDIA A800-SXM4-80GB (NVLink enabled)
  • OS: Linux (Ubuntu 20.04/22.04)
  • CUDA: 12.4
  • PyTorch Version: (Check with torch.__version__, e.g., 2.4.0+cu121)
  • DeepSpeed Version: 0.16.9
  • LLaMA-Factory Version: (Latest or commit ID)
  • Driver Version: 550.144.03

📋 Reproduction Steps

  1. Launch Command:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_P2P_DISABLE=0  # Tried 0 and 1; see the peer-access check sketched below
export NCCL_IB_DISABLE=0   # Tried 0 and 1
llamafactory-cli webui
  2. Training Config (WebUI):
  • Model: Qwen3-30B-A3B-Instruct (loaded in bf16)
  • Method: LoRA (Rank 32 or 128, trainable params ~100M - 400M)
  • DeepSpeed stage: ZeRO-2
  • Per-device batch size: 2
  • Gradient accumulation: 8 (also tried 1, still crashes)
  • Precision: bf16
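
Since NCCL_P2P_DISABLE was toggled in the launch command above, it may be worth confirming that CUDA peer access between the four devices actually works, independent of LLaMA-Factory or DeepSpeed. A minimal sketch using only plain PyTorch (the script is illustrative, not part of the repo):

# Prints the CUDA peer-access matrix for all visible GPUs.
import torch

def check_peer_access() -> None:
    n = torch.cuda.device_count()
    for src in range(n):
        for dst in range(n):
            if src != dst:
                ok = torch.cuda.can_device_access_peer(src, dst)
                print(f"GPU {src} -> GPU {dst}: peer access {'yes' if ok else 'NO'}")

if __name__ == "__main__":
    check_peer_access()

If any pair reports no peer access, the hang is more likely a driver or fabric issue (nvidia-smi topo -m should show NVLink connections between every pair on an SXM4 board) than a bug in the training stack.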

🕵️ Troubleshooting Tried

  1. Reduced Load: Lowered batch size to 1 and gradient accumulation to 1. Error persists.
  2. Reduced Communication: Lowered LoRA Rank to 32. Error persists.
  3. Environment Variables: Set NCCL_TIMEOUT=7200 and NCCL_ASYNC_ERROR_HANDLING=1. The process still hangs or crashes (a standalone all-reduce check is sketched after this list).
  4. Data: Tried with a single small dataset (1k samples). Error persists.
  5. Memory: GPU memory is sufficient (~60GB/80GB used).
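
Beyond the steps above, the standalone check referenced in item 3 can be a repeated all-reduce at roughly the size of the collective that times out in the log (~386M bf16 elements), run outside DeepSpeed entirely. A minimal sketch, assuming it is saved as a hypothetical allreduce_check.py and launched with torchrun --nproc_per_node=4 (optionally with NCCL_DEBUG=INFO set):

# Hypothetical standalone check: repeated all-reduces of a tensor roughly the
# size of the collective that times out in the attached log.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])  # provided by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=120))
    x = torch.ones(386_531_328, dtype=torch.bfloat16, device="cuda")
    for step in range(20):
        dist.all_reduce(x)
        torch.cuda.synchronize()
        if dist.get_rank() == 0:
            print(f"all_reduce {step} completed")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this loop also stalls, the problem sits below LLaMA-Factory (NCCL, driver, or NVLink fabric); if it completes quickly, the hang is more likely a rank desynchronization inside the training run, e.g. one rank entering the all-reduce at step 0 while another is stuck elsewhere.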

❓ Expected Behavior

Training should proceed past Step 0/Step 1 and log loss metrics, utilizing the NVLink interconnect for efficient communication.


    Labels

    bug: Something isn't working
    pending: This problem is yet to be addressed
