
[Bug] Single-Node Multi-GPU Training Fails with NCCL Watchdog Timeout despite minimal batch/accumulation (A800 NVLink) #9681

@noob000007

Description


🐛 Bug Description

I am encountering a persistent NCCL Watchdog Collective Timeout error when training Qwen3-Next-80B-A3B-Instruct with LoRA on a single-node 4x A800 SXM4 server.

Despite aggressively reducing per_device_batch_size to 1 or 2 and gradient_accumulation_steps to 1, the training process hangs within the first few steps (Step 0 or Step 1) and eventually crashes after the default 10-minute timeout (600000 ms), terminating with Signal 6 (SIGABRT).

Increasing NCCL_TIMEOUT to 3600 or 7200 seconds appears to be ineffective or bypassed: in some runs the watchdog still reports a timeout at exactly 600000 ms, and in others the process simply hangs indefinitely.
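For reference, the 600000 ms in the log matches PyTorch's default process-group timeout for the NCCL backend (10 minutes), which is fixed when the process group is initialized; NCCL_TIMEOUT does not appear to be a variable that PyTorch or NCCL consume directly, which would explain why raising it changes nothing. A minimal sketch of where the timeout actually lives, assuming a plain torchrun launch rather than the LLaMA-Factory/WebUI code path (with the HF Trainer the equivalent knob should be TrainingArguments(ddp_timeout=...), in seconds):

# Sketch only, not the LLaMA-Factory code path: the watchdog limit is the
# timeout handed to init_process_group, so it has to be raised there.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

def init_nccl_with_long_timeout(seconds: int = 7200) -> None:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=seconds))

Note that a longer timeout only delays the abort; the underlying collective is still stalling, so this is a diagnostic aid rather than a fix.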

📜 Error Log

[rank1]:[E1227 23:09:04.570] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56212, OpType=ALLREDUCE, NumelIn=386531328, NumelOut=386531328, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank1]:[E1227 23:09:04.571] [PG ID 1 PG GUID 1 Rank 1] failure detected by watchdog at work sequence id: 56212 PG status: last enqueued work: 56212, last completed work: 56211
[rank2]:[E1227 23:09:04.716] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56212, OpType=ALLREDUCE, NumelIn=385990656, NumelOut=385990656, Timeout(ms)=600000) ...
...
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout...

⚙️ Environment

  • Hardware: 4x NVIDIA A800-SXM4-80GB (NVLink enabled)
  • OS: Linux (Ubuntu 20.04/22.04)
  • CUDA: 12.4
  • PyTorch Version: (Check with torch.__version__, e.g., 2.4.0+cu121)
  • DeepSpeed Version: 0.16.9
  • LLaMA-Factory Version: (Latest or commit ID)
  • Driver Version: 550.144.03

📋 Reproduction Steps

  1. Launch Command:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_P2P_DISABLE=0  # Tried 0 and 1; see the peer-access check sketched below
export NCCL_IB_DISABLE=0   # Tried 0 and 1
llamafactory-cli webui
  2. Training Config (WebUI):
  • Model: Qwen3-30B-A3B-Instruct (loaded in bf16)
  • Method: LoRA (Rank 32 or 128, trainable params ~100M - 400M)
  • DeepSpeed stage: ZeRO-2
  • Per-device batch size: 2
  • Gradient accumulation: 8 (also tried 1, still crashes)
  • Precision: bf16
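
Since NCCL_P2P_DISABLE was toggled in the launch command above, it may be worth confirming that CUDA peer access between the four devices actually works, independent of LLaMA-Factory or DeepSpeed. A minimal sketch using only plain PyTorch (the script is illustrative, not part of the repo):

# Prints the CUDA peer-access matrix for all visible GPUs.
import torch

def check_peer_access() -> None:
    n = torch.cuda.device_count()
    for src in range(n):
        for dst in range(n):
            if src != dst:
                ok = torch.cuda.can_device_access_peer(src, dst)
                print(f"GPU {src} -> GPU {dst}: peer access {'yes' if ok else 'NO'}")

if __name__ == "__main__":
    check_peer_access()

If any pair reports no peer access, the hang is more likely a driver or fabric issue (nvidia-smi topo -m should show NVLink connections between every pair on an SXM4 board) than a bug in the training stack.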

🕵️ Troubleshooting Tried

  1. Reduced Load: Lowered batch size to 1 and gradient accumulation to 1. Error persists.
  2. Reduced Communication: Lowered LoRA Rank to 32. Error persists.
  3. Environment Variables: Set NCCL_TIMEOUT=7200 and NCCL_ASYNC_ERROR_HANDLING=1. The process still hangs or crashes (a standalone all-reduce check is sketched after this list).
  4. Data: Tried with a single small dataset (1k samples). Error persists.
  5. Memory: GPU memory is sufficient (~60GB/80GB used).
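
Beyond the steps above, the standalone check referenced in item 3 can be a repeated all-reduce at roughly the size of the collective that times out in the log (~386M bf16 elements), run outside DeepSpeed entirely. A minimal sketch, assuming it is saved as a hypothetical allreduce_check.py and launched with torchrun --nproc_per_node=4 (optionally with NCCL_DEBUG=INFO set):

# Hypothetical standalone check: repeated all-reduces of a tensor roughly the
# size of the collective that times out in the attached log.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])  # provided by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=120))
    x = torch.ones(386_531_328, dtype=torch.bfloat16, device="cuda")
    for step in range(20):
        dist.all_reduce(x)
        torch.cuda.synchronize()
        if dist.get_rank() == 0:
            print(f"all_reduce {step} completed")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this loop also stalls, the problem sits below LLaMA-Factory (NCCL, driver, or NVLink fabric); if it completes quickly, the hang is more likely a rank desynchronization inside the training run, e.g. one rank entering the all-reduce at step 0 while another is stuck elsewhere.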

❓ Expected Behavior

Training should proceed past Step 0/Step 1 and log loss metrics, utilizing the NVLink interconnect for efficient communication.


    Labels

    bug: Something isn't working
    pending: This problem is yet to be addressed
