
Model training hangs #9679

@noob000007

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

[INFO|2025-12-27 15:11:11] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.
[2025-12-27 15:11:12,692] [INFO] [config.py:735:init] Config mesh_device None world_size = 4
[2025-12-27 15:11:12,870] [INFO] [config.py:735:init] Config mesh_device None world_size = 4
[2025-12-27 15:11:12,892] [INFO] [config.py:735:init] Config mesh_device None world_size = 4
[2025-12-27 15:11:12,972] [INFO] [config.py:735:init] Config mesh_device None world_size = 4
[2025-12-27 15:11:52,868] [INFO] [partition_parameters.py:348:exit] finished initializing model - num_params = 74391, num_elems = 79.67B
[INFO|2025-12-27 15:23:18] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-12-27 15:23:18] llamafactory.model.model_utils.attention:143 >> Using FlashAttention-2 for faster training and inference.
[INFO|2025-12-27 15:23:18] llamafactory.model.adapter:143 >> DeepSpeed ZeRO3 detected, remaining trainable params in float32.
[INFO|2025-12-27 15:23:18] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
[INFO|2025-12-27 15:23:18] llamafactory.model.model_utils.misc:143 >> Found linear modules: gate_proj,o_proj,in_proj_ba,gate,up_proj,k_proj,down_proj,shared_expert_gate,v_proj,q_proj,in_proj_qkvz,out_proj
[INFO|2025-12-27 15:25:35] llamafactory.model.loader:143 >> trainable params: 6,092,957,184 || all params: 85,767,348,480 || trainable%: 7.1041


[Image: screenshot attachment]

Reproduction

Both the CPU and GPU appear to be busy, but neither the terminal nor the Web UI shows any training progress. Training seems to be stuck; it has been unresponsive all night.
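A first step in diagnosing a hang like this is to find out where each process is actually blocked. The sketch below is a generic Python diagnostic, not part of LLaMA-Factory: it uses the standard-library `faulthandler` module so that sending `SIGUSR1` to a (possibly hung) training process dumps every thread's traceback, which usually reveals whether the ranks are stuck in a collective, data loading, or checkpoint I/O. The signal choice is an assumption; any free signal works.

```python
import faulthandler
import signal
import sys

# Register a handler: `kill -USR1 <pid>` on a running (or hung) process
# now prints every thread's Python traceback to stderr without killing it.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Sanity check: dump the current tracebacks of all threads immediately.
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```

If the hang is below the Python level (e.g. inside an NCCL collective), `py-spy dump --pid <pid>` with `--native` gives the same picture including C stacks. With 4 ranks under ZeRO-3, all ranks need inspecting: a single rank stuck in data loading will silently block the others in a collective.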

Others

No response


Labels: bug (Something isn't working), pending (This problem is yet to be addressed)
