
Model training hangs #9679

@noob000007

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

[INFO|2025-12-27 15:11:11] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.
[2025-12-27 15:11:12,692] [INFO] [config.py:735:init] Config mesh_device None world_size = 4
[2025-12-27 15:11:12,870] [INFO] [config.py:735:init] Config mesh_device None world_size = 4
[2025-12-27 15:11:12,892] [INFO] [config.py:735:init] Config mesh_device None world_size = 4
[2025-12-27 15:11:12,972] [INFO] [config.py:735:init] Config mesh_device None world_size = 4
[2025-12-27 15:11:52,868] [INFO] [partition_parameters.py:348:exit] finished initializing model - num_params = 74391, num_elems = 79.67B
[INFO|2025-12-27 15:23:18] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-12-27 15:23:18] llamafactory.model.model_utils.attention:143 >> Using FlashAttention-2 for faster training and inference.
[INFO|2025-12-27 15:23:18] llamafactory.model.adapter:143 >> DeepSpeed ZeRO3 detected, remaining trainable params in float32.
[INFO|2025-12-27 15:23:18] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
[INFO|2025-12-27 15:23:18] llamafactory.model.model_utils.misc:143 >> Found linear modules: gate_proj,o_proj,in_proj_ba,gate,up_proj,k_proj,down_proj,shared_expert_gate,v_proj,q_proj,in_proj_qkvz,out_proj
[INFO|2025-12-27 15:25:35] llamafactory.model.loader:143 >> trainable params: 6,092,957,184 || all params: 85,767,348,480 || trainable%: 7.1041


[Image: screenshot attachment]

Reproduction

Both the CPU and GPU appear to be busy, but neither the terminal nor the Web UI shows any training progress. Training seems to be stuck; it has been unresponsive all night.
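A first step in diagnosing a hang like this is to find out where each process is actually blocked. The sketch below is a generic Python diagnostic, not part of LLaMA-Factory: it uses the standard-library `faulthandler` module so that sending `SIGUSR1` to a (possibly hung) training process dumps every thread's traceback, which usually reveals whether the ranks are stuck in a collective, data loading, or checkpoint I/O. The signal choice is an assumption; any free signal works.

```python
import faulthandler
import signal
import sys

# Register a handler: `kill -USR1 <pid>` on a running (or hung) process
# now prints every thread's Python traceback to stderr without killing it.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Sanity check: dump the current tracebacks of all threads immediately.
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```

If the hang is below the Python level (e.g. inside an NCCL collective), `py-spy dump --pid <pid>` with `--native` gives the same picture including C stacks. With 4 ranks under ZeRO-3, all ranks need inspecting: a single rank stuck in data loading will silently block the others in a collective.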

Others

No response


Labels: bug (Something isn't working), pending (This problem is yet to be addressed)
