
Training was running normally, then suddenly crashed with: raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) #9695

@zkailinzhang

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Fine-tuning qwen3vl-think-32b on 8× A100 GPUs.

Reproduction

FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup llamafactory-cli train examples/train_lora/qwen3_vl_think_lora_sft.yaml > train_qwen3vl_think_1229.log 2>&1 &
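
Worth noting: the signal number in the logs below is 1, i.e. SIGHUP, which can still reach a nohup'd background job if the hangup is delivered to the whole process group (e.g. when the SSH session or terminal closes). As a point of comparison, here is a minimal, hypothetical relaunch sketch (not part of LLaMA-Factory) that fully detaches the trainer into its own session via the standard library; the paths, env vars, and log name are copied from the command above:

```python
import os
import subprocess

# Start the trainer in a brand-new session (start_new_session=True calls
# setsid), so a SIGHUP sent to the login shell's process group when the
# terminal/SSH session closes cannot reach the job.
env = dict(os.environ, FORCE_TORCHRUN="1",
           CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7")

with open("train_qwen3vl_think_1229.log", "ab") as log:
    proc = subprocess.Popen(
        ["llamafactory-cli", "train",
         "examples/train_lora/qwen3_vl_think_lora_sft.yaml"],
        env=env,
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # detach from the controlling terminal
    )

print(f"detached trainer started, pid={proc.pid}")
```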

Others

{'loss': 0.0588, 'grad_norm': 0.6597367525100708, 'learning_rate': 4.399193785696366e-06, 'epoch': 23.39}

58%|█████▊ | 1661/2840 [6:59:01<4:53:05, 14.92s/it]
59%|█████▊ | 1662/2840 [6:59:16<4:51:54, 14.87s/it]

{'loss': 0.0554, 'grad_norm': 0.6498427391052246, 'learning_rate': 4.393093243607054e-06, 'epoch': 23.41}

59%|█████▊ | 1662/2840 [6:59:16<4:51:54, 14.87s/it]
59%|█████▊ | 1663/2840 [6:59:31<4:51:53, 14.88s/it]

{'loss': 0.0605, 'grad_norm': 0.6059134602546692, 'learning_rate': 4.386993618371275e-06, 'epoch': 23.42}

59%|█████▊ | 1663/2840 [6:59:31<4:51:53, 14.88s/it]
W1229 16:43:25.650000 190834 /appdata//envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py:723] Received 1 death signal, shutting down workers
W1229 16:43:25.654000 190834 /appdata//envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 190908 closing signal SIGHUP
W1229 16:43:25.655000 190834 /appdata//envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 190909 closing signal SIGHUP
W1229 16:43:25.660000 190834 /appdata//envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 190910 closing signal SIGHUP
W1229 16:43:25.676000 190834 /appdata//envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 190911 closing signal SIGHUP
W1229 16:43:25.679000 190834 /appdata//envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 190912 closing signal SIGHUP
W1229 16:43:25.682000 190834 /appdata//envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 190914 closing signal SIGHUP
W1229 16:43:25.687000 190834 /appdata//envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 190915 closing signal SIGHUP
W1229 16:43:25.692000 190834 /appdata//envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 190916 closing signal SIGHUP
Traceback (most recent call last):
  File "/home//.conda/envs/llamafactory/bin/torchrun", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home//.conda/envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home//.conda/envs/llamafactory/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home//.conda/envs/llamafactory/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home//.conda/envs/llamafactory/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//.conda/envs/llamafactory/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/home//.conda/envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home//.conda/envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 715, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home//.conda/envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 879, in _invoke_run
    time.sleep(monitor_interval)
  File "/home//.conda/envs/llamafactory/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 190834 got signal: 1
Traceback (most recent call last):
  File "/home//.conda/envs/llamafactory/bin/llamafactory-cli", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/appdata//LLaMA-Factory/src/llamafactory/cli.py", line 24, in main
    launcher.launch()
  File "/appdata//LLaMA-Factory/src/llamafactory/launcher.py", line 115, in launch
    process = subprocess.run(
              ^^^^^^^^^^^^^^^
  File "/home//.conda/envs/llamafactory/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '8', '--master_addr', '127.0.0.1', '--master_port', '56569', '/appdata/zhangkailin/LLaMA-Factory/src/llamafactory/launcher.py', 'examples/train_lora/qwen3_vl_think_lora_sft_genn_ok.yaml']' returned non-zero exit status 1.
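
For reference, the signal number in `SignalException: Process 190834 got signal: 1` decodes to SIGHUP, and the handler named in the traceback (`_terminate_process_handler`) is what turns that signal into an exception after the agent fans SIGHUP out to its eight workers (the W1229 lines above). A quick standard-library check, plus a sketch of that handler pattern (the handler body here is illustrative, not torch's actual code):

```python
import os
import signal

# Decode the number torchrun reported: signal 1 is SIGHUP ("Hangup").
print(signal.Signals(1).name)   # SIGHUP
print(signal.strsignal(1))      # Hangup

# Sketch of the pattern seen in the traceback: the elastic agent installs a
# signal handler that raises, so a delivered SIGHUP surfaces as an exception
# instead of silently killing the launcher process.
def terminate_handler(signum, frame):
    raise RuntimeError(f"Process {os.getpid()} got signal: {signum}")

signal.signal(signal.SIGHUP, terminate_handler)
```

Since the SIGHUP arrived from outside the training loop mid-epoch, this usually points at the launch environment (terminal or session teardown) rather than the training code itself.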


    Labels

    bug (Something isn't working) · pending (This problem is yet to be addressed)
