Description
Reminder
- I have read the above rules and searched the existing issues.
System Info
Environment:
torch: 2.9.0
transformers: 4.57.1
gptqmodel: 5.6.10
Configuration file:
model_name_or_path: /root/work1/GLM4.6V
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: det_vehicle
template: glm4_5v
cutoff_len: 2048
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/glm46v/lora
logging_steps: 10
save_steps: 200
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
# bf16: true
fp16: false
ddp_timeout: 180000000
resume_from_checkpoint: null
Reproduction
Problem:
[rank0]:[W1229 11:05:16.278463469 ProcessGroupNCCL.cpp:5072] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
  0%|          | 0/64115 [00:03<?, ? examples/s]
[rank0]: multiprocess.pool.RemoteTraceback:
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]: result = (True, func(*args, **kwds))
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/utils/py_utils.py", line 586, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3674, in _map_single
[rank0]: for i, batch in iter_outputs(shard_iterable):
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3624, in iter_outputs
[rank0]: yield i, apply_function(example, i, offset=offset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3547, in apply_function
[rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/processor/supervised.py", line 99, in preprocess_dataset
[rank0]: input_ids, labels = self._encode_data_example(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/processor/supervised.py", line 43, in _encode_data_example
[rank0]: messages = self.template.mm_plugin.process_messages(prompt + response, images, videos, audios, self.processor)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/mm_plugin.py", line 1758, in process_messages
[rank0]: self._validate_input(processor, images, videos, audios)
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/mm_plugin.py", line 189, in _validate_input
[rank0]: raise ValueError("Processor was not found, please check and update your model file.")
[rank0]: ValueError: Processor was not found, please check and update your model file.
[rank0]: """
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 185, in
[rank0]: run_exp()
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 132, in run_exp
[rank0]: _training_function(config={"args": args, "callbacks": callbacks})
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 93, in _training_function
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 52, in run_sft
[rank0]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 314, in get_dataset
[rank0]: dataset = _get_preprocessed_dataset(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 255, in _get_preprocessed_dataset
[rank0]: dataset = dataset.map(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3309, in map
[rank0]: for rank, done, content in iflatmap_unordered(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/utils/py_utils.py", line 626, in iflatmap_unordered
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/multiprocess/pool.py", line 774, in get
[rank0]: raise self._value
[rank0]: ValueError: Processor was not found, please check and update your model file.
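The exception is raised by LLaMA-Factory's `mm_plugin._validate_input` when no processor object is available for the multimodal template. A quick way to narrow it down is to check whether the local checkpoint can provide a processor at all by loading it directly with transformers. A minimal sketch (the path is the `model_name_or_path` from the config above; whether the GLM4.6V directory ships the required processor/preprocessor config files is exactly what this checks):

```python
from transformers import AutoProcessor

model_path = "/root/work1/GLM4.6V"  # same path as model_name_or_path above

# If the checkpoint lacks processor/preprocessor config files, this call
# fails (or falls back to a plain tokenizer), which would explain the
# "Processor was not found" error raised by mm_plugin during preprocessing.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
print(type(processor))
```

If this raises, or prints a plain tokenizer class rather than a multimodal processor, the model files (or the installed transformers version) are missing the processor definition, which would match the error above.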
Others
No response