
Model file error after training the glm4.6V model #9691

@yangqiheng2019

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Environment:
torch: 2.9.0
transformers: 4.57.1
gptqmodel: 5.6.10

Config file:
model_name_or_path: /root/work1/GLM4.6V
trust_remote_code: true

### method

stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset

dataset: det_vehicle
template: glm4_5v
cutoff_len: 2048
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output

output_dir: saves/glm46v/lora
logging_steps: 10
save_steps: 200
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
#bf16: true
fp16: false
ddp_timeout: 180000000
resume_from_checkpoint: null

Reproduction

Problem:
[rank0]:[W1229 11:05:16.278463469 ProcessGroupNCCL.cpp:5072] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
[rank0]: multiprocess.pool.RemoteTraceback:   0%| | 0/64115 [00:03<?, ? examples/s]
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]: result = (True, func(*args, **kwds))
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/utils/py_utils.py", line 586, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3674, in _map_single
[rank0]: for i, batch in iter_outputs(shard_iterable):
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3624, in iter_outputs
[rank0]: yield i, apply_function(example, i, offset=offset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3547, in apply_function
[rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/processor/supervised.py", line 99, in preprocess_dataset
[rank0]: input_ids, labels = self._encode_data_example(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/processor/supervised.py", line 43, in _encode_data_example
[rank0]: messages = self.template.mm_plugin.process_messages(prompt + response, images, videos, audios, self.processor)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/mm_plugin.py", line 1758, in process_messages
[rank0]: self._validate_input(processor, images, videos, audios)
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/mm_plugin.py", line 189, in _validate_input
[rank0]: raise ValueError("Processor was not found, please check and update your model file.")
[rank0]: ValueError: Processor was not found, please check and update your model file.
[rank0]: """

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 185, in <module>
[rank0]: run_exp()
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 132, in run_exp
[rank0]: _training_function(config={"args": args, "callbacks": callbacks})
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 93, in _training_function
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 52, in run_sft
[rank0]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 314, in get_dataset
[rank0]: dataset = _get_preprocessed_dataset(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 255, in _get_preprocessed_dataset
[rank0]: dataset = dataset.map(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3309, in map
[rank0]: for rank, done, content in iflatmap_unordered(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/datasets/utils/py_utils.py", line 626, in iflatmap_unordered
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/llm_factory/lib/python3.12/site-packages/multiprocess/pool.py", line 774, in get
[rank0]: raise self._value
[rank0]: ValueError: Processor was not found, please check and update your model file.
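The ValueError is raised by LLaMA-Factory's multimodal plugin when no usable processor comes back for the checkpoint. One way to narrow this down outside the trainer is to try loading the processor directly; the `check_processor` helper below is illustrative (not part of LLaMA-Factory), and the path is the one from the config above:

```python
def check_processor(model_path: str):
    """Try to load the checkpoint's multimodal processor the same way
    Transformers would, and report what is missing. This mirrors (but is
    not) the check behind the "Processor was not found" error above."""
    try:
        from transformers import AutoProcessor

        processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
    except Exception as exc:  # missing processor files, unsupported arch, etc.
        print(f"processor failed to load: {exc}")
        return None
    # Multimodal templates need an image processor to encode the images.
    if getattr(processor, "image_processor", None) is None:
        print("processor loaded, but it exposes no image_processor")
    return processor

# e.g. check_processor("/root/work1/GLM4.6V")
```

If this prints a load error, a likely cause is that the checkpoint directory is missing its processor configuration (e.g. a preprocessor_config.json) or that the installed transformers version does not recognize the architecture; it is also worth double-checking that the `glm4_5v` template used in the config is the intended one for this checkpoint.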

Others

No response


    Labels

    bug (Something isn't working), pending (This problem is yet to be addressed)
