
Conversation

@hhaAndroid
Collaborator

No description provided.

@hhaAndroid hhaAndroid requested a review from LZHgrla March 11, 2024 07:52
@choyakawa

Not working with ZeRO-3: #432 (comment)

@hhaAndroid
Collaborator Author

> Not working with ZeRO-3: #432 (comment)

QLoRA does not currently support ZeRO-3.

@choyakawa

> Not working with ZeRO-3: #432 (comment)
>
> QLoRA does not currently support ZeRO-3.

It is not an issue with 4-bit quantization. I used full fine-tuning with no LoRA; however, the 'newline' parameter is somehow still not compatible with ZeRO-3.
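For context, ZeRO-3 keeps only a shard of every parameter on each rank and materializes the full tensor only inside a gathered forward/backward pass, so a parameter read outside such a context (as an evaluation hook might do) can appear empty on some ranks. A toy illustration of the partitioning idea (pure Python, not DeepSpeed code; `shard` is a hypothetical helper):

```python
# Minimal sketch of ZeRO-3-style parameter partitioning (illustrative only,
# not DeepSpeed code): each rank stores just its own slice of the parameter,
# and that slice can be empty when world_size exceeds the parameter length.
def shard(param, rank, world_size):
    """Return the slice of `param` owned by `rank`."""
    n = len(param)
    per_rank = -(-n // world_size)  # ceiling division
    return param[rank * per_rank:(rank + 1) * per_rank]

full = list(range(10))        # a 10-element "parameter"
print(shard(full, 0, 2))      # rank 0 of 2 owns the first half
print(shard(full, 15, 16))    # with 16 ranks, rank 15 owns nothing: []
```

If this is the cause, the usual remedy is to gather the parameter before reading it, e.g. with DeepSpeed's `deepspeed.zero.GatheredParameters` context manager, though whether that fits this hook is a judgment for the PR authors.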

@awzhgw

awzhgw commented Apr 24, 2024

@hhaAndroid @tpoisonooo When will this PR be merged? I need it urgently.

@awzhgw

awzhgw commented Apr 24, 2024

@hhaAndroid

When I start pretraining, I get an error. What does this error mean?

```
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 270, in run
    self.runner.call_hook('before_train')
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/_flexible_runner.py", line 1271, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 221, in before_train
    self._generate_samples(runner, max_new_tokens=50)
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 207, in _generate_samples
    self._eval_images(runner, model, device, max_new_tokens,
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/anyshape_evaluate_chat_hook.py", line 53, in _eval_images
    image_features = model.preprocess_for_pixel_values({
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/model/anyshape_llava.py", line 109, in preprocess_for_pixel_values
    self.image_newline[:, None, None].expand(
RuntimeError: The expanded size of the tensor (4096) must match the existing size (0) at non-singleton dimension 0.  Target sizes: [4096, 32, 1].  Tensor sizes: [0, 1, 1]
[2024-04-24 13:52:01,685] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2444983) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
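The traceback shows `image_newline` arriving with zero elements (`Tensor sizes: [0, 1, 1]`), which is consistent with a ZeRO-3-partitioned parameter being read outside a gather context. The failing broadcast can be reproduced in isolation; a sketch using NumPy for illustration, with shapes taken from the error message:

```python
import numpy as np

embed_dim, num_patches = 4096, 32  # target sizes from the error message

# Healthy case: image_newline has its full (embed_dim,) shape, so it views
# as (embed_dim, 1, 1) and broadcasts cleanly to (embed_dim, num_patches, 1).
newline = np.zeros(embed_dim)
out = np.broadcast_to(newline[:, None, None], (embed_dim, num_patches, 1))
print(out.shape)  # (4096, 32, 1)

# ZeRO-3 case: the locally visible parameter is empty, shape (0,), so it
# views as (0, 1, 1); dimension 0 is neither 1 nor 4096, so the broadcast
# is rejected, matching the RuntimeError above.
sharded = np.zeros(0)
try:
    np.broadcast_to(sharded[:, None, None], (embed_dim, num_patches, 1))
except ValueError as e:
    print("broadcast failed:", e)
```

This only demonstrates the shape arithmetic; the actual fix would need to ensure the parameter is fully gathered before the hook reads it.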
@awzhgw

awzhgw commented Apr 24, 2024

@hhaAndroid Can this PR support Llama 3 8B?
