vLLM weight scale FP8 and standby override #354
Conversation
Summary of Changes

Hello @Datta0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request refines Unsloth's integration with vLLM by adapting to recent modifications in vLLM's FP8 weight scale processing and introducing an explicit override for GPU memory utilization in standby mode. These adjustments maintain correct functionality across different vLLM versions and offer more granular control over system resource management.
Code Review
This pull request introduces two main changes: it adapts to a recent change in vLLM regarding FP8 weight scale transposition and adds a flag to override the standby GPU utilization setting.
My review focuses on the correctness and robustness of the vLLM adaptation. I've found a potentially critical issue in the logic for handling the weight-scale transpose, which might lead to incorrect model weights, and I've included a suggestion to improve the robustness of the version detection. There are also some minor code cleanup suggestions. The standby override feature looks good.
```python
# https://github.com/vllm-project/vllm/commit/f9a4087182ffcd9404779fcda876f820b3b26d5f#diff-cce58c0ceb6a9b15a01f117d734b93736acc25ed89921c2eacc58ea05bd34d0eL1155-L1157
from vllm.model_executor.layers.quantization.utils.fp8_utils import maybe_post_process_fp8_weight_block
from inspect import getsource
needs_transpose_check = 'layer.weight_scale.data.T.contiguous()' in getsource(maybe_post_process_fp8_weight_block)
```
This check using getsource is fragile and can easily break with vLLM updates. A version check would be more robust.

More importantly, the logic seems inverted: needs_transpose_check is True for older vLLM versions that already perform the transpose. Your code then performs another transpose (on line 948), resulting in a double transpose. The transpose should only happen for newer vLLM versions that don't do it.

The condition on line 942 should likely be if not needs_transpose_check:.
For a more robust solution, consider checking the vLLM version:

```python
from vllm import __version__ as VLLM_VERSION
from packaging.version import Version

# The transpose was removed in vLLM v0.4.1.
# We need to transpose only if vLLM does not.
should_transpose = Version(VLLM_VERSION) >= Version("0.4.1")
```

Then use if should_transpose: where you currently use if needs_transpose_check:.
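To make the substitution concrete, here is a minimal sketch of how that flag could be applied. The function name is illustrative, and the tensor being transposed is assumed to be the weight scale referenced in the quoted vLLM source check; the exact code in this PR may differ.

```python
import torch

def maybe_transpose_weight_scale(weight_scale: torch.Tensor, should_transpose: bool) -> torch.Tensor:
    # Only transpose when the installed vLLM no longer does it internally.
    return weight_scale.T.contiguous() if should_transpose else weight_scale
```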
```python
    cutlass_block_fp8_supported = torch.ops._C.cutlass_scaled_mm_supports_block_fp8(sm_cap)
except Exception as e:
    logger.info(f"Unsloth: Could not import vLLM cutlass_block_fp8_supported: {e}")
    pass
```
vLLM recently removed the transpose of the weight scale for Hopper GPUs:
vllm-project/vllm#28431
So we now check whether the weight-processing function transposes the weight scale before doing so ourselves:
vllm-project/vllm@f9a4087#diff-cce58c0ceb6a9b15a01f117d734b93736acc25ed89921c2eacc58ea05bd34d0eL1155-L1157
This PR also adds a flag to override the standby GPU utilization we set (only for extreme scenarios).
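A minimal sketch of the override idea, assuming a hypothetical environment variable name and a placeholder default; the actual flag name and wiring in this PR may differ:

```python
import os

# Hypothetical flag name for illustration only; not necessarily what this PR uses.
_override = os.environ.get("UNSLOTH_VLLM_STANDBY_GPU_UTIL", "")

# Unsloth normally chooses a gpu_memory_utilization for vLLM when standby mode is
# enabled; an override lets users force their own value in extreme scenarios.
default_standby_util = 0.85  # placeholder for whatever Unsloth computes internally
gpu_memory_utilization = float(_override) if _override else default_standby_util
```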