vLLM weight scale FP8 and standby override #354
Conversation
Summary of Changes

Hello @Datta0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request refines Unsloth's integration with vLLM by adapting to recent modifications in vLLM's FP8 weight scale processing and introducing an explicit override for GPU memory utilization in standby mode. These adjustments maintain correct functionality across different vLLM versions and offer more granular control over system resource management.
Code Review
This pull request introduces two main changes: it adapts to a recent change in vLLM regarding FP8 weight scale transposition and adds a flag to override the standby GPU utilization setting.
My review focuses on the correctness and robustness of the vLLM adaptation. I've found a potentially critical issue in the logic for handling the weight-scale transpose, which might lead to incorrect model weights, and I've included a suggestion to improve the robustness of the version detection. There are also some minor code cleanup suggestions. The standby override feature looks good.
```python
# https://github.com/vllm-project/vllm/commit/f9a4087182ffcd9404779fcda876f820b3b26d5f#diff-cce58c0ceb6a9b15a01f117d734b93736acc25ed89921c2eacc58ea05bd34d0eL1155-L1157
from vllm.model_executor.layers.quantization.utils.fp8_utils import maybe_post_process_fp8_weight_block
from inspect import getsource
needs_transpose_check = 'layer.weight_scale.data.T.contiguous()' in getsource(maybe_post_process_fp8_weight_block)
```
This check using getsource is fragile and can easily break with vLLM updates. A version check would be more robust.

More importantly, the logic seems inverted: needs_transpose_check is True for older vLLM versions that already perform the transpose. Your code then performs another transpose (on line 948), resulting in a double transpose. The transpose should only happen for newer vLLM versions that don't do it.

The condition on line 942 should likely be if not needs_transpose_check:.
For a more robust solution, consider checking the vLLM version:

```python
from vllm import __version__ as VLLM_VERSION
from packaging.version import Version

# The transpose was removed in vLLM v0.4.1.
# We need to transpose only if vLLM does not.
should_transpose = Version(VLLM_VERSION) >= Version("0.4.1")
```

Then use if should_transpose: where you currently use if needs_transpose_check:.
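To make the substitution concrete, here is a minimal sketch of how that flag could be applied. The function name is illustrative, and the tensor being transposed is assumed to be the weight scale referenced in the quoted vLLM source check; the exact code in this PR may differ.

```python
import torch

def maybe_transpose_weight_scale(weight_scale: torch.Tensor, should_transpose: bool) -> torch.Tensor:
    # Only transpose when the installed vLLM no longer does it internally.
    return weight_scale.T.contiguous() if should_transpose else weight_scale
```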
```python
    cutlass_block_fp8_supported = torch.ops._C.cutlass_scaled_mm_supports_block_fp8(sm_cap)
except Exception as e:
    logger.info(f"Unsloth: Could not import vLLM cutlass_block_fp8_supported: {e}")
    pass
```
vLLM recently removed the transpose of the weight scale for Hopper GPUs:
vllm-project/vllm#28431
So we now check whether the weight-processing function transposes the weight scale before doing so ourselves:
vllm-project/vllm@f9a4087#diff-cce58c0ceb6a9b15a01f117d734b93736acc25ed89921c2eacc58ea05bd34d0eL1155-L1157
This PR also adds a flag to override the standby GPU utilization we set (only for extreme scenarios).
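A minimal sketch of the override idea, assuming a hypothetical environment variable name and a placeholder default; the actual flag name and wiring in this PR may differ:

```python
import os

# Hypothetical flag name for illustration only; not necessarily what this PR uses.
_override = os.environ.get("UNSLOTH_VLLM_STANDBY_GPU_UTIL", "")

# Unsloth normally chooses a gpu_memory_utilization for vLLM when standby mode is
# enabled; an override lets users force their own value in extreme scenarios.
default_standby_util = 0.85  # placeholder for whatever Unsloth computes internally
gpu_memory_utilization = float(_override) if _override else default_standby_util
```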