
Conversation


@Datta0 (Collaborator) commented on Nov 14, 2025

vLLM recently removed the transpose of the weight scale for Hopper GPUs:
vllm-project/vllm#28431
So we now check whether vLLM's weight post-processing function transposes the weight scale before applying it ourselves (sketched below):
vllm-project/vllm@f9a4087#diff-cce58c0ceb6a9b15a01f117d734b93736acc25ed89921c2eacc58ea05bd34d0eL1155-L1157
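
Roughly, the idea is the following. This is a minimal sketch only: the exact condition and where it runs in unsloth-zoo may differ, and the review below discusses the polarity of the check. Here `layer` stands for the quantized linear layer whose block-FP8 weight scale is being post-processed:

from inspect import getsource
from vllm.model_executor.layers.quantization.utils.fp8_utils import maybe_post_process_fp8_weight_block

# If vLLM's own post-processing no longer transposes the block-FP8 weight scale,
# apply the transpose here ourselves; otherwise leave it to vLLM.
vllm_transposes_scale = 'layer.weight_scale.data.T.contiguous()' in getsource(maybe_post_process_fp8_weight_block)
if not vllm_transposes_scale:
    layer.weight_scale.data = layer.weight_scale.data.T.contiguous()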

Also adds a flag to override the standby GPU memory utilization we set (just for extreme scenarios).

@gemini-code-assist

Summary of Changes

Hello @Datta0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines Unsloth's integration with vLLM by adapting to recent modifications in vLLM's FP8 weight scale processing and introducing an explicit override for GPU memory utilization in standby mode. These adjustments aim to maintain correct functionality across different vLLM versions and offer more granular control over system resource management.

Highlights

  • vLLM FP8 Weight Scale Handling: Implemented a dynamic check to determine if vLLM's maybe_post_process_fp8_weight_block function performs a transpose of the weight scale. This ensures compatibility with recent vLLM changes where the transpose for Hopper GPUs was removed, applying the transpose conditionally based on the vLLM version's behavior.
  • Standby Utilization Override: Added a new environment variable, UNSLOTH_VLLM_STANDBY_UTIL_OVERRIDE, which allows users to override the default standby GPU memory utilization. This provides greater flexibility for managing resource allocation in extreme or specific scenarios (see the sketch after this list).
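
As a rough illustration of how the override could be consumed (the parsing, default, and clamping below are assumptions for illustration, not confirmed Unsloth behavior):

import os

# UNSLOTH_VLLM_STANDBY_UTIL_OVERRIDE forces a specific standby GPU memory
# utilization instead of the value Unsloth would otherwise pick.
_override = os.environ.get("UNSLOTH_VLLM_STANDBY_UTIL_OVERRIDE", "")
if _override:
    # Clamp to the valid [0, 1] range expected for gpu_memory_utilization.
    standby_gpu_memory_utilization = min(max(float(_override), 0.0), 1.0)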

@gemini-code-assist bot left a comment


Code Review

This pull request introduces two main changes: it adapts to a recent change in vLLM regarding FP8 weight scale transposition and adds a flag to override the standby GPU utilization setting.

My review focuses on the correctness and robustness of the vLLM adaptation. I've found a potential critical issue in the logic for handling the weight scale transpose, which might lead to incorrect model weights. I've also included a suggestion to improve the robustness of the version detection. Additionally, there are some minor code cleanup suggestions. The standby override feature looks good.

# https://github.com/vllm-project/vllm/commit/f9a4087182ffcd9404779fcda876f820b3b26d5f#diff-cce58c0ceb6a9b15a01f117d734b93736acc25ed89921c2eacc58ea05bd34d0eL1155-L1157
from vllm.model_executor.layers.quantization.utils.fp8_utils import maybe_post_process_fp8_weight_block
from inspect import getsource
needs_transpose_check = 'layer.weight_scale.data.T.contiguous()' in getsource(maybe_post_process_fp8_weight_block)
critical

This check using getsource is fragile and can break easily with vllm updates. A version check would be more robust.

More importantly, the logic seems inverted. needs_transpose_check is True for older vllm versions that already perform the transpose. Your code then performs another transpose (on line 948), resulting in a double transpose. The transpose should only happen for newer vllm versions that don't do it.

The condition on line 942 should likely be if not needs_transpose_check:.

For a more robust solution, consider checking the vllm version:

from vllm import __version__ as VLLM_VERSION
from packaging.version import Version

# The transpose was removed upstream (vllm-project/vllm#28431); we need to
# transpose only if vLLM does not. Set the cutoff to the first vLLM release
# that includes that change.
should_transpose = Version(VLLM_VERSION) >= Version("0.4.1")

Then use if should_transpose: where you currently use if needs_transpose_check:.
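
For context, a minimal sketch of how the suggested flag would then gate the transpose (mirroring the attribute access in the snippet quoted above; illustrative only, not the merged code):

if should_transpose:
    # vLLM no longer transposes the block-FP8 weight scale itself, so do it here.
    layer.weight_scale.data = layer.weight_scale.data.T.contiguous()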

cutlass_block_fp8_supported = torch.ops._C.cutlass_scaled_mm_supports_block_fp8(sm_cap)
except Exception as e:
logger.info(f"Unsloth: Could not import vLLM cutlass_block_fp8_supported: {e}")
pass
medium

This pass statement is unnecessary and can be removed. The same applies to the pass on line 950.
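
For reference, the handler from the quoted snippet with the redundant pass removed (assuming it sits inside the surrounding try block as in the original diff):

try:
    cutlass_block_fp8_supported = torch.ops._C.cutlass_scaled_mm_supports_block_fp8(sm_cap)
except Exception as e:
    # The log call is the handler's body, so no trailing pass is needed.
    logger.info(f"Unsloth: Could not import vLLM cutlass_block_fp8_supported: {e}")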

@danielhanchen merged commit 6d89b0e into unslothai:main on Nov 14, 2025
