
Conversation


@codebasecomprehension codebasecomprehension commented Dec 31, 2025

Summary

This PR adds a new vllm/utils/device_utils module as part of the vLLM utils cleanup effort (Issue #26900). The module provides utilities for GPU device management, memory monitoring, and hardware-specific operations.

Changes

  • vllm/utils/device_utils.py: New module with:

    • get_device_property(): Query GPU capabilities (name, memory, compute capability)
    • get_gpu_name(): Get GPU device name
    • get_gpu_memory_info(): Comprehensive memory statistics
    • get_gpu_utilization(): Memory usage percentage
    • get_available_gpu_memory(): Free memory in GB
    • clear_gpu_caches(): Release cached memory
    • device_memory_tracing(): Context manager for profiling
    • get_device_count(): Number of available GPUs
    • is_using_gpu(): Check CUDA availability
    • get_current_device(): Current device index
    • estimate_model_memory_requirements(): Plan resource allocation
  • tests/test_device_utils.py: Comprehensive unit tests
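
For illustration, a minimal sketch of the style of test such a file could contain; this is not the PR's actual test code, and it only exercises behavior implied by the summary above and the usage example below:

# Hypothetical test sketch -- not the tests shipped in this PR.
import pytest
import torch

from vllm.utils.device_utils import get_device_count, get_gpu_memory_info


def test_device_count_is_non_negative():
    # get_device_count() reports the number of visible GPUs (0 on CPU-only hosts).
    assert get_device_count() >= 0


@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a CUDA device")
def test_memory_info_exposes_gb_fields():
    # Key names taken from the usage example in this PR description.
    info = get_gpu_memory_info()
    assert 0.0 <= info["allocated_gb"] <= info["total_gb"]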

Motivation

This contribution addresses Issue #26900 (Clean up vllm.utils) by:

  1. Separating device/GPU utilities into a dedicated module
  2. Providing a clean API for GPU resource management
  3. Helping users monitor and optimize GPU usage during vLLM inference

Example Usage

from vllm.utils.device_utils import (
    get_gpu_memory_info,
    get_gpu_utilization,
    device_memory_tracing,
)

# Check memory before and after inference
info = get_gpu_memory_info()
print(f"GPU Memory: {info['allocated_gb']:.2f}GB / {info['total_gb']:.2f}GB")

# Trace memory usage of an operation
with device_memory_tracing() as mem:
    results = model.generate(prompts)
print(f"Memory delta: {mem.get('delta_gb', 0):.3f} GB")
Add device_utils module with GPU memory monitoring utilities
as part of vllm/utils cleanup effort (Issue vllm-project#26900).

Features:
- get_device_property() for querying GPU capabilities
- get_gpu_memory_info() for comprehensive memory stats
- get_gpu_utilization() for memory usage percentage
- clear_gpu_caches() for memory management
- device_memory_tracing() context manager
- estimate_model_memory_requirements() for planning

This module helps users monitor and manage GPU resources
during vLLM inference operations.

Signed-off-by: codebasecomprehension <sayakmondal56@gmail.com>
Add comprehensive unit tests for device_utils functions
including GPU memory info, utilization, cache clearing,
memory tracing, and model memory estimation.

Signed-off-by: codebasecomprehension <sayakmondal56@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new vllm.utils.device_utils module, which is a good step towards better code organization. The new module provides a comprehensive set of utilities for GPU management.

My review has identified a few critical issues in the implementation of clear_gpu_caches and the device_memory_tracing context manager that need to be addressed. Additionally, there are some high-severity issues related to the accuracy of memory calculations and potential bugs in the memory estimation logic. I've provided detailed feedback and code suggestions for each of these points to help improve the correctness and robustness of this new module.

Comment on lines 209 to 214
if device is not None:
    if 0 <= device < torch.cuda.device_count():
        torch.cuda.empty_cache()
else:
    for _ in range(torch.cuda.device_count()):
        torch.cuda.empty_cache()

critical

The implementation of clear_gpu_caches is incorrect and will not behave as documented.

  1. When device is specified, torch.cuda.empty_cache() is called without switching to that device. It will always clear the cache of the current device, not the specified one.
  2. When device is None, the loop calls torch.cuda.empty_cache() n times on the current device, instead of iterating through all n available devices and clearing each one's cache.

To fix this, you should use a torch.cuda.device context manager to switch to the correct device before clearing its cache.

Suggested change
if device is not None:
    if 0 <= device < torch.cuda.device_count():
        torch.cuda.empty_cache()
else:
    for _ in range(torch.cuda.device_count()):
        torch.cuda.empty_cache()
if device is not None:
    if 0 <= device < torch.cuda.device_count():
        with torch.cuda.device(device):
            torch.cuda.empty_cache()
else:
    for i in range(torch.cuda.device_count()):
        with torch.cuda.device(i):
            torch.cuda.empty_cache()
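
For reference, the per-device pattern this suggestion relies on can be exercised on its own. A minimal standalone sketch using only public torch.cuda calls:

import torch

# Clear the caching allocator on every visible GPU, one device at a time.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        with torch.cuda.device(i):      # temporarily make device i the current device
            torch.cuda.empty_cache()    # releases cached blocks held for device i only
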
Comment on lines 218 to 264
def device_memory_tracing(device: int = 0) -> Generator[Dict[str, float], None, None]:
    """Context manager to trace memory usage before and after a code block.

    This provides a convenient way to measure the memory impact of
    specific operations or code sections.

    Args:
        device: The CUDA device ID to trace (default: 0)

    Yields:
        A dictionary containing memory statistics that gets updated
        with delta information after the context exits.

    Example:
        >>> with device_memory_tracing() as mem_before:
        ...     # Memory stats at entry
        ...     pass
        >>> # After exit, mem_before contains delta information
        >>> print(f"Memory delta: {mem_before.get('delta_gb', 0):.3f} GB")
    """
    if not torch.cuda.is_available():
        yield {
            'device': device,
            'before_allocated_gb': 0.0,
            'after_allocated_gb': 0.0,
            'delta_gb': 0.0,
        }
        return

    torch.cuda.synchronize(device)
    before_allocated = torch.cuda.memory_allocated(device)

    yield {
        'device': device,
        'before_allocated_gb': before_allocated / (1024**3),
    }

    # After the context, calculate delta
    torch.cuda.synchronize(device)
    after_allocated = torch.cuda.memory_allocated(device)

    yield {
        'device': device,
        'before_allocated_gb': before_allocated / (1024**3),
        'after_allocated_gb': after_allocated / (1024**3),
        'delta_gb': (after_allocated - before_allocated) / (1024**3),
    }

critical

The context manager device_memory_tracing is implemented incorrectly. A generator wrapped by contextlib.contextmanager must yield exactly once. With the two yield statements here, the generator does not stop after the with block exits, so contextlib raises RuntimeError ("generator didn't stop") when the context manager is used.

To fix this, you should yield a dictionary and then update it in place after the yield statement, before the function returns. This ensures the caller gets the updated memory delta information in the same dictionary object.

def device_memory_tracing(device: int = 0) -> Generator[Dict[str, float], None, None]:
    """Context manager to trace memory usage before and after a code block.
    
    This provides a convenient way to measure the memory impact of
    specific operations or code sections.
    
    Args:
        device: The CUDA device ID to trace (default: 0)
    
    Yields:
        A dictionary containing memory statistics that gets updated
        with delta information after the context exits.
    
    Example:
        >>> with device_memory_tracing() as mem_info:
        ...     # Memory stats at entry
        ...     pass
        >>> # After exit, mem_info contains delta information
        >>> print(f"Memory delta: {mem_info.get('delta_gb', 0):.3f} GB")
    """
    if not torch.cuda.is_available():
        mem_info = {
            'device': device,
            'before_allocated_gb': 0.0,
            'after_allocated_gb': 0.0,
            'delta_gb': 0.0,
        }
        yield mem_info
        return
    
    torch.cuda.synchronize(device)
    before_allocated = torch.cuda.memory_allocated(device)
    
    mem_info = {
        'device': device,
        'before_allocated_gb': before_allocated / (1024**3),
    }
    
    yield mem_info
    
    # After the context, calculate delta
    torch.cuda.synchronize(device)
    after_allocated = torch.cuda.memory_allocated(device)
    
    mem_info['after_allocated_gb'] = after_allocated / (1024**3)
    mem_info['delta_gb'] = (after_allocated - before_allocated) / (1024**3)
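
Assuming the function keeps a contextlib.contextmanager decorator, the single-yield version can then be used exactly as the PR description shows. A small self-contained check (the 0.5 GB scratch tensor is an arbitrary choice for illustration):

import torch
from vllm.utils.device_utils import device_memory_tracing

with device_memory_tracing(device=0) as mem:
    if torch.cuda.is_available():
        # ~0.5 GB of fp16 scratch data allocated inside the traced block
        scratch = torch.empty(256 * 1024 * 1024, dtype=torch.float16, device="cuda:0")

# After exit, the same dict now carries the delta fields filled in above.
print(f"Memory delta: {mem.get('delta_gb', 0.0):.3f} GB")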

allocated = torch.cuda.memory_allocated(device)
reserved = torch.cuda.memory_reserved(device)
free = total_memory - reserved

high

The calculation of free memory as total_memory - reserved can be misleading. This value represents the memory not reserved by PyTorch's caching allocator, but it may not be truly available if other processes are using the GPU.

A more accurate way to determine the available GPU memory is to use torch.cuda.mem_get_info(device). This will provide the actual free memory as reported by the CUDA driver.

Suggested change
free = total_memory - reserved
free, _ = torch.cuda.mem_get_info(device)
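
For completeness: torch.cuda.mem_get_info returns a (free_bytes, total_bytes) pair straight from the CUDA driver, so the figure also reflects memory held by other processes on the same GPU:

import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)   # driver-level view of device 0
    print(f"free: {free_bytes / 1024**3:.2f} GB of {total_bytes / 1024**3:.2f} GB")
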
Comment on lines 181 to 185
props = torch.cuda.get_device_properties(device)
reserved = torch.cuda.memory_reserved(device)
free = props.total_memory - reserved

return free / (1024**3)

high

The calculation of free memory as props.total_memory - reserved can be misleading as it doesn't account for memory used by other processes. It's more accurate to use torch.cuda.mem_get_info(device) to get the actual available memory from the CUDA driver.

This change also simplifies the function and avoids an uncached call to torch.cuda.get_device_properties(device).

Suggested change
props = torch.cuda.get_device_properties(device)
reserved = torch.cuda.memory_reserved(device)
free = props.total_memory - reserved
return free / (1024**3)
free, _ = torch.cuda.mem_get_info(device)
return free / (1024**3)
                    precision = "bf16"
                else:
                    precision = "fp16"
            except:

high

Using a bare except: is discouraged as it can catch and hide unexpected errors, including system-exiting exceptions like SystemExit or KeyboardInterrupt. It's better to catch a specific exception, or Exception at the very least, to avoid unintended side effects and make debugging easier.

            except Exception:
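
To illustrate the difference, a minimal sketch of the fallback pattern with except Exception (this is not the PR's exact probe): ordinary failures fall back to fp16, while KeyboardInterrupt and SystemExit still propagate.

import torch

# Pick a precision string, falling back on any ordinary error.
try:
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        precision = "bf16"
    else:
        precision = "fp16"
except Exception:   # does not swallow KeyboardInterrupt / SystemExit
    precision = "fp16"
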
Comment on lines 413 to 416
    total = weight_memory
    if "activations_gb" in estimates:
        total += estimates["activations_gb"]
    estimates["total_estimate_gb"] = total

high

The total_estimate_gb calculation is incomplete. It includes weights_gb and activations_gb, but it omits embeddings_gb, which can be a significant part of a model's memory footprint. The total estimate should include all relevant components for a more accurate result.

    total = weight_memory
    if "activations_gb" in estimates:
        total += estimates["activations_gb"]
    if "embeddings_gb" in estimates:
        total += estimates["embeddings_gb"]
    estimates["total_estimate_gb"] = total
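
As a rough worked example of the component sum (illustrative numbers only, not the PR's defaults):

# Hypothetical 7B-parameter model in fp16 (2 bytes per parameter).
params = 7_000_000_000
weights_gb = params * 2 / 1024**3                                  # ~13.0 GB of weights
activations_gb = 2.0                                               # assumed activation budget
embeddings_gb = 0.5                                                # assumed embedding table size
total_estimate_gb = weights_gb + activations_gb + embeddings_gb    # ~15.5 GB
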
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@DarkLight1337
Member

Where are these utils being used in vLLM? Otherwise it's just dead code

Fix code review issues: device selection, context manager, memory calculations

Signed-off-by: codebasecomprehension <sayakmondal56@gmail.com>
@codebasecomprehension
Author

Where are these utils being used in vLLM? Otherwise it's just dead code

Thanks for the detailed feedback! I've fixed all the issues:
✅ Fixed clear_gpu_caches() to properly switch devices
✅ Fixed device_memory_tracing() to use single yield
✅ Updated memory calculations to use torch.cuda.mem_get_info()
✅ Changed bare except to except Exception
✅ Added embeddings to total estimate

All suggestions have been applied. Ready for re-review!

