
Conversation


@codebasecomprehension codebasecomprehension commented Dec 31, 2025

Summary

This PR adds a new vllm/utils/device_utils module as part of the vLLM utils cleanup effort (Issue #26900). The module provides utilities for GPU device management, memory monitoring, and hardware-specific operations.

Changes

  • vllm/utils/device_utils.py: New module with:

    • get_device_property(): Query GPU capabilities (name, memory, compute capability)
    • get_gpu_name(): Get GPU device name
    • get_gpu_memory_info(): Comprehensive memory statistics
    • get_gpu_utilization(): Memory usage percentage
    • get_available_gpu_memory(): Free memory in GB
    • clear_gpu_caches(): Release cached memory
    • device_memory_tracing(): Context manager for profiling
    • get_device_count(): Number of available GPUs
    • is_using_gpu(): Check CUDA availability
    • get_current_device(): Current device index
    • estimate_model_memory_requirements(): Plan resource allocation
  • tests/test_device_utils.py: Comprehensive unit tests
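
For illustration, a minimal sketch of the style of test such a file could contain; this is not the PR's actual test code, and it only exercises behavior implied by the summary above and the usage example below:

# Hypothetical test sketch -- not the tests shipped in this PR.
import pytest
import torch

from vllm.utils.device_utils import get_device_count, get_gpu_memory_info


def test_device_count_is_non_negative():
    # get_device_count() reports the number of visible GPUs (0 on CPU-only hosts).
    assert get_device_count() >= 0


@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a CUDA device")
def test_memory_info_exposes_gb_fields():
    # Key names taken from the usage example in this PR description.
    info = get_gpu_memory_info()
    assert 0.0 <= info["allocated_gb"] <= info["total_gb"]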

Motivation

This contribution addresses Issue #26900 (Clean up vllm.utils) by:

  1. Separating device/GPU utilities into a dedicated module
  2. Providing a clean API for GPU resource management
  3. Helping users monitor and optimize GPU usage during vLLM inference

Example Usage

from vllm.utils.device_utils import (
    get_gpu_memory_info,
    get_gpu_utilization,
    device_memory_tracing,
)

# Check memory before and after inference
info = get_gpu_memory_info()
print(f"GPU Memory: {info['allocated_gb']:.2f}GB / {info['total_gb']:.2f}GB")

# Trace memory usage of an operation
with device_memory_tracing() as mem:
    results = model.generate(prompts)
print(f"Memory delta: {mem.get('delta_gb', 0):.3f} GB")
Add device_utils module with GPU memory monitoring utilities
as part of vllm/utils cleanup effort (Issue vllm-project#26900).

Features:
- get_device_property() for querying GPU capabilities
- get_gpu_memory_info() for comprehensive memory stats
- get_gpu_utilization() for memory usage percentage
- clear_gpu_caches() for memory management
- device_memory_tracing() context manager
- estimate_model_memory_requirements() for planning

This module helps users monitor and manage GPU resources
during vLLM inference operations.

Signed-off-by: codebasecomprehension <sayakmondal56@gmail.com>
Add comprehensive unit tests for device_utils functions
including GPU memory info, utilization, cache clearing,
memory tracing, and model memory estimation.

Signed-off-by: codebasecomprehension <sayakmondal56@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new vllm.utils.device_utils module, which is a good step towards better code organization. The new module provides a comprehensive set of utilities for GPU management.

My review has identified a few critical issues in the implementation of clear_gpu_caches and the device_memory_tracing context manager that need to be addressed. Additionally, there are some high-severity issues related to the accuracy of memory calculations and potential bugs in the memory estimation logic. I've provided detailed feedback and code suggestions for each of these points to help improve the correctness and robustness of this new module.

Comment on lines 209 to 214
if device is not None:
    if 0 <= device < torch.cuda.device_count():
        torch.cuda.empty_cache()
else:
    for _ in range(torch.cuda.device_count()):
        torch.cuda.empty_cache()

critical

The implementation of clear_gpu_caches is incorrect and will not behave as documented.

  1. When device is specified, torch.cuda.empty_cache() is called without switching to that device. It will always clear the cache of the current device, not the specified one.
  2. When device is None, the loop calls torch.cuda.empty_cache() n times on the current device, instead of iterating through all n available devices and clearing each one's cache.

To fix this, you should use a torch.cuda.device context manager to switch to the correct device before clearing its cache.

Suggested change
if device is not None:
    if 0 <= device < torch.cuda.device_count():
        torch.cuda.empty_cache()
else:
    for _ in range(torch.cuda.device_count()):
        torch.cuda.empty_cache()
if device is not None:
    if 0 <= device < torch.cuda.device_count():
        with torch.cuda.device(device):
            torch.cuda.empty_cache()
else:
    for i in range(torch.cuda.device_count()):
        with torch.cuda.device(i):
            torch.cuda.empty_cache()
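
For reference, the per-device pattern this suggestion relies on can be exercised on its own. A minimal standalone sketch using only public torch.cuda calls:

import torch

# Clear the caching allocator on every visible GPU, one device at a time.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        with torch.cuda.device(i):      # temporarily make device i the current device
            torch.cuda.empty_cache()    # releases cached blocks held for device i only
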
Comment on lines 218 to 264
def device_memory_tracing(device: int = 0) -> Generator[Dict[str, float], None, None]:
    """Context manager to trace memory usage before and after a code block.

    This provides a convenient way to measure the memory impact of
    specific operations or code sections.

    Args:
        device: The CUDA device ID to trace (default: 0)

    Yields:
        A dictionary containing memory statistics that gets updated
        with delta information after the context exits.

    Example:
        >>> with device_memory_tracing() as mem_before:
        ...     # Memory stats at entry
        ...     pass
        >>> # After exit, mem_before contains delta information
        >>> print(f"Memory delta: {mem_before.get('delta_gb', 0):.3f} GB")
    """
    if not torch.cuda.is_available():
        yield {
            'device': device,
            'before_allocated_gb': 0.0,
            'after_allocated_gb': 0.0,
            'delta_gb': 0.0,
        }
        return

    torch.cuda.synchronize(device)
    before_allocated = torch.cuda.memory_allocated(device)

    yield {
        'device': device,
        'before_allocated_gb': before_allocated / (1024**3),
    }

    # After the context, calculate delta
    torch.cuda.synchronize(device)
    after_allocated = torch.cuda.memory_allocated(device)

    yield {
        'device': device,
        'before_allocated_gb': before_allocated / (1024**3),
        'after_allocated_gb': after_allocated / (1024**3),
        'delta_gb': (after_allocated - before_allocated) / (1024**3),
    }

critical

The context manager device_memory_tracing is implemented incorrectly. A generator wrapped by contextlib.contextmanager must yield exactly once. With the two yield statements here, the generator does not stop after the with block exits, so contextlib raises RuntimeError ("generator didn't stop") when the context manager is used.

To fix this, you should yield a dictionary and then update it in place after the yield statement, before the function returns. This ensures the caller gets the updated memory delta information in the same dictionary object.

def device_memory_tracing(device: int = 0) -> Generator[Dict[str, float], None, None]:
    """Context manager to trace memory usage before and after a code block.
    
    This provides a convenient way to measure the memory impact of
    specific operations or code sections.
    
    Args:
        device: The CUDA device ID to trace (default: 0)
    
    Yields:
        A dictionary containing memory statistics that gets updated
        with delta information after the context exits.
    
    Example:
        >>> with device_memory_tracing() as mem_info:
        ...     # Memory stats at entry
        ...     pass
        >>> # After exit, mem_info contains delta information
        >>> print(f"Memory delta: {mem_info.get('delta_gb', 0):.3f} GB")
    """
    if not torch.cuda.is_available():
        mem_info = {
            'device': device,
            'before_allocated_gb': 0.0,
            'after_allocated_gb': 0.0,
            'delta_gb': 0.0,
        }
        yield mem_info
        return
    
    torch.cuda.synchronize(device)
    before_allocated = torch.cuda.memory_allocated(device)
    
    mem_info = {
        'device': device,
        'before_allocated_gb': before_allocated / (1024**3),
    }
    
    yield mem_info
    
    # After the context, calculate delta
    torch.cuda.synchronize(device)
    after_allocated = torch.cuda.memory_allocated(device)
    
    mem_info['after_allocated_gb'] = after_allocated / (1024**3)
    mem_info['delta_gb'] = (after_allocated - before_allocated) / (1024**3)
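
Assuming the function keeps a contextlib.contextmanager decorator, the single-yield version can then be used exactly as the PR description shows. A small self-contained check (the 0.5 GB scratch tensor is an arbitrary choice for illustration):

import torch
from vllm.utils.device_utils import device_memory_tracing

with device_memory_tracing(device=0) as mem:
    if torch.cuda.is_available():
        # ~0.5 GB of fp16 scratch data allocated inside the traced block
        scratch = torch.empty(256 * 1024 * 1024, dtype=torch.float16, device="cuda:0")

# After exit, the same dict now carries the delta fields filled in above.
print(f"Memory delta: {mem.get('delta_gb', 0.0):.3f} GB")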

allocated = torch.cuda.memory_allocated(device)
reserved = torch.cuda.memory_reserved(device)
free = total_memory - reserved

high

The calculation of free memory as total_memory - reserved can be misleading. This value represents the memory not reserved by PyTorch's caching allocator, but it may not be truly available if other processes are using the GPU.

A more accurate way to determine the available GPU memory is to use torch.cuda.mem_get_info(device). This will provide the actual free memory as reported by the CUDA driver.

Suggested change
free = total_memory - reserved
free, _ = torch.cuda.mem_get_info(device)
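
For completeness: torch.cuda.mem_get_info returns a (free_bytes, total_bytes) pair straight from the CUDA driver, so the figure also reflects memory held by other processes on the same GPU:

import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)   # driver-level view of device 0
    print(f"free: {free_bytes / 1024**3:.2f} GB of {total_bytes / 1024**3:.2f} GB")
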
Comment on lines 181 to 185
props = torch.cuda.get_device_properties(device)
reserved = torch.cuda.memory_reserved(device)
free = props.total_memory - reserved

return free / (1024**3)

high

The calculation of free memory as props.total_memory - reserved can be misleading as it doesn't account for memory used by other processes. It's more accurate to use torch.cuda.mem_get_info(device) to get the actual available memory from the CUDA driver.

This change also simplifies the function and avoids an uncached call to torch.cuda.get_device_properties(device).

Suggested change
props = torch.cuda.get_device_properties(device)
reserved = torch.cuda.memory_reserved(device)
free = props.total_memory - reserved
return free / (1024**3)
free, _ = torch.cuda.mem_get_info(device)
return free / (1024**3)
                    precision = "bf16"
                else:
                    precision = "fp16"
            except:

high

Using a bare except: is discouraged as it can catch and hide unexpected errors, including system-exiting exceptions like SystemExit or KeyboardInterrupt. It's better to catch a specific exception, or Exception at the very least, to avoid unintended side effects and make debugging easier.

            except Exception:
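
To illustrate the difference, a minimal sketch of the fallback pattern with except Exception (this is not the PR's exact probe): ordinary failures fall back to fp16, while KeyboardInterrupt and SystemExit still propagate.

import torch

# Pick a precision string, falling back on any ordinary error.
try:
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        precision = "bf16"
    else:
        precision = "fp16"
except Exception:   # does not swallow KeyboardInterrupt / SystemExit
    precision = "fp16"
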
Comment on lines 413 to 416
    total = weight_memory
    if "activations_gb" in estimates:
        total += estimates["activations_gb"]
    estimates["total_estimate_gb"] = total

high

The total_estimate_gb calculation is incomplete. It includes weights_gb and activations_gb, but it omits embeddings_gb, which can be a significant part of a model's memory footprint. The total estimate should include all relevant components for a more accurate result.

    total = weight_memory
    if "activations_gb" in estimates:
        total += estimates["activations_gb"]
    if "embeddings_gb" in estimates:
        total += estimates["embeddings_gb"]
    estimates["total_estimate_gb"] = total
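
As a rough worked example of the component sum (illustrative numbers only, not the PR's defaults):

# Hypothetical 7B-parameter model in fp16 (2 bytes per parameter).
params = 7_000_000_000
weights_gb = params * 2 / 1024**3                                  # ~13.0 GB of weights
activations_gb = 2.0                                               # assumed activation budget
embeddings_gb = 0.5                                                # assumed embedding table size
total_estimate_gb = weights_gb + activations_gb + embeddings_gb    # ~15.5 GB
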
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@DarkLight1337
Member

Where are these utils being used in vLLM? Otherwise it's just dead code

Fix code review issues: device selection, context manager, memory calculations

Signed-off-by: codebasecomprehension <sayakmondal56@gmail.com>
@codebasecomprehension
Author

Where are these utils being used in vLLM? Otherwise it's just dead code

Thanks for the detailed feedback! I've fixed all the issues:
✅ Fixed clear_gpu_caches() to properly switch devices
✅ Fixed device_memory_tracing() to use single yield
✅ Updated memory calculations to use torch.cuda.mem_get_info()
✅ Changed bare except to except Exception
✅ Added embeddings to total estimate

All suggestions have been applied. Ready for re-review!

