feat: add vllm.utils.device_utils module #31576
Conversation
Add device_utils module with GPU memory monitoring utilities as part of the vllm/utils cleanup effort (Issue vllm-project#26900).

Features:
- get_device_property() for querying GPU capabilities
- get_gpu_memory_info() for comprehensive memory stats
- get_gpu_utilization() for memory usage percentage
- clear_gpu_caches() for memory management
- device_memory_tracing() context manager
- estimate_model_memory_requirements() for planning

This module helps users monitor and manage GPU resources during vLLM inference operations.

Signed-off-by: codebasecomprehension <sayakmondal56@gmail.com>
Add comprehensive unit tests for device_utils functions including GPU memory info, utilization, cache clearing, memory tracing, and model memory estimation. Signed-off-by: codebasecomprehension <sayakmondal56@gmail.com>
Code Review
This pull request introduces a new vllm.utils.device_utils module, which is a good step towards better code organization. The new module provides a comprehensive set of utilities for GPU management.
My review has identified a few critical issues in the implementation of clear_gpu_caches and the device_memory_tracing context manager that need to be addressed. Additionally, there are some high-severity issues related to the accuracy of memory calculations and potential bugs in the memory estimation logic. I've provided detailed feedback and code suggestions for each of these points to help improve the correctness and robustness of this new module.
vllm/utils/device_utils.py
Outdated
```python
if device is not None:
    if 0 <= device < torch.cuda.device_count():
        torch.cuda.empty_cache()
else:
    for _ in range(torch.cuda.device_count()):
        torch.cuda.empty_cache()
```
The implementation of `clear_gpu_caches` is incorrect and will not behave as documented.

- When `device` is specified, `torch.cuda.empty_cache()` is called without switching to that device. It will always clear the cache of the current device, not the specified one.
- When `device` is `None`, the loop calls `torch.cuda.empty_cache()` n times on the current device, instead of iterating through all n available devices and clearing each one's cache.
To fix this, you should use a `torch.cuda.device` context manager to switch to the correct device before clearing its cache.
Suggested change:

```python
if device is not None:
    if 0 <= device < torch.cuda.device_count():
        with torch.cuda.device(device):
            torch.cuda.empty_cache()
else:
    for i in range(torch.cuda.device_count()):
        with torch.cuda.device(i):
            torch.cuda.empty_cache()
```
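For reference, `torch.cuda.device` is a standard PyTorch context manager that temporarily switches the current device and restores the previous one on exit; a minimal standalone sketch (not part of this PR):

```python
import torch

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    with torch.cuda.device(1):
        # Inside the block, device 1 is current, so empty_cache() (and any
        # allocation without an explicit device) targets device 1.
        assert torch.cuda.current_device() == 1
        torch.cuda.empty_cache()
    # On exit, the previously current device is restored.
```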
vllm/utils/device_utils.py
Outdated
```python
def device_memory_tracing(device: int = 0) -> Generator[Dict[str, float], None, None]:
    """Context manager to trace memory usage before and after a code block.

    This provides a convenient way to measure the memory impact of
    specific operations or code sections.

    Args:
        device: The CUDA device ID to trace (default: 0)

    Yields:
        A dictionary containing memory statistics that gets updated
        with delta information after the context exits.

    Example:
        >>> with device_memory_tracing() as mem_before:
        ...     # Memory stats at entry
        ...     pass
        >>> # After exit, mem_before contains delta information
        >>> print(f"Memory delta: {mem_before.get('delta_gb', 0):.3f} GB")
    """
    if not torch.cuda.is_available():
        yield {
            'device': device,
            'before_allocated_gb': 0.0,
            'after_allocated_gb': 0.0,
            'delta_gb': 0.0,
        }
        return

    torch.cuda.synchronize(device)
    before_allocated = torch.cuda.memory_allocated(device)

    yield {
        'device': device,
        'before_allocated_gb': before_allocated / (1024**3),
    }

    # After the context, calculate delta
    torch.cuda.synchronize(device)
    after_allocated = torch.cuda.memory_allocated(device)

    yield {
        'device': device,
        'before_allocated_gb': before_allocated / (1024**3),
        'after_allocated_gb': after_allocated / (1024**3),
        'delta_gb': (after_allocated - before_allocated) / (1024**3),
    }
```
The context manager device_memory_tracing is implemented incorrectly. A generator-based context manager should only have a single yield statement. The presence of two yield statements will cause a TypeError: 'generator' object is not an iterator when the context manager is used.
To fix this, you should yield a dictionary and then update it in place after the yield statement, before the function returns. This ensures the caller gets the updated memory delta information in the same dictionary object.
```python
def device_memory_tracing(device: int = 0) -> Generator[Dict[str, float], None, None]:
    """Context manager to trace memory usage before and after a code block.

    This provides a convenient way to measure the memory impact of
    specific operations or code sections.

    Args:
        device: The CUDA device ID to trace (default: 0)

    Yields:
        A dictionary containing memory statistics that gets updated
        with delta information after the context exits.

    Example:
        >>> with device_memory_tracing() as mem_info:
        ...     # Memory stats at entry
        ...     pass
        >>> # After exit, mem_info contains delta information
        >>> print(f"Memory delta: {mem_info.get('delta_gb', 0):.3f} GB")
    """
    if not torch.cuda.is_available():
        mem_info = {
            'device': device,
            'before_allocated_gb': 0.0,
            'after_allocated_gb': 0.0,
            'delta_gb': 0.0,
        }
        yield mem_info
        return

    torch.cuda.synchronize(device)
    before_allocated = torch.cuda.memory_allocated(device)
    mem_info = {
        'device': device,
        'before_allocated_gb': before_allocated / (1024**3),
    }
    yield mem_info

    # After the context, calculate delta
    torch.cuda.synchronize(device)
    after_allocated = torch.cuda.memory_allocated(device)
    mem_info['after_allocated_gb'] = after_allocated / (1024**3)
    mem_info['delta_gb'] = (after_allocated - before_allocated) / (1024**3)
```
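For context, a quick usage sketch of the fixed context manager; it assumes the function is decorated with `@contextlib.contextmanager` and exported from the new module, so the import path mirrors this PR rather than existing vLLM code:

```python
import torch
from vllm.utils.device_utils import device_memory_tracing  # added by this PR

with device_memory_tracing(device=0) as mem_info:
    # Allocate roughly 400 MB on device 0 inside the traced block.
    buf = torch.empty(100 * 1024 * 1024, dtype=torch.float32, device="cuda:0")

# After exit, the same dict has been updated in place with the delta.
print(f"Memory delta: {mem_info['delta_gb']:.3f} GB")
```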
vllm/utils/device_utils.py
Outdated
```python
allocated = torch.cuda.memory_allocated(device)
reserved = torch.cuda.memory_reserved(device)
free = total_memory - reserved
```
The calculation of free memory as total_memory - reserved can be misleading. This value represents the memory not reserved by PyTorch's caching allocator, but it may not be truly available if other processes are using the GPU.
A more accurate way to determine the available GPU memory is to use torch.cuda.mem_get_info(device). This will provide the actual free memory as reported by the CUDA driver.
Suggested change:

```python
free, _ = torch.cuda.mem_get_info(device)
```
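For reference, `torch.cuda.mem_get_info` returns a `(free, total)` tuple in bytes straight from the CUDA driver, so it also accounts for memory held by other processes:

```python
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    print(f"Driver-reported free: {free_bytes / 1024**3:.2f} GiB "
          f"of {total_bytes / 1024**3:.2f} GiB")
```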
vllm/utils/device_utils.py
Outdated
```python
props = torch.cuda.get_device_properties(device)
reserved = torch.cuda.memory_reserved(device)
free = props.total_memory - reserved

return free / (1024**3)
```
The calculation of free memory as props.total_memory - reserved can be misleading as it doesn't account for memory used by other processes. It's more accurate to use torch.cuda.mem_get_info(device) to get the actual available memory from the CUDA driver.
This change also simplifies the function and avoids a call to torch.cuda.get_device_properties(device), which is not cached.
Suggested change:

```python
free, _ = torch.cuda.mem_get_info(device)
return free / (1024**3)
```
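As an illustration of how a caller might use the simplified helper, a hedged sketch; the function name comes from this PR, while the `device` keyword and the 16 GB requirement are assumptions made for the example:

```python
from vllm.utils.device_utils import get_available_gpu_memory  # added by this PR

required_gb = 16.0  # hypothetical footprint of the model to be loaded
available_gb = get_available_gpu_memory(device=0)
if available_gb < required_gb:
    raise RuntimeError(
        f"Need ~{required_gb:.1f} GB on GPU 0, but only {available_gb:.1f} GB is free"
    )
```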
vllm/utils/device_utils.py
Outdated
```python
        precision = "bf16"
    else:
        precision = "fp16"
except:
```
Using a bare except: is discouraged as it can catch and hide unexpected errors, including system-exiting exceptions like SystemExit or KeyboardInterrupt. It's better to catch a specific exception, or Exception at the very least, to avoid unintended side effects and make debugging easier.
Suggested change:

```python
except Exception:
```
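For illustration only (not vLLM code; the helper below is hypothetical), the practical difference is that a bare `except:` also swallows `KeyboardInterrupt` and `SystemExit`, while `except Exception:` lets them propagate:

```python
def detect_precision(default: str = "fp16") -> str:
    """Hypothetical helper sketching the pattern under discussion."""
    try:
        import torch
        return "bf16" if torch.cuda.is_bf16_supported() else "fp16"
    except Exception:  # unlike a bare except, Ctrl-C still interrupts the program
        return default
```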
| if "activations_gb" in estimates: | ||
| total += estimates["activations_gb"] | ||
| estimates["total_estimate_gb"] = total |
The total_estimate_gb calculation is incomplete. It includes weights_gb and activations_gb, but it omits embeddings_gb, which can be a significant part of a model's memory footprint. The total estimate should include all relevant components for a more accurate result.
```python
total = weight_memory
if "activations_gb" in estimates:
    total += estimates["activations_gb"]
if "embeddings_gb" in estimates:
    total += estimates["embeddings_gb"]
estimates["total_estimate_gb"] = total
```
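As a rough sanity check of the magnitudes involved (illustrative numbers only, not the PR's exact formula): weights alone for a 7B-parameter model in fp16/bf16 already take about 13 GiB, so dropping any component can noticeably skew the total.

```python
# Back-of-the-envelope weight memory for a 7B-parameter model in fp16/bf16.
num_params = 7e9
bytes_per_param = 2  # fp16/bf16
weights_gib = num_params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB for weights alone")  # ~13.0 GiB
```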
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Where are these utils being used in vLLM? Otherwise it's just dead code.
Fix code review issues: device selection, context manager, memory calculations Signed-off-by: codebasecomprehension <sayakmondal56@gmail.com>
Thanks for the detailed feedback! I've fixed all the issues; all suggestions have been applied. Ready for re-review!
Summary
This PR adds a new `vllm/utils/device_utils` module as part of the vLLM utils cleanup effort (Issue #26900). The module provides utilities for GPU device management, memory monitoring, and hardware-specific operations.

Changes
- `vllm/utils/device_utils.py`: New module with:
  - `get_device_property()`: Query GPU capabilities (name, memory, compute cap)
  - `get_gpu_name()`: Get GPU device name
  - `get_gpu_memory_info()`: Comprehensive memory statistics
  - `get_gpu_utilization()`: Memory usage percentage
  - `get_available_gpu_memory()`: Free memory in GB
  - `clear_gpu_caches()`: Release cached memory
  - `device_memory_tracing()`: Context manager for profiling
  - `get_device_count()`: Number of available GPUs
  - `is_using_gpu()`: Check CUDA availability
  - `get_current_device()`: Current device index
  - `estimate_model_memory_requirements()`: Plan resource allocation
- `tests/test_device_utils.py`: Comprehensive unit tests

Motivation
This contribution addresses Issue #26900 (Clean up vllm.utils) by consolidating GPU device and memory helpers into a dedicated, unit-tested module instead of the general-purpose utils.
Example Usage
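A minimal sketch of the intended API, based on the function names listed above; exact signatures, return-dict keys, and the `device` keyword are assumptions and may differ from the final implementation:

```python
from vllm.utils.device_utils import (
    clear_gpu_caches,
    get_gpu_memory_info,
    get_gpu_utilization,
    is_using_gpu,
)

if is_using_gpu():
    # Inspect current memory state on device 0 (dict keys are assumed).
    info = get_gpu_memory_info(device=0)
    print(f"Allocated: {info.get('allocated_gb', 0.0):.2f} GB")
    print(f"Utilization: {get_gpu_utilization(device=0):.1f}%")

    # Release cached blocks back to the driver.
    clear_gpu_caches(device=0)
```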