ON HOLD - [Core] Lazy/Delayed CUDA graph #23184
base: main
Conversation
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited check runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
vllm/v1/worker/utils.py (Outdated)

    @contextmanager
    def freeze_gc(allow_collect: bool = True):
This function has been moved from gpu_model_runner.py to utils.py, based on this comment.
ProExpertProg
left a comment
Thanks for integrating this! I took a quick look and I have 3 overall pieces of feedback:
- If a cudagraph exists for the current num_tokens, why are we capturing cudagraphs for sizes we haven't seen? Shouldn't we just wait until we see that size? I know that increases the unpredictability, but it also increases savings, as we don't capture cudagraphs we rarely use until we do use them.
- Instead of capturing 0 cudagraphs at the start, should we at least capture the cudagraph for the largest capture size so that future cudagraphs can reuse that memory?
- Could we build in a mechanism to capture cudagraphs while the server sits idle?
vllm/config/compilation.py (Outdated)

    """Sizes to capture cudagraph.
    - None (default): capture sizes are inferred from vllm config.
    - list[int]: capture sizes are specified as given."""
    use_cudagraph_delayed_capture: bool = False
Nit: I would call this lazy_cudagraph_capture or delay_cudagraph_capture
Agreed. To maintain consistency across our codebase, I recommend retaining the use of the use_ prefix before variable names. This convention provides a clear understanding of the variable's nature and its boolean type, as demonstrated by examples like use_cudagraph or use_inductor.
vllm/config/compilation.py (Outdated)

    to note that this speedup during initialization may result in an
    increased Time-To-First-Token (TTFT) for the initial token inferences."""
I would also say it makes TTFT less predictable
Agreed. I'll add it.
vllm/v1/worker/gpu_model_runner.py (Outdated)

    gc_collect = (
        not specific_token_num) if specific_token_num is not None else True
What is this gc_collect - it's very unclear when it's supposed to be true or not
The gc_collect variable is utilized within the freeze_gc function to enable garbage collection. However, this operation can consume a considerable amount of time and negatively impact the performance of cudagraph capture. In eager mode (during initialization), the collect operation is performed only once, which is acceptable. To prevent a repeated collection for every single cudagraph during lazy-mode captures, we set gc_collect to False when using specific_token_num to capture just one specific cudagraph. In that case the garbage collector does not collect; the function only freezes the heap for that capture.
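For reference, a minimal sketch of the behavior described above (illustrative only, not the exact PR code):

    import gc
    from contextlib import contextmanager

    @contextmanager
    def freeze_gc(allow_collect: bool = True):
        """Freeze the GC heap while a CUDA graph is being captured.

        gc.collect() is expensive, so it runs only in the eager path
        (allow_collect=True, once at init); lazy per-step captures skip it.
        """
        if allow_collect:
            gc.collect()
        gc.freeze()  # exclude surviving objects from future collections
        try:
            yield
        finally:
            gc.unfreeze()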
vllm/v1/worker/gpu_model_runner.py (Outdated)

    not specific_token_num) if specific_token_num is not None else True
    set_cudagraph_capturing_enabled(
        True) if not specific_token_num else None
    with freeze_gc(gc_collect), graph_capture(device=self.device):
Either call this maybe_freeze_gc or preferably use an ExitStack and call enter_context inside an if statement
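A minimal sketch of the ExitStack variant (illustrative; it reuses freeze_gc and graph_capture from the surrounding code and is not the actual implementation):

    import gc
    from contextlib import ExitStack

    def capture_graphs(specific_token_num: int | None = None) -> None:
        with ExitStack() as stack:
            if specific_token_num is None:
                # Eager path: collect once up front, then keep the GC frozen
                # for the whole capture loop.
                gc.collect()
                stack.enter_context(freeze_gc())
            # Lazy path (specific_token_num given): skip both collect and freeze.
            stack.enter_context(graph_capture(device=device))
            ...  # run the dummy forward pass(es) and capture here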
OK
vllm/v1/worker/utils.py (Outdated)

    # gc collector is a time consuming operation
    if allow_collect:
        gc.collect()
Just call gc.collect outside the function, doesn't need to be inside the context manager
OK
vllm/v1/worker/gpu_worker.py (Outdated)

    all_gather_group=get_tp_group()))

    if (self.vllm_config.compilation_config.use_cudagraph_delayed_capture
            and not self.model_config.enforce_eager
Remove check for enforce_eager here - instead set use_cudagraph_delayed_capture to False in config init if enforce_eager is set
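For illustration, the suggestion amounts to something like this inside the config's __post_init__ (a sketch, not the actual code):

    # In VllmConfig.__post_init__ (sketch): resolve the flag once, so call
    # sites never need to re-check enforce_eager.
    if self.model_config is not None and self.model_config.enforce_eager:
        self.compilation_config.use_cudagraph_delayed_capture = False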
Agreed.
    # Check if there are any entries left in _token_compiled_cudagraphs
    else:
        # Update next_comp to the first item and remove it from the list
        next_capture = self.incomplete_cudagraph_capture.pop(0)
Why would we want to capture cgs for an unrelated number of tokens?
I think this comment aligns with my response to your first bullet: we don't know how many tokens future inferences will have, but we want to keep the window of unpredictability as short as possible.
Hi @ProExpertProg, thanks a lot for your review and your suggestions. I essentially agree with you on many of them. Individual comments for each bullet below:
As @lionelvillard told me, our primary objectives with this approach are twofold: 1) minimizing the startup time in a cold-start scenario, and 2) controlling the unpredictability at the initial stages of the inference process. As you have already noticed, there is a trade-off between predictability and speed. However, considering a hypothetical situation where multiple users share the same node, it becomes apparent that limiting variability to seconds or minutes could yield more substantial benefits over time.
I agree with this one. Makes sense to me and I'll implement it.
Good idea. I propose to do it in a separate PR to keep the code small and clean.
@diegocastanibm how about turning the delayed flag into: Also, after we capture the first (largest) cudagraph, we should report the memory usage and warn if we think we might run out of memory.
Sounds good!
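A rough sketch of what the memory report after the first (largest) capture could look like (torch.cuda.mem_get_info is a standard PyTorch API; the 5% threshold and log wording are placeholders):

    import logging
    import torch

    logger = logging.getLogger(__name__)

    def report_cudagraph_memory(headroom_fraction: float = 0.05) -> None:
        """Log GPU memory after capturing the largest CUDA graph and warn
        if the remaining headroom looks too small for later lazy captures."""
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        logger.info("After CUDA graph capture: %.2f GiB free of %.2f GiB total",
                    free_bytes / 2**30, total_bytes / 2**30)
        if free_bytes < headroom_fraction * total_bytes:
            logger.warning("Low GPU memory after CUDA graph capture; "
                           "subsequent lazy captures may run out of memory.")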
Also, I think it would be great if you could add tests to the existing cudagraph dispatcher and wrapper tests.
WoosukKwon
left a comment
Hi @diegocastanibm, thanks for the PR.
I think capturing cuda graphs at runtime affects reliability a lot.
Also for simplicity, I think we should strictly limit it to init time only.
* Change name to use_delay_cudagraph_capture
* Capture largest size during init
* Better description
* Changes in GC
* Enforce_eager and use_delay_cudagraph_capture in config init

Signed-off-by: Diego-Castan <diego.castan@ibm.com>
…nto delayed_cuda_graph
Indeed, it is feasible and beneficial to present additional choices. I will implement it soon.
Sure! No problem
fhl2000
left a comment
Hi @diegocastanibm, Thanks for implementing this. Left a few comments below.
    if next_capture:
        logger.debug(
            "CUDAgraph in execution model time for %d input tokens",
            next_capture)
        self.model_runner.capture_model(next_capture)
Also, we should check that once the list is empty, cudagraph capturing is disabled globally via set_cudagraph_capturing_enabled(False), for extra safety.
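Sketch of that extra safety check (illustrative, reusing the names from the diff above):

    if next_capture:
        logger.debug(
            "CUDAgraph in execution model time for %d input tokens",
            next_capture)
        self.model_runner.capture_model(next_capture)
    if not self.incomplete_cudagraph_capture:
        # Nothing left to capture: globally forbid further captures for safety.
        set_cudagraph_capturing_enabled(False)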
    # Check if the scheduled token count is in our compiled CUDAgraphs list
    # Priority to capture the token count that is in execution
    if total_num_scheduled_tokens in self.incomplete_cudagraph_capture:
Why not use the padded num tokens for the cudagraph? The raw num_scheduled_tokens will rarely be hit exactly here.
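For illustration, rounding up to the nearest capture size could look like this (a self-contained sketch; vLLM has its own padding helper, this is not it):

    import bisect

    def pad_for_cudagraph(num_tokens: int, capture_sizes: list[int]) -> int:
        """Round num_tokens up to the smallest captured size that fits;
        fall back to the raw count (eager execution) if none is large enough."""
        sizes = sorted(capture_sizes)
        idx = bisect.bisect_left(sizes, num_tokens)
        return sizes[idx] if idx < len(sizes) else num_tokens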
This pull request has merge conflicts that must be resolved before it can be merged.
Based on the feedback from reviewers, we're putting this PR on hold. In our view, the work here represents valuable progress towards a faster and more efficient vLLM initialization, but we need to address the concerns raised before moving forward.
@diegocastanibm what are the biggest concerns in your mind? It's unfortunate that you currently have to make a trade-off between performance and startup latency.
The biggest concerns on my mind are exactly what you've highlighted - the current trade-off between performance and startup latency is a significant limitation that affects real-world deployment scenarios.
This might be a naive question, but are the CUDA graphs cached anywhere? Could you generate all the CUDA graphs for all the models/batch sizes you intend to use in your deployment and then mount them to your auto-scaled machine's storage for vLLM to pick up?
That's actually a good question! AFAIU (and please, correct me if I'm wrong), CUDA graphs can't really be pre-generated and cached in the way you're suggesting for a few key reasons:
The current approach where vLLM builds the CUDA graphs on-demand during warmup is really the most reliable way to handle this. The graphs need to be created in the actual runtime environment where they'll be executed to ensure compatibility and optimal performance.
I encountered an issue when using Lazy CUDA Graph: during multi-TP inference, the first request (i.e., the one that triggers inference) returns garbled output. Have you encountered this issue before? @diegocastanibm
This PR is on hold and I haven't run it again since Sept 2nd; so many things have changed in vLLM since then. However, when I was working on this PR, I never had an issue, and I tested it multiple times and with different models and accelerators.
Purpose
Currently, vLLM captures CUDA graphs as part of engine initialization, which significantly slows down vLLM startup. We propose capturing CUDA graphs lazily: instead of performing the dummy runs during engine initialization, they are performed during execution. The highest capture priority is given to the current runtime shape if it is not cached already; otherwise, the largest shape that has not yet been captured is captured.
More info in this issue.
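A minimal sketch of that priority rule (hypothetical names, not the actual vLLM code):

    def pick_next_capture(current_size: int, pending: list[int]) -> int | None:
        """Return the next CUDA graph size to capture lazily, or None when done.

        Prefer the shape currently being executed if it is still uncaptured;
        otherwise capture the largest remaining shape.
        """
        if not pending:
            return None
        choice = current_size if current_size in pending else max(pending)
        pending.remove(choice)
        return choice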
Test Plan
Baseline with current approach:
CUDAGraph capturing during model execution:
Test Result
Benchmark results
CUDAGraph during the initialization (baseline):
Delayed CUDAGraph capture during inference using 1000 prompts from ShareGPT:
Memory and timing of CUDAGraph captures with model meta-llama/Llama-3.1-8B-Instruct:
Normal CUDAGraph capture during initialization:
CUDAGraph capture during inference:
CUDA Graph capture during inference with FULL CUDA Graph:
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.