Build KEDA External Scaler with NVML Metrics

This title was summarized by AI from the post below.

Cloud Native Computing Foundation (CNCF)

168,636 followers

Stop scaling GPU workloads on blind CPU/memory metrics. In this guide, Pavan Madduri breaks down how to build a KEDA external scaler that uses a DaemonSet to pull NVML metrics directly over gRPC, giving you sub-second scaling while bypassing the Prometheus pipeline entirely. 🧠 Read the full walkthrough: https://bit.ly/4nTPVEv #Kubernetes #KEDA #CloudNative #AIInfrastructure
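For readers who want a feel for the moving parts, here is a minimal sketch of the decision logic such a scaler might run. It is not the code from the linked walkthrough. It assumes per-node agents (the DaemonSet) push their latest per-GPU utilization to the scaler, and it uses plain Python in place of gRPC. The method names loosely mirror KEDA's external scaler RPCs (IsActive, GetMetricSpec, GetMetrics). The class name, node names, metric name, and thresholds are all hypothetical, and the replica formula assumes average-value semantics while ignoring HPA tolerance and min/max bounds.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSample:
    node: str               # hypothetical node name reported by a DaemonSet agent
    gpu_index: int
    utilization_pct: float  # 0-100, the kind of value NVML reports for GPU utilization


class GpuScalerSketch:
    """Plain-Python stand-in for the decisions an external scaler makes.

    No gRPC is involved; a real scaler would expose these as RPC handlers.
    """

    def __init__(self, target_per_replica, activation_threshold=0.0):
        self.target_per_replica = target_per_replica
        self.activation_threshold = activation_threshold
        self._latest = {}  # (node, gpu_index) -> most recent GpuSample

    def report(self, sample):
        # Keep only the most recent sample for each physical GPU.
        self._latest[(sample.node, sample.gpu_index)] = sample

    def is_active(self):
        # Active as soon as any GPU is busier than the activation threshold.
        return any(
            s.utilization_pct > self.activation_threshold
            for s in self._latest.values()
        )

    def metric_spec(self):
        # Hypothetical metric name; the target is utilization per replica.
        return {
            "metric_name": "gpu_utilization_total",
            "target_size": self.target_per_replica,
        }

    def metric_value(self):
        # Total utilization across every GPU that has reported.
        return sum(s.utilization_pct for s in self._latest.values())


def desired_replicas(metric_value, target_per_replica):
    # Assumption: average-value semantics, so replicas = ceil(total / target).
    return max(1, math.ceil(metric_value / target_per_replica))


scaler = GpuScalerSketch(target_per_replica=70.0, activation_threshold=5.0)
for sample in [
    GpuSample("node-a", 0, 90.0),
    GpuSample("node-a", 1, 85.0),
    GpuSample("node-b", 0, 40.0),
    GpuSample("node-b", 0, 95.0),  # newer reading replaces the 40.0 above
    GpuSample("node-b", 1, 80.0),
]:
    scaler.report(sample)

print(scaler.is_active(), scaler.metric_value(),
      desired_replicas(scaler.metric_value(), scaler.target_per_replica))
```

Because the agents push their readings to the scaler, the value it serves is only as old as the last report, which is the property the post attributes to skipping the Prometheus scrape cycle.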

3 Comments

Venkata Ramesh 3d

Very insightful architecture, Pavan Madduri. One of the common challenges with AI and ML platforms is that infrastructure scaling decisions are often disconnected from actual GPU demand. Bypassing the traditional metrics pipeline and using direct NVML-driven signals is an interesting approach to achieving faster and more accurate scaling behavior. This is the kind of cloud-native innovation that helps bridge the gap between platform operations and AI workload requirements.

1 Reaction

Pavan Madduri 3d

Thanks Cloud Native Computing Foundation (CNCF) for featuring this! Bypassing the Prometheus pipeline and reading NVML metrics directly over gRPC was a massive friction point we had to solve for sub-second scaling. Glad to share the KEDA external scaler architecture with the broader cloud-native community—hope it helps others scaling heavy GPU inference workloads!

Alessandro Corsico 2d

Really interesting!


More Relevant Posts

Erik Andersen
2w
A new chapter for our team and a new layer in the AI infrastructure stack: MinIO MemKV. As inference workloads scale, context is a constraint, increasing cost and limiting throughput. MemKV solves this and turns context into a shared resource across GPUs.

Introducing MinIO MemKV: Purpose built Context Store for Inference at scale min.io
Abraham Zvi Barak
2w
IN AI INFERENCE, "Late" IS THE SAME AS "Broken"! If you are waiting for a "Scale Event" to trigger after your latency spikes, you’ve already lost the user. Traditional reactive autoscaling works for web apps, but it fails for LLMs. When memory contention hits, latency doesn't crawl; it explodes. 🎇 SwarmOne provides a different path: 🆕 SLO-Driven Autoscaling. 🆕 Proactive: We scale based on real-time P99 trends, not stale CPU metrics. 🆕 Fast: Adjustments happen in milliseconds, not minutes. 🆕 Efficient: We turn "hoarded" hardware into 90%+ active utilization. How does your team deal with the latency issue? 🧐 At what cost? 💰 #Inference #Scalability #AI #SwarmOne #Performance
SwarmOne

2,164 followers
1mo

Kubernetes HPA looks at CPU utilization. When CPU hits 80%, it adds pods. Inference latency doesn't correlate with CPU. A GPU can be at 35% utilization and your P99 is already 8 seconds - because every request is competing for the same KV cache memory. CPU looks fine. Memory looks fine. Users are waiting. By the time HPA reacts, the damage is done. Reactive autoscaling is always too late for inference workloads. SwarmOne autoscales on inference-native signals. P99 TTFT trending up? Scale before it crosses the SLO. Token throughput dropping? Add capacity before users notice. Milliseconds, not minutes.
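To make the contrast concrete, here is a small sketch with two parts. The first is the standard HPA replica formula, `ceil(currentReplicas * currentValue / targetValue)`, shown without its tolerance and stabilization behavior. The second is a generic trend check that projects recent P99 samples forward and flags a scale-up before the SLO is crossed. The trend check is purely illustrative and is not SwarmOne's implementation. The function names, sample values, sampling interval, and projection horizon are all assumptions.

```python
import math


def hpa_desired_replicas(current_replicas, current_value, target_value):
    # Standard HPA ratio formula, ignoring tolerance and stabilization windows.
    return math.ceil(current_replicas * current_value / target_value)


def projected_value(samples, interval_s, horizon_s):
    """Fit a least-squares line to the samples and extend its slope
    horizon_s seconds past the most recent sample."""
    n = len(samples)
    if n < 2:
        return samples[-1]
    xs = [i * interval_s for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return samples[-1] + slope * horizon_s


def should_scale_early(samples, slo, interval_s=1.0, horizon_s=3.0):
    # Scale when the projected value, not the current one, breaches the SLO.
    return projected_value(samples, interval_s, horizon_s) > slo


# Utilization at 35% against an 80% target: the ratio formula sees no reason to add pods.
print(hpa_desired_replicas(current_replicas=4, current_value=35, target_value=80))

# Hypothetical P99 time-to-first-token samples (seconds), one per second.
p99_ttft = [1.0, 1.5, 2.0, 2.5]
print(should_scale_early(p99_ttft, slo=3.0))
```

In this example the latest P99 sample is still under the 3-second SLO, so a check on the current value alone would not fire, while the projection does.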
Pranav Sharma
4d
AI ambitions are only as strong as the infrastructure behind them. But when GPU costs keep rising and utilization stays invisible, it's hard to know whether you're investing wisely or just spending. Datadog GPU Monitoring, now generally available, gives teams end-to-end visibility, from GPU health and performance to the workloads and costs tied to every piece of hardware. The result: more efficient capacity usage and a clearer path to AI revenue.

Datadog

534,293 followers
4d

Your GPUs are expensive. Are they earning their keep? Datadog GPU Monitoring is now GA — tying workloads and costs to hardware so your teams know exactly where capacity is going and how to use it better. Get more out of what you have: https://bit.ly/4vwFzgv
Swift Compute

2,195 followers
3w
Pushing an 8×H100 cluster on Swift Compute and the numbers are looking strong 💪 - 97.4% GPU utilization across all 8 GPUs - 148,162 tokens/sec - Running stable for 4h+ at 72°C This is exactly the kind of efficient, high-throughput performance we’re building for. Still very early (pre-product), but the foundation is coming together nicely. Excited to open this up to more users soon. If you’re training or inferencing at scale and want early access, let us know 👇 #AIInfrastructure #GPUCluster #LLM #SwiftCompute
TheNextGenTechInsider.com

753 followers
4d
Deploy Production-Ready vLLM Inference Servers on Kubernetes Using AMD Instinct GPUs 📌 Unlock massive throughput for enterprise LLMs by deploying production-ready vLLM inference servers on Kubernetes using AMD Instinct GPUs. This new framework automates high-performance orchestration, leveraging the MI300X's vast memory to eliminate VRAM bottlenecks and scale workloads seamlessly. It provides a robust, automated path to running large-scale models with high availability and near-zero memory waste. 🔗 Read more: https://lnkd.in/dguRSzVk #Vllm #Kubernetes #Amdinstinct #Gpuoperator #Largelanguagemodels
1Legion

377 followers
3w
Shared GPU rental means competing for VRAM, bandwidth, and compute with tenants you don't control. Dedicated bare metal means one server, one tenant, full output, every run. The performance difference is real. At scale, the cost difference usually surprises people. The 8x RTX Pro 6000 Blackwell Max-Q is available now on 1Legion: 768 GB total VRAM, from $1.34/GPU/hr, no egress fees. Talk to an Engineer: https://lnkd.in/dDiWdTqT #baremetalgpu #gpuinfrastructure #generativeai #aiinference
Doug Brown
3w
Thanks for sharing, Sara Carroll. Weka is positioned extremely well for customers who've invested in Kubernetes, or plan to. The intersection of containers and AI is here now. The best part: in a hyperscaler, on-premises, or both, Weka is here to help.

WEKA

38,597 followers
1mo

🤔 Scaling Kubernetes but not seeing performance gains? It’s probably not your CPU. As workloads scale, many teams hit the same wall: low CPU, idle GPUs, and flat throughput—because storage can’t keep up with parallel I/O demands. The fix isn’t more compute. It’s rethinking storage. With NeuralMesh, storage scales with Kubernetes—unlocking: 🟣 2–3x higher training throughput 🟣 GPU utilization increases from ~43% to >90% 🟣 Real performance gains without changing your apps If your workloads feel “stuck,” this might be why. 🔗 https://weka.ly/4mk07Fz
Kube Builders

12,397 followers
6d
This tutorial teaches how to run a production-ready vLLM inference server on Kubernetes with AMD Instinct GPUs using containerd, the AMD GPU Operator, persistent storage, and MetalLB. More: https://ku.bz/D6PyJ-sWz
Drew Steinberg
1mo
💡 What does it take to move an #AIFactory from proof-of-concept to production? NVIDIA Enterprise Reference Architectures provide infrastructure guidance for on-premises deployments, defining how #compute, networking, storage, software, and system components integrate into a production-ready #AI platform. Read about three configurations for AI factories, built around NVIDIA #RTX PRO Blackwell Server Edition GPUs, #HGX systems, and GB300 NVL72. ➡️ https://bit.ly/42EC4bc #GPU #DataCenter
Jesmin Jahan Tithi, Ph.D
2w Edited
Scaling GPUs breaks the network before compute! What I mean by that is: compute seems to be scaling linearly in the upcoming systems, but network scaling hits I/O limits, switch radix limits, and packaging limits!

3 Comments
