𝗘𝘅𝗽𝗹𝗮𝗶𝗻 𝗧𝗵𝗶𝘀: 𝗟𝗹𝗮𝗺𝗮 𝟯 𝗡𝗲𝗲𝗱𝘀 𝟮.𝟰𝗧𝗕. 𝗬𝗼𝘂𝗿 𝗚𝗣𝗨 𝗛𝗮𝘀 𝟴𝟬𝗚𝗕. 𝗜𝘁 𝗦𝘁𝗶𝗹𝗹 𝗧𝗿𝗮𝗶𝗻𝘀. Training Llama-3 405B needs ~2.4TB with BF16 + 8-bit Adam: • Weights: 810GB • Gradients: 810GB • Optimizer: 810GB (vs 3.24TB with standard Adam!) • Total: ~2.4TB (Illustrative budget—config-dependent; FP32 masters, ZeRO stage, and offload change totals) Your H100? 80GB. You'd need 30+ GPUs just to hold everything. 𝗧𝗵𝗿𝗲𝗲 𝗧𝗿𝗶𝗰𝗸𝘀 𝗧𝗵𝗮𝘁 𝗠𝗮𝗸𝗲 𝗜𝘁 𝗪𝗼𝗿𝗸 𝟭. 𝗗𝗮𝘁𝗮 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split batch. Problem: Each GPU needs 2.4TB. Fix: ZeRO splits it across N GPUs. 𝟮. 𝗠𝗼𝗱𝗲𝗹 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split layers. Problem: Sequential bottleneck. Fix: Pipeline batches. 𝟯. 𝗦𝗲𝗾𝘂𝗲𝗻𝗰𝗲 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split tokens. This is the game changer. 8K tokens → 8 GPUs → 1K each. But attention needs every token to see all others. 𝗧𝗵𝗲 𝗠𝗮𝗴𝗶𝗰 𝗠𝗼𝗺𝗲𝗻𝘁: Instead of moving the 2.4TB model, GPUs only exchange attention keys/values (K,V). Each GPU: • Computes K,V for its 1K tokens (32MB) • Sends to others via all-to-all • Receives 7×32MB = 224MB total • Computes attention, deletes copies 𝟮𝟮𝟰𝗠𝗕 𝗺𝗼𝘃𝗲𝗱 𝗶𝗻𝘀𝘁𝗲𝗮𝗱 𝗼𝗳 𝟮.𝟰𝗧𝗕. That's 10,000x less. 𝗧𝗵𝗲 𝗥𝗲𝘀𝘂𝗹𝘁: Combine all three (ZeRO + tensor + pipeline + sequence parallel). Each GPU holds ~75GB instead of 2.4TB. This exact choreography powers ChatGPT, Claude, and every frontier model. Without it? 10K token limits. With it? Entire books in one context. Not magic. Just brilliant engineering making the impossible routine.
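The budget in the post above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming a dense 405B-parameter model, 2 bytes per BF16 weight and gradient, and two optimizer moments per parameter (1 byte each for 8-bit Adam, 4 bytes each for FP32 Adam). The function name and the 80GB constant are illustrative, and activations, FP32 master weights, and ZeRO/offload settings are ignored.

```python
import math

GPU_MEMORY_GB = 80  # H100 capacity quoted in the post


def training_memory_gb(params_billion, weight_bytes=2, grad_bytes=2, optim_bytes=2):
    """Return (weights, grads, optimizer, total) in GB for a dense model.

    optim_bytes is optimizer state per parameter: 2 for 8-bit Adam
    (two 1-byte moments), 8 for FP32 Adam (two 4-byte moments).
    """
    weights = params_billion * weight_bytes
    grads = params_billion * grad_bytes
    optim = params_billion * optim_bytes
    return weights, grads, optim, weights + grads + optim


weights, grads, optim, total = training_memory_gb(405)
gpus_needed = math.ceil(total / GPU_MEMORY_GB)
fp32_adam = training_memory_gb(405, optim_bytes=8)[2]

# Sequence-parallel K,V exchange from the post: 8 GPUs, 32 MB of K,V per GPU,
# so each GPU receives the chunks of the 7 others.
kv_received_mb = (8 - 1) * 32

print(f"weights={weights} GB, grads={grads} GB, optimizer={optim} GB, total={total} GB")
print(f"GPUs needed just to hold the state: {gpus_needed}")
print(f"FP32 Adam optimizer state: {fp32_adam} GB")
print(f"K,V received per GPU: {kv_received_mb} MB")
```

The 32 MB per-GPU figure is taken from the post as given; the real value depends on the layer count, the number of KV heads, and the head dimension.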
GPU Programming Insights
-
Today was a big milestone for us - we launched CUDA Tile IR, a new tile-based programming model for our GPUs. CUDA Tile IR has two components: 1. cuTile - a Python DSL that dramatically simplifies writing high-performance CUDA kernels. 2. Tile IR - a language-agnostic virtual instruction set that (third party) compilers or DSLs can target. This is just the beginning. You'll see more features, broader platform support, and continued performance improvements in every CUDA release going forward. I'd love to hear from developers, researchers, and compiler folks who plan to explore cuTile or Tile IR. In particular, I'm excited to learn about novel algorithms or kernels built on cuTile, and new programming-language or compiler techniques unlocked by Tile IR. If you're interested, here are some great starting points: cuTile reference: https://lnkd.in/grmv8C3b Tile IR specification: https://lnkd.in/grjE7hBG Blog post 1: https://lnkd.in/gWUZ4sP2 Blog post 2: https://lnkd.in/gaf82Ybb
-
Sometimes when you set out to solve something small you end up delivering something huge. The team at Q-CTRL just did that with our partners NVIDIA and Oxford Quantum Circuits (OQC), achieving a totally new #GPU-optimized algorithm for the subgraph-isomorphism problem. One of the toughest challenges when it comes to practical scaling of #quantumcomputing is how to map the problem of interest onto the device at hand. Which qubits are best? Which connectivity is most efficient? How can you use mathematical tricks to reduce the number of operations (and hence reduce opportunities for error)? Which parts of the process can be sped up with classical techniques? These questions are all part of a task called compilation, and even though it's less sexy than other areas, it's an actual performance bottleneck for most users in the real world. We set out to investigate how to speed up certain subroutines with #GPUs, and in the process achieved something even more profound. The underlying problem is called the subgraph-isomorphism problem, which is key to a range of #AI/ #machinelearning tasks. There are tons of algorithms for solving this problem, but most are stubbornly resistant to parallelization, rendering GPUs much less useful than in other areas. Until now. Working with NVIDIA and OQC, we developed a novel solution to this problem that combines insights from the graph database and analytics community with data science techniques, and leverages well-established open source software. Our new approach, named Δ-Motif, replaces traditional backtracking strategies with a data-centric approach that decomposes the graphs into fundamental motifs (small, reusable building blocks like paths and cycles), represents them in tabular formats, and models graph processing with relational database operations like merges and filters. 
This shift transforms an inherently sequential problem into one that can be executed in parallel at scale, unlocking new levels of efficiency in graph processing. In an implementation on NVIDIA GPUs we achieved up to 600X speedups in wall clock time using test graphs, quantum-algorithm benchmarks (QASMBench) and classical ML benchmarks (SuiteSparse Matrix Collection). This is an amazing example of how pushing the frontiers of PRACTICAL #quantumcomputing can deliver huge outcomes of much broader appeal. Our team is proud of this development and excited to continue expanding our partnerships with NVIDIA and OQC as we help deliver true #hybridcompute to the #datacenter. Jin-Sung Kim Oded Green Gerald Mullally Jamie Friel Jensen Huang Atsushi Sugiura Read more at our blog post: https://lnkd.in/gn-Bsurp Technical manuscript: https://lnkd.in/gxaJsumG
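To make the "merges and filters" idea concrete, here is a toy sketch that finds triangles by joining an edge table with itself instead of backtracking. It is not the Δ-Motif implementation (the post shows no code). The function name, the triangle pattern, and the sample graph are all assumptions chosen for illustration, and it runs on the CPU in plain Python.

```python
from collections import defaultdict


def join_paths_to_triangles(edges):
    """Find triangles (a < b < c) with table-style joins, no backtracking."""
    # Edge table: one row per undirected edge, stored with a < b.
    edge_table = {(min(u, v), max(u, v)) for u, v in edges if u != v}

    # Index the edge table by its first column (the build side of a hash join).
    by_src = defaultdict(list)
    for a, b in edge_table:
        by_src[a].append(b)

    # Merge: edge (a, b) joined with edge (b, c) yields the 2-edge path motif a-b-c.
    paths = [(a, b, c) for a, b in edge_table for c in by_src[b]]

    # Filter: keep only paths whose closing edge (a, c) is also in the edge table.
    return sorted((a, b, c) for a, b, c in paths if (a, c) in edge_table)


sample_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4), (4, 5)]
triangles = join_paths_to_triangles(sample_edges)
print(triangles)  # [(0, 1, 2), (2, 3, 4)]
```

Each row of the merge and each row of the filter is independent of the others, which is the property that makes this style of processing a natural fit for data-parallel hardware.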
-
Most teams run GPUs far below their true potential: over 75% report peak utilization under 70%, even as billions are poured into hardware. Fujitsu’s AI Computing Broker (ACB) tackles this head-on by shifting from fixed allocation to real-time GPU orchestration. Instead of leaving GPUs idle during CPU-heavy phases, ACB dynamically assigns and reclaims resources through two components: • 𝗚𝗣𝗨 𝗔𝘀𝘀𝗶𝗴𝗻𝗲𝗿: distributes workloads intelligently • 𝗔𝗱𝗮𝗽𝘁𝗶𝘃𝗲 𝗚𝗣𝗨 𝗔𝗹𝗹𝗼𝗰𝗮𝘁𝗼𝗿: reclaims unused capacity on the fly No code changes. No workflow rewrites. Just more effective GPU usage. 🌍𝗥𝗲𝗮𝗹 𝗶𝗺𝗽𝗮𝗰𝘁: • In bioinformatics, ACB raised AlphaFold2 throughput to about 270% of baseline, from 12 to 32 proteins/hour (roughly 2.7x) • For LLM hosting, one server can now handle multiple models while keeping latency low • Works with Docker and Slurm, with Kubernetes support coming soon 𝗪𝗵𝗼 𝗯𝗲𝗻𝗲𝗳𝗶𝘁𝘀 𝘁𝗵𝗲 𝗺𝗼𝘀𝘁? ✔️Teams running workloads that alternate between CPU ↔ GPU phases ✔️Anyone juggling multiple concurrent GPU jobs ✔️Workloads needing full GPU memory only at certain steps ✔️Companies hosting several LLMs on shared hardware ✔️Organizations under pressure to cut infra costs without sacrificing performance Try beta: https://lnkd.in/g8UZXuPM How is your team managing GPU efficiency today? #GPUOptimization #artificialintelligence #AIInfrastructure #GenerativeAI
-
You're in a Senior ML Interview at NVIDIA. The interviewer sets a trap: "Your 7B model fits comfortably on a 24GB GPU. Yet, 10 minutes into a conversation, the service crashes with an Out-Of-Memory (OOM) error. Do we upgrade to an A100?" 90% of candidates walk right into it: "Yes, we need more VRAM." They think: "The model is running out of space, so we need a bigger bucket." This is the "Brute Force" approach. It solves the symptom for exactly one week until their users type longer prompts, and then they crash an 80GB card too. They just 4x'd the cloud bill without solving the physics of the problem. The reality is that they aren't optimizing for 𝐒𝐭𝐚𝐭𝐢𝐜 𝐌𝐞𝐦𝐨𝐫𝐲 (𝐖𝐞𝐢𝐠𝐡𝐭𝐬). They are dying from 𝐃𝐲𝐧𝐚𝐦𝐢𝐜 𝐒𝐭𝐚𝐭𝐞 (𝐂𝐨𝐧𝐭𝐞𝐱𝐭). In a production environment, GPU memory is consumed by two things: - 𝘔𝘰𝘥𝘦𝘭 𝘞𝘦𝘪𝘨𝘩𝘵𝘴: Fixed. (e.g., ~14GB for a 7B param model in FP16). - 𝘒𝘝 𝘊𝘢𝘤𝘩𝘦: Variable. This grows linearly with every single token generated. A 7B model with a batch size of 64 and a context length of 2048 tokens can generate over 30GB of KV cache. The "Ghost Memory" is larger than the model itself. ----- 𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧: The real problem isn't just the size of the cache - it's Memory Fragmentation. Standard PyTorch allocates contiguous memory blocks. As requests grow and shrink, they leave "holes" in your VRAM that are too small to use but add up to gigabytes of wasted space. This is The Swiss Cheese Effect. The fix isn't hardware. It's Architecture: 1️⃣ 𝘗𝘢𝘨𝘦𝘥𝘈𝘵𝘵𝘦𝘯𝘵𝘪𝘰𝘯 (𝘷𝘓𝘓𝘔): Treat GPU memory like an Operating System treats RAM. Break the KV cache into non-contiguous "pages" so you can fill every byte of VRAM without needing a continuous block. 2️⃣ 𝘒𝘝 𝘊𝘢𝘤𝘩𝘦 𝘖𝘧𝘧𝘭𝘰𝘢𝘥𝘪𝘯𝘨: If a user pauses for 30 seconds, move their KV cache to CPU RAM (cheap) and swap it back to GPU (expensive) only when they type again. 𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝: "Buying GPUs is a band-aid. The bottleneck is the KV Cache growing linearly with context. 
I would implement PagedAttention to eliminate memory fragmentation and KV Offloading to handle idle sessions. We only upgrade hardware if the active computation, not the idle state, saturates the compute units." #MachineLearning #DeepLearning #GenerativeAI #LLM #AIEngineering #MLOps #NVIDIA
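The KV cache figure in this post can be checked with back-of-the-envelope arithmetic. This is a sketch, assuming a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128) stored in FP16. The post does not name a specific model, so those defaults and the function name are assumptions, and a model using grouped-query attention would need far less.

```python
def kv_cache_bytes(batch, seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Bytes of KV cache for `batch` sequences of `seq_len` tokens each."""
    # Factor 2: every layer stores one key and one value vector per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * batch * seq_len


per_token_bytes = kv_cache_bytes(batch=1, seq_len=1)
cache_gib = kv_cache_bytes(batch=64, seq_len=2048) / 2**30
weights_gib = 7e9 * 2 / 2**30  # 7B parameters at 2 bytes each

print(f"KV cache per token: {per_token_bytes / 2**20} MiB")
print(f"KV cache at batch 64, 2048 tokens: {cache_gib} GiB")
print(f"FP16 weights: {weights_gib:.1f} GiB")
```

Under these assumptions the cache reaches 64 GiB, which is consistent with the post's "over 30GB" and several times the size of the weights.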
-
I watched a senior engineer spend three weeks quantizing an LLM to 4-bit. The P99 latency got worse. The issue wasn’t the technique; it was treating quantization as a storage problem instead of a memory-bandwidth problem. At Twitter, I spent a month debugging why our "optimized" models ran slower than the originals. The models were smaller. The math was correct. Yet latency regressed. The missing piece: the *unpacking tax*. Here’s the reality most benchmarks hide: Time ≈ Total bytes moved / Memory bandwidth On paper, moving from FP16 (16-bit) to INT4 (4-bit) means 4× less weight data moving across the memory bus per token. In a memory-bound regime, that translates to 3–4× higher throughput. But there’s a catch. With weight-only quantization, the GPU doesn’t run the matmul in 4-bit. The weights are dequantized back to FP16/BF16 on-chip (in registers or shared memory) before computation. That dequantization costs clock cycles and creates production surprises: → High batch sizes: Time saved on memory movement dominates = throughput improves → Batch size of 1: Unpacking overhead dominates = latency gets worse Quantization is not a free win. It’s a tradeoff. If you’re choosing a method, align it with your deployment reality: → GPTQ: Effective for static weights, but sensitive to outliers → AWQ: Preserves critical weights at higher precision for better quality → GGUF: The llama.cpp file format; excellent for CPU/Metal inference, less relevant for H100/A100 clusters This is Part 4 of a deep dive into inference optimization. Previous posts: Memory Wall: https://lnkd.in/gdT26UTV KV Cache: https://lnkd.in/gKkrqVzf Paged Attention: https://lnkd.in/gX5JNZhn Next up: I will break down the closest thing to "cheating physics" in ML: Speculative Decoding. What’s the most expensive quantization mistake you’ve seen in production: latency, quality, or operability?
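The "Time ≈ Total bytes moved / Memory bandwidth" rule from this post can be turned into a quick lower-bound estimate. This is a sketch, assuming a 7B-parameter model and roughly 3,350 GB/s of memory bandwidth (an H100-class figure). Both numbers and the function name are assumptions, and the estimate deliberately ignores dequantization cost, KV cache traffic, and compute time.

```python
def min_weight_read_ms(n_params, bits_per_weight, bandwidth_gb_s):
    """Lower bound on per-token time if every weight is read once from memory."""
    bytes_moved = n_params * bits_per_weight / 8
    seconds = bytes_moved / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


N_PARAMS = 7e9           # assumed model size
BANDWIDTH_GB_S = 3350    # assumed H100-class memory bandwidth

fp16_ms = min_weight_read_ms(N_PARAMS, 16, BANDWIDTH_GB_S)
int4_ms = min_weight_read_ms(N_PARAMS, 4, BANDWIDTH_GB_S)

print(f"FP16 weight read: {fp16_ms:.2f} ms/token")
print(f"INT4 weight read: {int4_ms:.2f} ms/token")
print(f"Ideal speedup: {fp16_ms / int4_ms:.1f}x")
```

The 4x ratio is the ceiling that pure byte counting predicts. Whether a deployment gets close to it depends on how much the unpacking step adds back, which has to be measured on the target kernel and batch size.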
-
nvidia now releases its most optimized inference kernels through a PhD student's open-source project. here's a breakdown of FlashInfer: FlashInfer is a GPU kernel library built specifically for LLM serving. it won Best Paper at MLSys 2025, powers both SGLang and vLLM, and NVIDIA now actively ships TensorRT-LLM kernels through it. the creator, Zihao Ye, built it during his PhD at UW and now works at NVIDIA full-time. LLM serving has a combinatorial explosion of attention kernels. every combination of KV-cache layout (paged, radix tree, tree masks), attention variant (GQA, MLA, RoPE-fused, sliding window), and batch mode (prefill, decode, append, shared prefix) needs a different kernel. FlashInfer's insight was: all KV-cache layouts are special cases of block-sparse matrices. paged attention is just block-sparse with page_size as block width. radix tree? block-sparse. tree attention for speculative decoding? block-sparse. one abstraction can replace what used to be separate kernel implementations. then you get JIT compilation to handle the variant explosion, in the form of CUDA/CUTLASS templates that get specialized at runtime. there are two other major innovations built on top of this: 1. cascade attention: when multiple requests share a prefix (document QA, system prompts), FlashInfer decomposes attention into two stages: a multi-query kernel for the shared prefix (loaded once into SMEM, reused across all queries) and a batch decode kernel for unique suffixes. results merge using an associative operator on partial attention states. the reported result is up to a 31x speedup over vLLM's PagedAttention for 32K-token shared prefixes at batch size 256. 2. plan/run scheduling for CUDAGraph: LLM serving has dynamic sequence lengths. CUDAGraphs need static configurations. FlashInfer solves this with a two-phase pattern: plan() inspects request shapes and computes balanced scheduling metadata, run() launches kernels. you plan once per decode step, then replay across all transformer layers. 
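the "associative operator on partial attention states" in cascade attention can be sketched in plain Python. this is an illustration of the math only, not FlashInfer's API or kernels: the function names, the (output, log-sum-exp) state layout, and the toy vectors are assumptions, and the usual 1/sqrt(d) score scaling is left out for brevity.

```python
import math


def attention_state(q, keys, values):
    """Attention of one query over one chunk: returns (output, log-sum-exp)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)


def merge_states(state_a, state_b):
    """Combine two partial attention states as if computed over both chunks."""
    (out_a, lse_a), (out_b, lse_b) = state_a, state_b
    m = max(lse_a, lse_b)
    # Each chunk is weighted by its share of the total softmax mass.
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    out = [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]
    return out, m + math.log(wa + wb)


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Shared prefix = first two tokens, unique suffix = last two tokens.
prefix_state = attention_state(q, keys[:2], values[:2])
suffix_state = attention_state(q, keys[2:], values[2:])
merged_out, merged_lse = merge_states(prefix_state, suffix_state)
full_out, full_lse = attention_state(q, keys, values)
print(merged_out, full_out)
```

because the merge is associative, the prefix state can be computed once and combined with each request's suffix state in any order.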
FlashInfer is an amazing project that i deeply respect, so also want to share some links for anyone that wants to go deeper: - paper (MLSys 2025 Best Paper): https://lnkd.in/gc_CTbnf - github: https://lnkd.in/gwfQ8B72 - NVIDIA blog: https://lnkd.in/gzs_uquk - cascade attention deep dive: https://lnkd.in/gHGqdNTV - docs: https://docs.flashinfer.ai
-
Product Requirement: p95 latency < 1s. KV Cache stops redundant math. PagedAttention stops memory waste. But the GPU is still slow because moving data around is more expensive than the math itself. Flash Attention fixes this. To understand why, you need to know that GPUs have two types of memory: → SRAM: tiny, extremely fast, lives on-chip → HBM: large, much slower, where model weights and KV cache live The math itself is not the bottleneck. Floating point operations are fast. What is slow is the constant movement of data between SRAM and HBM. The standard attention mechanism has to write large intermediate matrices (the full score and softmax matrices) to HBM and read them back, which is slow and memory intensive. Flash Attention fixes this by restructuring how attention is computed so that far less data moves between the two. 𝘐𝘯𝘴𝘵𝘦𝘢𝘥 𝘰𝘧 𝘱𝘳𝘰𝘤𝘦𝘴𝘴𝘪𝘯𝘨 𝘵𝘩𝘦 𝘧𝘶𝘭𝘭 𝘢𝘵𝘵𝘦𝘯𝘵𝘪𝘰𝘯 𝘮𝘢𝘵𝘳𝘪𝘹 𝘢𝘵 𝘰𝘯𝘤𝘦, 𝘪𝘵 𝘣𝘳𝘦𝘢𝘬𝘴 𝘵𝘩𝘦 𝘢𝘵𝘵𝘦𝘯𝘵𝘪𝘰𝘯 𝘮𝘢𝘵𝘳𝘪𝘹 𝘪𝘯𝘵𝘰 𝘴𝘮𝘢𝘭𝘭𝘦𝘳 𝘵𝘪𝘭𝘦𝘴 that fit entirely inside SRAM. The intermediate scores stay on-chip and never make that slow round trip to HBM. But tiling creates one problem: an intermediate step in computing attention is a softmax over each row of the score matrix. Because softmax normalizes values, it needs the entire row, and tiling breaks that. Flash Attention solves this with 𝐨𝐧𝐥𝐢𝐧𝐞 𝐬𝐨𝐟𝐭𝐦𝐚𝐱 𝐫𝐞𝐬𝐜𝐚𝐥𝐢𝐧𝐠, which means as each new tile is processed, it rescales the previous results on the fly. By the end, the output is mathematically identical to standard attention. No approximation, no quality loss. This matters most during prefill, when the model processes your entire prompt at once and computes attention across all tokens simultaneously. For a long prompt, that attention matrix is massive, and this is where the SRAM/HBM bottleneck hits hardest. During decoding, attention is much lighter since you're only adding one token at a time.
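Online softmax rescaling is easier to see in code than in prose. The sketch below processes one query row tile by tile while keeping only a running max, a running sum, and an accumulator. It illustrates the arithmetic in plain Python and is not the Flash Attention kernel: the function names, tile size, and sample numbers are assumptions.

```python
import math


def tiled_attention_row(scores, values, tile_size):
    """One query row of softmax(scores) @ values, computed tile by tile."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(scores), tile_size):
        tile_scores = scores[start:start + tile_size]
        tile_values = values[start:start + tile_size]
        new_max = max(running_max, max(tile_scores))
        # Rescale everything accumulated so far to the new max
        # (the factor is 0.0 on the first tile, when running_max is -inf).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(tile_scores, tile_values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


def naive_attention_row(scores, values):
    """Reference: softmax over the whole row at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


scores = [0.3, 2.5, -1.0, 4.0, 0.0, 1.5, 3.2]
values = [[float(i), float(i * i)] for i in range(len(scores))]
tiled = tiled_attention_row(scores, values, tile_size=3)
naive = naive_attention_row(scores, values)
print(tiled, naive)
```

Only one tile of scores exists at a time, which is what lets the real kernel keep the working set in SRAM.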
-
The Rise of Python in NVIDIA's CUDA Ecosystem: A Paradigm Shift at GTC 2025 At this year's GTC, one thing became crystal clear: we've entered the "year of CUDA Python." The shift is more strategic than you might think. Stephen Jones, lead architect of the CUDA ecosystem, spent his talks showcasing ways to avoid writing traditional CUDA code. The highlight? cuTile - a new high-performance kernel library that ships exclusively with a Python interface. Not only is no C/C++ layer required -- it's not even available! What's driving this Python-first approach? In my estimation, it's the increasing complexity of properly programming Tensor Cores. As NVIDIA's hardware capabilities advance, the traditional CUDA programming model struggles to efficiently harness these powerful components - yet using them effectively is essential to justify the hardware investment. Even the CUTLASS project (NVIDIA's home for their highest-performance kernels) has completely reimagined its Python interface in version 4. This isn't a shallow wrapper - it's a comprehensive redesign that slashes compile times from minutes to mere seconds. The message is clear: NVIDIA recognizes that accessibility and developer experience are now as important as raw performance. Python's approachability opens GPU programming to a much wider audience while new abstractions help manage the growing complexity of modern GPU architectures. What do you think about this Python-first direction? Will it successfully democratize high-performance GPU programming? #CUDA #Python #GPU #NVIDIA #TensorCores #Engineering #DeveloperExperience
-
AI just delivered a computation breakthrough: Translating PyTorch to CUDA isn’t just a human problem anymore. Modern AI relies on GPU-optimized CUDA kernels, but handcrafting these requires rare expertise spanning algorithms, hardware, and memory hierarchies. This bottleneck now has a scalable solution: The AI CUDA Engineer. Sakana AI’s new framework uses Large Language Models (LLMs) to convert PyTorch operations into correct CUDA kernels and evolutionary optimization to iteratively maximize runtime efficiency. Key innovations: 1. Automatic translation (91% success rate) via error feedback loops 2. LLM-guided evolution combining model-generated variants with profiling data 3. Innovation Archive: a repository of 17K optimized kernels that seed future optimizations via RAG The reported results? A median 1.52x speedup over native PyTorch, with extreme gains like 54x faster diagonal matrix multiplications. Their system even translated and optimized full ResNet architectures into CUDA, achieving 1.44x speedups via fused shared-memory kernels. Why this matters: LLMs are moving beyond code generation to optimization, mastering hardware-specific constraints without human priors. With models writing code that beats torch.compile for 72% of PyTorch operations, democratizing GPU programming is no longer hypothetical. It's open for everyone: you can explore their open-sourced kernels or probe limitations 𝘳𝘪𝘨𝘩𝘵 𝘯𝘰𝘸. For industries like agriculture seeking location-specific AI, or anyone battling CUDA complexity, automating kernel engineering might just be the compute multiplier you need. For more on the AI CUDA Engineer and other AI highlights, check out this week's LLM Watch: https://lnkd.in/dfPZhpt6