Achieved 252–291 GiB/s HBM memory throughput on H100s, boosting LLM inference efficiency.

This title was summarized by AI from the post below.

8mo

Sustained ~252–291 GiB/s of HBM memory throughput on H100s under decode load, essentially hitting the hardware roofline. This matters because HBM throughput, not FLOPs, is the real bottleneck in LLM inference. By keeping memory nearly fully saturated, I’ve unlocked far higher efficiency and throughput than standard engines. The result: 0.9–1.36M tokens/sec with ~0.1 ms first-token latency.

#AI #LLM #GPURouter #H100 #Inference #CostEfficiency #Innovation #DeepLearning #AIInfrastructure #HighPerformanceComputing

NVIDIA AMD OpenAI Google Shilpa Kolhatkar Keith Strier a16z speedrun
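For readers who want to sanity-check numbers like these, here is a minimal sketch of the usual memory-bound arithmetic for decode. It is not the engine behind the post. The function names are hypothetical, and the peak bandwidth, bytes streamed per decode step, and batch size are inputs you have to supply yourself (the peak should come from the datasheet for your specific H100 variant).

```python
GIB = 2**30  # bytes in one GiB


def gib_per_s_to_gb_per_s(gib_per_s: float) -> float:
    """Convert GiB/s (binary) to GB/s (decimal), the unit datasheets usually quote."""
    return gib_per_s * GIB / 1e9


def bandwidth_utilization(measured_gib_per_s: float, peak_gb_per_s: float) -> float:
    """Fraction of peak HBM bandwidth in use.

    peak_gb_per_s is an assumption supplied by the caller (datasheet value).
    """
    return gib_per_s_to_gb_per_s(measured_gib_per_s) / peak_gb_per_s


def decode_tokens_per_s(bandwidth_gib_per_s: float,
                        bytes_per_step: float,
                        batch_size: int) -> float:
    """Memory-bound upper bound on decode throughput.

    Assumes each decode step streams bytes_per_step (weights plus KV cache)
    from HBM once and emits one token per sequence in the batch.
    """
    steps_per_s = bandwidth_gib_per_s * GIB / bytes_per_step
    return steps_per_s * batch_size
```

Plugging in a measured bandwidth, the bytes your model reads per step, and your batch size gives a ceiling to compare against the observed tokens/sec.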


More Relevant Posts

Johan Karlsson
8mo
Groq (not Grok from xAI) is a really interesting company. They’re building deterministic AI chips focused purely on inference. Instead of relying on complex, unpredictable scheduling like GPUs, their compiler plans everything ahead of time. The result is super low latency and consistent performance: every token shows up right on time.

Even though their current chips are built on an older 14nm process, they still deliver impressive speeds. That’s because of their streaming “assembly line” design and heavy use of on-chip SRAM, which avoids the delays of external memory. The chips aren’t the most dense or efficient yet, but the deterministic nature and architecture let them punch above their weight.

Groq isn’t trying to beat Nvidia at its own game. They’re taking a different angle: betting that predictable, low-latency inference hardware will matter more than raw training power. It’s a bold approach, and it makes them stand out in a field where most players are just building faster GPUs.
The Success Digest

1,331 followers
7mo
AMD stock is up 15% premarket to $185 on a $20B OpenAI GPU deal: 6 GW of Instinct chips with a 10% stake option! This AMD OpenAI partnership challenges Nvidia's throne in AI compute. From a tech journalist's view, it's the dawn of a duopoly: AMD's efficiency edge could reshape a $100B market. Dive into the AMD stock surge of October 2025. Read more: https://lnkd.in/gzXaG7Tm

#AMDStock #AMDOpenAI #AIChips #Semiconductors #TechNews #StockRally
TensorWave

12,430 followers
7mo Edited
The tide just turned in AI infrastructure. 🌊 OpenAI is partnering with AMD in a multi-billion-dollar deal to deploy 6 GW of Instinct GPUs, starting with the MI450. For years, the world’s biggest AI labs had one option. Now, they have real choice. This isn’t just diversification; it’s validation. AMD is officially powering the future of intelligence.

At TensorWave, we’ve been all-in on AMD from the start. If you want access to MI355X series GPUs or want to reserve MI400X series GPUs, contact us today. Read our full breakdown here: https://lnkd.in/gh5fXHFg
QCT

13,915 followers
8mo
Discover the unparalleled compute performance and networking speed delivered by #QuantaGrid D75U-1U, QCT's compute tray for the NVIDIA GB300 NVL72. It can be scaled up to 72 GPUs to create an AI cluster for #AIreasoning, #agenticAI, and video inference applications.
Rob Tiffany
7mo Edited
OpenAI commits to scale their #AI with up to 6 gigawatts of AMD MI450 GPUs to keep pace with demand.
Advantech India

2,768 followers
8mo
Experience Edge AI Like Never Before! Meet the Advantech AIR-420, powered by AMD Ryzen 7000/9000 Series processors, built to bring AI inference and LLM fine-tuning right to the edge. Compact, powerful, and deployment-ready, it’s changing the way industries unlock AI performance.

Why it stands out:
- Dual high-performance GPU support for heavy AI workloads
- Space-saving 28.6L chassis, perfect for constrained environments
- No-code GenAI Studio for effortless LLM fine-tuning & inference at the edge

Dive into the video to see the entire AIR-420 in action: https://lnkd.in/gJ2n7mvt

#EdgeAI #GenAI #Advantech #LLM #VLM
BVM Ltd

1,665 followers
8mo
AAEON MAXER-5100 – High-Performance AI Inference Server for Edge Computing

The AAEON MAXER-5100 is an advanced AI inference server built for organisations that need high-speed data processing and reliable machine learning performance at the edge. Supporting 12th, 13th, and 14th Gen Intel Core processors on the LGA1700 socket, this system combines cutting-edge CPU technology with the power of NVIDIA RTX 2000 Ada GPUs for exceptional AI acceleration.

Designed for industrial, research, and commercial applications, the MAXER-5100 delivers uncompromising performance in AI modelling, deep learning, and real-time analytics, all while maintaining industrial-grade reliability. https://buff.ly/kwRRONK

#EmbeddedSystems #IndustrialComputing #OEMDesign #SystemIntegration #UKManufacturing #CustomHardware #BVM
Vitaly Igonin
7mo
🚀 Compact. Quiet. Powerful. Introducing the Advantech AIR-410, a high-performance Edge AI HPC built for VLM/LLM inference and semiconductor inspection.

⚡ Powered by AMD Ryzen™ Embedded 8000 Series Processors
⚡ Supports one 3-Slot GPU for high-end edge inference
⚡ Ultra-quiet, space-saving design (<38.2 dBA)

With Advantech’s Edge AI SDK & no-code tools, the AIR-410 makes secure, efficient AI at the edge a reality. Learn more: https://shr.bi/tBP0DOfW

#EdgeAI #LLM #VLM #VisionAI #Advantech
Ujwal A Krishna
8mo
Recomputing KV Cache for long prompts or repeated inputs adds unnecessary latency and consumes valuable GPU resources. The latest NVIDIA Dynamo release addresses this by offloading KV Cache to CPU RAM, SSDs, or even remote storage. This innovation reduces redundant computation and accelerates response times.

Storage providers Vast Data and WEKA have already validated this approach, showing that KV Cache offload can be done efficiently at scale. In addition, the open source LMCache project has integrated its management layer into Dynamo, further enhancing caching capabilities for large language models. Read more: https://lnkd.in/gjmDaPcq

#AI #MachineLearning #DeepLearning #GPU #NVIDIA #GenerativeAI #OpenSource #Innovation #DataInfrastructure #KVCache
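To make the offload idea concrete, here is a toy two-tier cache in Python. It is only a sketch of the general pattern (evict to a slower tier instead of discarding, so a later hit avoids recomputation). It is not Dynamo's or LMCache's API: the class name, method names, and the use of plain dictionaries to stand in for GPU memory and CPU RAM/SSD are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class TieredKVCache:
    """Hypothetical two-tier KV cache keyed by prompt prefix."""

    def __init__(self, hot_capacity: int) -> None:
        self.hot_capacity = hot_capacity
        self.hot: "OrderedDict[Hashable, Any]" = OrderedDict()  # stands in for GPU memory
        self.cold: dict = {}  # stands in for CPU RAM, SSD, or remote storage
        self.recomputes = 0   # how many times the KV entries had to be rebuilt

    def get(self, prefix_key: Hashable, compute: Callable[[], Any]) -> Any:
        if prefix_key in self.hot:
            self.hot.move_to_end(prefix_key)  # mark as most recently used
            return self.hot[prefix_key]
        if prefix_key in self.cold:
            value = self.cold.pop(prefix_key)  # reload instead of recomputing
        else:
            self.recomputes += 1
            value = compute()
        self._put_hot(prefix_key, value)
        return value

    def _put_hot(self, key: Hashable, value: Any) -> None:
        self.hot[key] = value
        self.hot.move_to_end(key)
        # Offload least recently used entries to the cold tier rather than dropping them.
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.cold[old_key] = old_value
```

In a real system the cold-tier reload has its own transfer cost, so offload only pays off when moving the entries back is cheaper than recomputing them.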
Sara Raimondi
8mo
See Edge AI in Action! Discover how Advantech AIR-420, powered by AMD Ryzen™ 7000/9000 Series processors, delivers next-level AI inference and LLM fine-tuning at the edge. Compact, powerful, and built for real-world deployment, this edge AI HPC redefines performance.

✨ Key Highlights:
🔹 Supports dual high-performance GPUs for demanding AI workloads
🔹 Compact 28.6L chassis for space-constrained environments
🔹 Integrated with no-code GenAI Studio for seamless LLM fine-tuning and inference at the edge

🎥 Watch the video to explore its features: https://shr.bi/3M0HM6Zp

#EdgeAI #GenAI #Advantech #LLM #VLM
