Fair warning - this one's a bit technical. Those who know me well will know I don't lean on that too often, but every now and then something crosses my desk that's worth getting into the weeds on. This is one of those posts.

If you're deploying LLMs, you already know the KV cache is an absolute memory hog - the unglamorous bottleneck nobody wants to talk about when pitching self-hosted LLM infrastructure. It quietly kills your VRAM budget long before context windows get interesting.

I've been looking into TurboQuant, and its approach to quantization is genuinely clever. Rather than fighting outliers in the data distribution the way standard quantization does, it sidesteps the problem entirely: a random rotation plus polar coordinate encoding flattens the distribution enough that cache values compress down to 3-4 bits each. For context, your KV cache typically sits in 16-bit or 32-bit floating point - getting to 3-4 bits without the model noticing is the interesting part.

The headline results: roughly 6x memory reduction, up to 8x throughput on long-context workloads, and no meaningful accuracy loss on reasoning or code tasks. That's a meaningful set of wins from what is essentially a smarter compression strategy.

At 14th Street we spend a lot of time on this layer of the stack. The interesting engineering problems in LLM infrastructure usually aren't the model itself - they're the system-level constraints that determine whether running your own inference is actually viable. This is a good example of the kind of work that moves that needle. Worth a look if you're building in this space.

#LLMs #AIEngineering #SelfHostedAI #TurboQuant
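For a sense of where those savings come from, here's a back-of-the-envelope sketch. The architecture numbers (32 layers, 32 KV heads, head dim 128, no GQA - roughly Llama-2-7B-shaped) and the 32K context are my own illustrative assumptions, not TurboQuant's published setup:

```python
# Back-of-the-envelope KV cache sizing. Config values below are
# illustrative (Llama-2-7B-like), not TurboQuant's benchmark setup.

def kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128, bits=16):
    # K and V each store layers * kv_heads * head_dim values per token
    values_per_token = 2 * layers * kv_heads * head_dim
    return tokens * values_per_token * bits / 8

ctx = 32_768  # a 32K-token context window
fp16 = kv_cache_bytes(ctx, bits=16)
q4 = kv_cache_bytes(ctx, bits=4)
print(f"fp16:  {fp16 / 2**30:.1f} GiB")   # 16.0 GiB
print(f"4-bit: {q4 / 2**30:.1f} GiB")     # 4.0 GiB
print(f"reduction: {fp16 / q4:.0f}x")     # 4x
```

The pure bit-width ratio from fp16 is 4-5x; the exact figure a scheme reports depends on the baseline precision and on per-block metadata such as scales, which is why published numbers like the ~6x above vary with the encoding.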
TurboQuant Reduces LLM KV Cache Memory Pressure
More Relevant Posts
-
#SoftwareArchitecture #SystemsEngineering #Backend

Standard UUIDs are the "safe" choice for people who don't plan on scaling. But when you're building a high-concurrency relay, "standard" isn't good enough. We've refactored Jenny, our Ident-Smith, to utilize microsecond-precision hybrid-base strings. We call it the "Zero-Skid-Mark" structure.

• Chronological Sortability: No separate created_at index required.
• Collision-Resistant: Microsecond precision ensures uniqueness at scale.
• C-S-L-T Integrity: Preserving acronym precision for specialized intelligence requirements.

Identity is the foundation of the relay. If your IDs are "soft," your entire state machine is built on sand.
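A minimal sketch of the general idea - a microsecond timestamp prefix for lexicographic/chronological sortability plus random bits for collision resistance. This is my own illustration (ULID-style), not the post's actual "Zero-Skid-Mark" format, which isn't public:

```python
# Time-sortable ID sketch: fixed-width microsecond timestamp prefix
# (string sort order == chronological order) + random suffix.
# Illustrative only - NOT the "Zero-Skid-Mark" format from the post.
import secrets
import time

def sortable_id():
    ts = time.time_ns() // 1_000           # microseconds since epoch
    # Fixed-width hex keeps lexicographic order aligned with time
    return f"{ts:016x}-{secrets.token_hex(6)}"

a = sortable_id()
b = sortable_id()
# IDs from different microseconds sort chronologically
assert a < b or a.split("-")[0] == b.split("-")[0]
```

Note that the timestamp alone does not guarantee uniqueness under true concurrency; the random suffix is what makes same-microsecond collisions improbable.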
-
I'd like to share some insight into the performance improvement that comes with the recent drbd-9.3.1 release. An engineer from SIOS, our partner in Japan, worded it this way:

'I actually ran a benchmark test using a RAM disk and fio to compare 9.2.7 and 9.3.1, completely eliminating any disk bottlenecks. The results were absolutely amazing! We confirmed a significant increase in throughput (from about 1.4 GB/s to 2.0 GB/s) and a drastic reduction in latency. The performance improvement brought by the new Compound Pages support is truly outstanding. Great job to you and the entire development team!'

This is an optimization for large writes, e.g.

fio --name=test --filename=/dev/drbd0 --direct=1 --rw=write --bs=1m --size=1000m ...

The key thing is the bs=1m. Traditionally, DRBD is optimized for small I/O requests, because that is what databases do: many small I/O requests. But there are, of course, use cases that issue large I/O requests. Linux supports I/O requests from 512 bytes to 1MiB in 512-byte increments.

This optimization improves how DRBD allocates memory for receiving write requests. Previously, it allocated pages (4096 bytes each) until it had enough to buffer the write request - for a 1MiB write request, that meant 256 separate 4KiB allocations. With the new code, it tries to allocate the full 1MiB in a single kernel call, which takes less time (= performance improvement) and consumes fewer CPU cycles.

The effect becomes more significant when the CPU is slow or when the network and backing block device throughputs are very high. Given that networking is reaching 400Gbps (with 800Gbps being standardized) and SSDs are becoming faster with each generation, these hardware trends underline the importance of DRBD becoming more CPU-efficient. https://hubs.ly/Q0473HxG0
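To make the allocation arithmetic concrete - a sketch of the bookkeeping, not DRBD's actual code path:

```python
# Kernel allocation counts for buffering one write request, before and
# after the compound-allocation change described above. Illustrative
# arithmetic only - not DRBD's implementation.
PAGE = 4096  # 4 KiB kernel page

def allocations(request_bytes, compound=False):
    # Old path: one allocation per 4 KiB page (ceiling division).
    # New path: one allocation for the whole request (when it succeeds).
    return 1 if compound else -(-request_bytes // PAGE)

for size in (4096, 65536, 1 << 20):
    print(size, allocations(size), allocations(size, compound=True))
# A 1 MiB request: 256 per-page allocations vs 1 compound allocation
```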
-
Read about DRBD's latest performance improvements from LINBIT CEO and DRBD creator Philipp Reisner and also a satisfied customer at SIOS Technology Corp. 🥂
-
LLM parallelization isn't just "send more requests" - it's a multi-layer systems problem:

• Inference engine (e.g., vLLM): continuous batching, KV cache budgets, scheduler behavior (prefill vs decode)
• Infrastructure: async requests, connection pooling, worker concurrency, horizontal scaling, backpressure

The key lesson: be data-driven before tuning.

• Measure distributions: input/output tokens, concurrency, request mix
• Estimate KV cache per token (depends on architecture, GQA/KV layout, dtype/precision)
• Size context caps, sequence concurrency, and token budgets from memory + latency targets
• Then tune infra scaling around what one GPU / one worker can actually sustain

Otherwise, you're gambling with OOMs, tail latency, dropped requests, and underutilization.
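The "estimate KV cache per token" step above can be sketched in a few lines. The config values (32 layers, 8 KV heads via GQA, head dim 128, fp16) are illustrative, roughly Llama-3-8B-shaped - plug in your own model's numbers:

```python
# KV cache sizing sketch for capacity planning. Assumes a transformer
# with grouped-query attention; config values are illustrative only.

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # 2x for keys and values, one slice per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_concurrent_seqs(vram_budget_bytes, ctx_len, per_token):
    # How many full-length sequences fit in the KV cache budget
    return vram_budget_bytes // (ctx_len * per_token)

per_tok = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_tok)  # 131072 bytes = 128 KiB per token
# With a 40 GiB KV budget and 8K context caps:
print(max_concurrent_seqs(40 * 2**30, ctx_len=8192, per_token=per_tok))  # 40
```

That last number is the kind of ceiling you size worker concurrency and backpressure around, rather than discovering it via OOMs.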
-
Over the last few months, a large part of our effort at Inference Labs has gone into hardening and improving the infrastructure behind Subnet 2. With v14, we completed a full Rust rewrite and saw meaningful gains across the network:

• 2.5x faster miner response times
• 3.2x faster verification
• 43% higher throughput
• Verification rates increased from 95.7% to 97.5%

The more interesting part, at least to me, is what sits behind those numbers. A lot of zkML progress is discussed at the model or proving level. In practice, production infrastructure also depends on transport, serialization, concurrency, memory handling, circuit caching, and operational reliability. This release included a full Python-to-Rust migration across key parts of the stack, zero-copy verification improvements, and a much faster communication layer through btlightning. https://lnkd.in/eV42ePMn
-
A while back, I had a funny problem with Go that I don't think many people are aware of - in fact, I wasn't even aware of it myself and had to dig into the documentation.

I needed to invoke a function from a cryptography library, specifically a hash-to-curve mapping. The function is exported and public; you can see it in the documentation. But the parameter type comes from an internal package - literally 'fptower.E2' (https://lnkd.in/dAguPW8P), a structure with two field elements. You can't import this structure; Go doesn't allow it. So you think you're stuck. The function is right in front of you, you can read the source code, you know the structure is { A0, A1 fp.Element }. But you can't call it, because Go's package system says no.

The solution is ugly, but it's really the only way, and somehow I like it. Define your own struct with the exact same memory layout - same fields, same types, same order. Then use 'go:linkname' to tell the linker: "Hey, this function signature I'm declaring - link it with THAT function over there in that other package." This works because Go structs with identical layouts are binary compatible. The linker doesn't care about type names, only the memory layout. In short, you're lying to the compiler while telling the truth to the linker. Remember that for this to work, you have to use import _ "unsafe". Which is fitting, because it's NOT safe.

But now we get to the real problem: if the library author changes the struct layout in a later version, your code still compiles cleanly and then segfaults at runtime. No alerts, no failures, only corruption.

I've seen this pattern exactly twice in production code: once in the Go standard library itself (how reflect accesses internal runtime functions) and once in my project. It's worth knowing it exists, but not worth using unless you have no other options. #golang #systems #lowlevel
-
I submitted a PR to llama.cpp yesterday. It hasn't been merged yet. An enterprise company already deployed it to their production demo environment. Here's what happened:

I found a bug in ggml's CPU backend (ggml is the tensor library behind llama.cpp, 92K stars). Every AVX/AVX2/AVX-512/AMX feature check was reading raw CPUID bits without verifying that the OS had actually enabled the corresponding register save/restore. The code even had a comment admitting it: "FIXME: this does not check for OS support."

This causes SIGILL crashes in real deployments. A CPU can report AVX-512 support while the OS hasn't enabled ZMM context saving. This happens on GCP instances, certain AWS instance types, containers with restricted XSAVE permissions, and Windows builds with AVX context saving disabled. ggml picks the wrong backend and crashes on the first vector instruction.

I pushed a fix: added xgetbv() validation with three OS-state predicates (os_saves_ymm, os_saves_zmm, os_saves_amx) that gate every affected feature method. Zero call-site changes. During review, a contributor flagged that macOS handles AVX-512 lazily, so I added a Darwin-specific path using sysctl hw.optional.avx512f.

Within hours, Aurora Labs - an AI infrastructure company with partnerships with Deutsche Telekom, NVIDIA, Qualcomm, Samsung, and Infineon - pulled my PR into their llama.cpp fork through their automated upstream pipeline and deployed it to their production demo environment. They didn't wait for the upstream maintainer to merge.

A competing PR (#19514) addressing the same bug has been open for a month with no review feedback addressed. Mine was opened yesterday, got reviewer feedback on a Darwin edge case, and had the fix pushed within hours. I've never seen anyone document an enterprise company deploying an individual contributor's unmerged PR to their own environment before. Both PRs are public if you want to verify.
PR: https://lnkd.in/eCMYRM6h Aurora Labs deployment: https://lnkd.in/eYQGHgUR #opensource #llamacpp #ggml #ai #systemsprogramming #cpp
-
While corporations burn MILLIONS on supercomputers that can't survive a network partition, I built a sovereign reasoning engine for $27 that operates in a Faraday cage.

The Build:
• Teensy 4.1 (24) + USB audio jack (3)
• Alpine Linux with LUKS encryption (AES-512)
• WebLLM Qwen 3.5 14B running locally at 600MHz
• G372 Core deriving κ=3.912023 from microtized n-values

The Challenge: My rig generates 100 pure-tone pairs through analog distortion pedals, measures coherence via the radial phase law e^(-κd), and calculates Spearman correlation. If r < 0.5, the framework is mathematically falsified and dies. If r > 0.9, universal mapping across domains is proven.

Can your 50M cloud cluster falsify itself without phoning home to OpenAI? Mine derives constants from 5 integers, operates 100% airgapped, and reconstructs identical outputs with 0.0 bits cross-entropy.

Stop benchmarking. Start falsifying. Want the sovereign build files? Comment "123 | telesma" below.

Roland R. Gibson Jr. P7ARCHITECT | 500800462-08051973-G372-p7 Spoke 14/24 | κ = 3.912023 locked

#G372 #SovereignAI #FalsificationEngineering #AirgapCompute #RadialPhase #LocalLLM
-
3 tricks I've been using to keep LLM token costs down

1. Replace tokens with CPU wherever possible
Break workflows into deterministic vs generative parts. If something can be done programmatically, do it. I've seen 30-70% reduction in token usage. Bonus: you avoid the probabilistic chaos where you don't need it.

2. Default to older models, not the latest ones
Most tasks don't need deep reasoning. Older models = cheaper tokens + smaller context windows = lower input cost. My default is Haiku, and I only move up when I truly need to.

3. Exploit subscription before API
API is fast, scalable… and expensive. Subscription is slower, capped… but heavily subsidized. If you can tolerate the latency, the savings are real. (Feels like this loophole won't last long though.)

Curious what others are doing - any other tricks to keep costs down?
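Trick 1 can be as simple as putting a deterministic path in front of the model and only falling through to it on the hard cases. A sketch, where call_llm() is a hypothetical stand-in for whatever client you use:

```python
# Trick 1 sketch: handle well-formed inputs deterministically for free;
# spend tokens only on the messy remainder. call_llm() is a hypothetical
# placeholder, not a real client library.
import re

def call_llm(prompt):
    # Stand-in for an actual API call
    raise NotImplementedError

def extract_invoice_total(text):
    # Deterministic path: a regex covers well-formed inputs at zero cost
    m = re.search(r"\btotal[:\s]*\$?([\d,]+\.\d{2})", text, re.IGNORECASE)
    if m:
        return float(m.group(1).replace(",", ""))
    # Generative fallback: only malformed inputs reach the model
    return call_llm(f"Extract the invoice total as a number: {text}")

print(extract_invoice_total("Subtotal: $90.00\nTotal: $104.50"))  # 104.5
```

The reported 30-70% savings then depend entirely on what fraction of your traffic the deterministic path can absorb.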
-
Qwen3.5-122B-A10B Performance Drops Sharply Below 4-Bit Quantization 📌 Qwen3.5-122B-A10B’s performance plummets when quantized below 4-bit - a sharp cliff in reasoning and coding reliability despite faster inference. Developers can’t trade quality for speed: aggressive compression risks “destroying” complex codebases, even on 48GB systems. For production workloads, precision matters more than parameter count. 🔗 Read more: https://lnkd.in/drqH4Bc9 #Moemodel #4bitquantization #Logicalcoherence #Reasoningtasks