The researchers at Google DeepMind just introduced "Matryoshka Quantization" (MatQuant), a clever new technique that could make deploying large language models much more efficient.

The key insight? Rather than creating separate models for different quantization levels (int8, int4, int2), MatQuant leverages the nested "Matryoshka" structure naturally present in integer data types. Think of it like Russian nesting dolls: the int2 representation is nested within int4, which is nested within int8.

Here are the major innovations:

1. Single Model, Multiple Precisions
>> MatQuant trains one model that can operate at multiple precision levels (int8, int4, int2)
>> You can extract lower-precision models by simply slicing the most significant bits
>> No need to maintain separate models for different deployment scenarios

2. Improved Low-Precision Performance
>> Int2 models extracted from MatQuant are up to 10% more accurate than standard int2 quantization
>> This is a huge breakthrough, since int2 quantization typically severely degrades model quality
>> The researchers achieved this through co-training and co-distillation across precision levels

3. Flexible Deployment
>> MatQuant enables "Mix'n'Match": using different precisions for different layers
>> You can interpolate to intermediate bit-widths like int3 and int6
>> This allows fine-grained control over the accuracy vs. efficiency trade-off

The results are impressive. When applied to the FFN parameters of Gemma-2 9B:
>> Int8 and int4 models perform on par with individually trained baselines
>> Int2 models show significant improvements (8%+ better on downstream tasks)
>> Remarkably, an int2 FFN-quantized Gemma-2 9B outperforms an int8 FFN-quantized Gemma-2 2B

This work represents a major step forward in model quantization, making it easier to deploy LLMs across different hardware constraints while maintaining high performance.
The ability to extract multiple precision levels from a single trained model is particularly valuable for real-world applications. Looking forward to seeing how this technique gets adopted by the community and what further improvements it enables in model deployment efficiency! Let me know if you'd like me to elaborate on any aspect of the paper. I'm particularly fascinated by how they managed to improve int2 performance through the co-training approach. https://lnkd.in/g6mdmVjx
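The bit-slicing idea above is simple enough to sketch in a few lines. Here is a minimal numpy illustration (my own toy example, not the authors' code), assuming unsigned 8-bit weights: keeping only the most significant bits of each int8 weight yields its nested int4 or int2 counterpart.

```python
import numpy as np

def slice_msb(weights_int8: np.ndarray, target_bits: int) -> np.ndarray:
    """Keep the top `target_bits` of each unsigned 8-bit weight."""
    assert 1 <= target_bits <= 8
    return weights_int8 >> (8 - target_bits)

w8 = np.array([0b11010110, 0b00101001], dtype=np.uint8)
w4 = slice_msb(w8, 4)   # top 4 bits: 0b1101, 0b0010
w2 = slice_msb(w8, 2)   # top 2 bits: 0b11,   0b00
```

Of course, the paper's contribution is not the slicing itself (which is trivial) but the co-training that makes the sliced low-bit models accurate.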
How Quantization is Transforming Model Performance
Explore top LinkedIn content from expert professionals.
Summary
Quantization is a method that compresses machine learning models by representing their data with fewer bits, making them faster and more suitable for deployment on limited hardware without sacrificing too much performance. New approaches are transforming large language models by making them smaller, easier to deploy, and often nearly as accurate as their full-size versions.
- Understand trade-offs: Choose lower-bit quantization only when memory resources are limited and a small drop in model accuracy is acceptable for your application.
- Match hardware support: Always verify that your target hardware can efficiently handle the precision and quantization method you select to avoid unexpected slowdowns.
- Use smart techniques: Consider modern quantization strategies that let a single model serve multiple precision levels, saving storage and simplifying deployment across different platforms.
-
I watched a senior engineer spend three weeks quantizing an LLM to 4-bit. The P99 latency got worse.

The issue wasn't the technique; it was treating quantization as a storage problem instead of a memory-bandwidth problem.

At Twitter, I spent a month debugging why our "optimized" models ran slower than the originals. The models were smaller. The math was correct. Yet latency regressed. The missing piece: the *unpacking tax*.

Here's the reality most benchmarks hide:

Time ≈ Total bytes moved / Memory bandwidth

On paper, moving from FP16 (16-bit) to INT4 (4-bit) means 4× less data moving across the memory bus per token. In a memory-bound regime, that translates to 3–4× higher throughput.

But there's a catch. GPUs don't compute in 4-bit or 8-bit. Those weights are dequantized back to FP16/BF16 in the local cache before computation. That dequantization costs clock cycles and creates production surprises:

→ High batch sizes: time saved on memory movement dominates = throughput improves
→ Batch size of 1: unpacking overhead dominates = latency gets worse

Quantization is not a free win. It's a tradeoff. If you're choosing a method, align it with your deployment reality:

→ GPTQ: effective for static weights, but sensitive to outliers
→ AWQ: preserves critical weights at higher precision for better quality
→ GGUF: excellent for CPU/Metal inference, less relevant for H100/A100 clusters

This is Part 4 of a deep dive into inference optimization. Previous posts:
Memory Wall: https://lnkd.in/gdT26UTV
KV Cache: https://lnkd.in/gKkrqVzf
Paged Attention: https://lnkd.in/gX5JNZhn

Next up: I will break down the closest thing to "cheating physics" in ML: Speculative Decoding.

What's the most expensive quantization mistake you've seen in production: latency, quality, or operability?
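The rule of thumb above can be sanity-checked in a few lines. This sketch uses made-up illustrative numbers (a 7B-parameter model, ~1 TB/s of memory bandwidth) and deliberately ignores the unpacking tax the post warns about, so it gives the best-case, memory-bound decode time per token.

```python
def token_time_ms(n_params: float, bits_per_weight: int,
                  bandwidth_gbps: float) -> float:
    """Best-case time to stream all weights once per decoded token
    (memory-bound regime): bytes moved / memory bandwidth."""
    bytes_moved = n_params * bits_per_weight / 8
    return bytes_moved / (bandwidth_gbps * 1e9) * 1e3

# Assumed: 7B model, GPU with ~1000 GB/s of memory bandwidth.
fp16 = token_time_ms(7e9, 16, 1000)   # ~14 ms/token
int4 = token_time_ms(7e9, 4, 1000)    # ~3.5 ms/token, before the unpacking tax
```

The 4× ratio between the two numbers is exactly the "on paper" gain; the dequantization overhead is what eats into it at batch size 1.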
-
Groundbreaking Research Alert: 4-bit Quantization for RAG Systems

A fascinating new paper from San José State University introduces an innovative approach to optimizing Retrieval-Augmented Generation (RAG) systems through 4-bit quantization of vector embeddings.

>> Technical Deep Dive:

The research tackles a critical challenge in RAG systems: the massive memory requirements for storing high-dimensional embedding vectors. Current top-ranked models on MTEB typically use embedding dimensions between 512 and 4096, consuming substantial memory resources. Consider this: a standard DBpedia dataset with 1M entries and 1536 dimensions requires 6.1 GB of RAM just for embeddings.

The proposed solution? A sophisticated 4-bit quantization approach that:
- Reduces memory footprint by up to 87.5%
- Maintains search accuracy within 4% of original performance
- Implements group-wise quantization for enhanced precision
- Outperforms the HNSW algorithm in accuracy with group sizes ≤ 128

>> Under the Hood:

The system employs symmetric linear quantization with group-wise processing, where vectors are split into equal-sized groups with individual quantization scales. This approach significantly outperforms traditional Product Quantization methods, maintaining correlation coefficients above 0.82 across multiple semantic textual similarity datasets.

>> Impact:

This breakthrough enables RAG deployment in resource-constrained environments while maintaining high accuracy. The research demonstrates that intelligent quantization can dramatically reduce infrastructure costs without compromising performance.
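The group-wise symmetric scheme described above is easy to prototype. Below is a minimal numpy sketch (my own, not the paper's code); the symmetric int4 range of [-7, 7] and the group size of 128 are assumptions chosen to match the post's description.

```python
import numpy as np

def quantize_groupwise_4bit(vec: np.ndarray, group_size: int = 128):
    """Symmetric linear 4-bit quantization with one scale per group."""
    groups = vec.reshape(-1, group_size)
    # One scale per group, mapping the group's max magnitude to 7.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_groupwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
v = rng.standard_normal(1536).astype(np.float32)  # one 1536-dim embedding
q, s = quantize_groupwise_4bit(v)
v_hat = dequantize_groupwise(q, s)
corr = np.corrcoef(v, v_hat)[0, 1]  # reconstruction correlation
```

A production vector store would also pack two int4 codes per byte to realize the full 87.5% saving; here `q` is stored in int8 for clarity.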
-
You're in a GenAI Engineer interview at Goldman Sachs, and the interviewer asks: "We need to deploy a 70B parameter LLM for production trading signals. Should we use 4-bit or 8-bit quantization? Justify your choice."

Here's how you can answer:

A. Most candidates fumble here because they only know "quantization reduces model size." That's an incomplete answer.
B. There are 5 critical factors every GenAI engineer should understand cold.

𝟭. 𝗧𝗵𝗲 𝗣𝗿𝗲𝗰𝗶𝘀𝗶𝗼𝗻-𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗧𝗿𝗮𝗱𝗲𝗼𝗳𝗳 - 𝗧𝗵𝗲 𝗳𝘂𝗻𝗱𝗮𝗺𝗲𝗻𝘁𝗮𝗹 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝗰𝗲

8-bit (INT8) maintains near-identical accuracy:
- Performance degradation < 1% on most tasks
- Uses affine (linear) quantization: Q = round(scale × W + zero_point)

4-bit (INT4/NF4) trades accuracy for efficiency:
- Performance degradation of 2-5%, depending on architecture
- Uses non-linear quantization (NormalFloat4) to preserve the weight distribution

The brutal truth? 8-bit is production-safe. 4-bit requires extensive validation.

𝟮. 𝗧𝗵𝗲 𝗠𝗲𝗺𝗼𝗿𝘆 𝗙𝗼𝗼𝘁𝗽𝗿𝗶𝗻𝘁 - 𝗪𝗵𝗲𝗿𝗲 𝟵𝟬% 𝗼𝗳 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝘀 𝗴𝗼 𝘄𝗿𝗼𝗻𝗴

Most people think "4-bit = 2x smaller than 8-bit" and stop at the weight math. Wrong move.

FP32: 70B model = 280GB
INT8: 70B model = 70GB (4x compression)
INT4: 70B model = 35GB (8x compression)

But here's the catch: you STILL need headroom for the KV cache and activations. A real-world 70B INT4 deployment needs 48-60GB minimum, not 35GB.

𝟯. 𝗧𝗵𝗲 𝗤𝘂𝗮𝗻𝘁𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗠𝗲𝘁𝗵𝗼𝗱 - 𝗧𝗵𝗲 𝗵𝗶𝗱𝗱𝗲𝗻 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗸𝗶𝗹𝗹𝗲𝗿

Here's what separates junior from senior GenAI engineers:

Post-Training Quantization (PTQ):
- Fast setup (hours)
- Works reliably for 8-bit
- 4-bit quality varies wildly

GPTQ/AWQ (advanced PTQ):
- Weight-only quantization with calibration
- Industry standard for 4-bit LLMs
- Requires a representative calibration dataset (CRITICAL)

QAT (Quantization-Aware Training):
- Expensive compute (days to weeks)
- Required for mission-critical 4-bit deployments

𝟰. 𝗧𝗵𝗲 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗦𝗽𝗲𝗲𝗱 𝗧𝗿𝗮𝗱𝗲𝗼𝗳𝗳 - 𝗳𝗮𝘀𝘁𝗲𝗿 𝘁𝗵𝗿𝗼𝘂𝗴𝗵𝗽𝘂𝘁, 𝗯𝘂𝘁 𝘄𝗵𝘆?

8-bit:
- Native GPU support (Tensor Cores)
- Blazing-fast matrix multiplication
- 1.5-2x throughput vs FP16

4-bit:
- Limited hardware support
- Requires dequantization to FP16 for computation
- Memory-bandwidth bound, NOT compute bound

The counterintuitive reality? 4-bit isn't always faster despite being smaller.

𝟱. 𝗧𝗵𝗲 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁 𝗥𝗲𝗮𝗹𝗶𝘁𝘆 - 𝗧𝗵𝗲 𝗰𝗼𝘀𝘁 𝗻𝗼𝗯𝗼𝗱𝘆 𝘁𝗮𝗹𝗸𝘀 𝗮𝗯𝗼𝘂𝘁

8-bit: an A100 80GB fits 70B comfortably; batch sizes of 8-16 supported
4-bit: an RTX 4090 24GB runs 70B only with aggressive offloading (barely); batch sizes of 1-4 maximum

𝗪𝗵𝗲𝗻 𝟴-𝗯𝗶𝘁 𝘄𝗶𝗻𝘀:
✅ Accuracy non-negotiable (finance, healthcare)
✅ Production reliability > cost optimization
✅ Batch inference workloads

𝗪𝗵𝗲𝗻 𝟰-𝗯𝗶𝘁 𝘄𝗶𝗻𝘀:
✅ Extreme memory constraints
✅ Cost optimization critical
✅ Acceptable 2-5% quality degradation
✅ Using GPTQ/AWQ

Packt is offering Christmas $9.99 deals: https://lnkd.in/gPNKdiXr
Sonia Chauhan
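The affine formula quoted under factor 1 above can be demonstrated directly. A toy numpy sketch follows; note that scale/zero-point conventions vary between libraries, and real implementations usually round the zero-point to an integer, which this sketch skips for brevity.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Affine int8 quantization: Q = round(scale * W + zero_point),
    mapping [w_min, w_max] onto the uint8 range [0, 255]."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = 255.0 / (w_max - w_min)
    zero_point = -w_min * scale
    q = np.clip(np.round(scale * w + zero_point), 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) / scale

rng = np.random.default_rng(1)
w = rng.standard_normal(4096).astype(np.float32)
q, s, z = quantize_int8(w)
# Rounding moves each value by at most 0.5 quantization step,
# so the reconstruction error is bounded by 0.5 / scale.
err = np.abs(dequantize_int8(q, s, z) - w).max()
```

This bounded, tiny per-weight error is why 8-bit typically loses under 1% accuracy, while 4-bit (16 levels instead of 256) needs the smarter NF4/GPTQ/AWQ machinery.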
-
Can we compress LLMs to 2 bits without destroying their intelligence?

The push for extreme quantization usually hits a wall around 2 bits. At that level, standard methods often see a massive drop in reasoning capabilities, making the models unusable.

A new paper, "Fairy2i", proposes an interesting solution: switching from real numbers to complex numbers. Instead of retraining complex-valued models from scratch (which is expensive), the authors developed a framework to transform existing pre-trained models (like LLaMA-2) into a "complex-valued" form.

Here is how it works:

1️⃣ The Complex Shift: They map the standard real-valued weights into a complex domain using a codebook of fourth roots of unity: {1, -1, i, -i}. This uses the 2-bit space more efficiently than the standard ternary {-1, 0, 1} approach common in extreme low-bit quantization.

2️⃣ No Retraining Required: They prove a mathematical equivalence that lets them start from a standard pre-trained checkpoint, so the knowledge of the foundation model is not lost.

3️⃣ Recursive Residuals: To fix the precision loss, they use a "recursive residual" strategy: quantize the main weights, compute the error (residual), and then quantize that error again. The final weight is just the sum of these simple terms.

The performance recovery is significant with this technique. A LLaMA-2 7B model compressed to an effective 2-bit precision achieved 62.00% average zero-shot accuracy, compared to the full-precision FP16 baseline of 64.72%. For context, this outperforms widely used methods like GPTQ (3-bit) and AQLM (2-bit) on perplexity metrics.

Because the weights are just {1, -1, i, -i}, inference becomes multiplication-free (mostly additions and swaps), which is a major win for efficiency on commodity hardware.
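The codebook-plus-residual pattern can be sketched loosely in numpy. To be clear, this is an illustration of the general recursive-residual idea described above, not Fairy2i's actual algorithm; the per-stage scale choice (mean magnitude of the residual) is my own invention for the demo.

```python
import numpy as np

CODEBOOK = np.array([1, -1, 1j, -1j])  # fourth roots of unity

def nearest_code(z: np.ndarray) -> np.ndarray:
    """Snap each complex entry to the nearest codebook element."""
    idx = np.abs(z[..., None] - CODEBOOK).argmin(axis=-1)
    return CODEBOOK[idx]

def recursive_residual_quantize(w: np.ndarray, n_stages: int = 2) -> np.ndarray:
    """Quantize, then quantize the leftover error; the final
    approximation is the sum of the per-stage terms."""
    approx = np.zeros_like(w)
    for _ in range(n_stages):
        residual = w - approx
        scale = np.abs(residual).mean()      # per-stage scale (assumed)
        approx = approx + scale * nearest_code(residual / scale)
    return approx

rng = np.random.default_rng(2)
w = rng.standard_normal(256) + 1j * rng.standard_normal(256)
w_hat = recursive_residual_quantize(w, n_stages=2)
```

Because every stage only ever multiplies by {1, -1, i, -i}, applying such weights reduces to sign flips and real/imaginary swaps, which is the "multiplication-free" property the post highlights.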
Limitation: While the math is solid, maximizing the speed benefits of complex-valued arithmetic might require specialized kernel implementations on current hardware to fully realize the theoretical latency gains. Kudos to the researchers from Peking University for the amazing work. #MachineLearning #LLM #Quantization #DataScience #AI
-
Google DeepMind Researchers Propose Matryoshka Quantization: A Technique to Enhance Deep Learning Efficiency by Optimizing Multi-Precision Models without Sacrificing Accuracy

Researchers at Google DeepMind introduced Matryoshka Quantization (MatQuant) to create a single model that functions across multiple precision levels. Unlike conventional methods that treat each bit-width separately, MatQuant optimizes a model for int8, int4, and int2 using a shared bit representation. This allows models to be deployed at different precisions without retraining, reducing computational and storage costs. MatQuant extracts lower-bit models from a high-bit model while preserving accuracy by leveraging the hierarchical structure of integer data types. Testing on Gemma-2 2B, Gemma-2 9B, and Mistral 7B models showed that MatQuant improves int2 accuracy by up to 10% over standard quantization techniques like QAT and OmniQuant.

Experimental evaluations of MatQuant demonstrate its ability to mitigate accuracy loss from quantization. Researchers tested the method on Transformer-based LLMs, focusing on quantizing Feed-Forward Network (FFN) parameters, a key factor in inference latency. Results show that MatQuant's int8 and int4 models achieve accuracy comparable to independently trained baselines while outperforming them at int2 precision. On the Gemma-2 9B model, MatQuant improved int2 accuracy by 8.01%, while the Mistral 7B model saw a 6.35% improvement over traditional quantization methods. The study also found that MatQuant's right-shifted quantized weight distribution enhances accuracy across all bit-widths, particularly benefiting lower-precision models. MatQuant also enables seamless bit-width interpolation and layer-wise Mix'n'Match configurations, allowing flexible deployment based on hardware constraints.

Read the full article: https://lnkd.in/gWTcqSCN
Paper: https://lnkd.in/ggAF-sjf

Google DeepMind Pranav Nair PURANJAY DATTA Jeff Dean Prateek Jain Aditya Kusupati
-
Day 20/30 of SLMs/LLMs: Making Models Smaller, Faster, and Cheaper

One of the biggest breakthroughs in scaling efficient AI has come not from new architectures but from better numerics. 𝐐𝐮𝐚𝐧𝐭𝐢𝐳𝐚𝐭𝐢𝐨𝐧 allows us to run large models in smaller, faster formats without retraining from scratch. It is one of the main reasons 7B and even 70B parameter models can now run on commodity hardware.

At its core, quantization reduces the precision of model weights and activations. Instead of storing parameters as 32-bit floating-point numbers, we use 16-bit floats or 8-bit and even 4-bit integers. This dramatically cuts memory usage and bandwidth, enabling faster inference with minimal quality loss.

8-bit quantization has become the default for deployment. Libraries such as bitsandbytes, 𝐓𝐞𝐧𝐬𝐨𝐫𝐑𝐓-𝐋𝐋𝐌, and DeepSpeed-Inference make it possible to load large models in 8-bit format with almost no drop in performance. The technique works by grouping weights, scaling them per group, and using efficient low-precision matrix-multiplication kernels.

4-bit quantization pushes compression even further. Frameworks like 𝐐𝐋𝐨𝐑𝐀 (Quantized Low-Rank Adaptation) demonstrated that 4-bit quantized models can still be fine-tuned effectively using LoRA adapters. QLoRA reduced memory usage by up to 75% while achieving results within 99% of full-precision baselines on benchmarks such as MMLU and GSM8K. This is why many recent open models (LLaMA 2, Mistral, Falcon) are distributed with 4-bit or GGUF quantized weights ready for local deployment.

Mixed precision is another key approach, where different layers or operations use different bit depths. Attention layers, for instance, might run in FP16 for stability while feed-forward layers run in INT8 for speed. This fine-grained control allows hardware like NVIDIA's A100 or Apple's M-series chips to balance throughput and accuracy.

The impact of quantization is dramatic.
A 13B model that requires 26 GB of VRAM at 16-bit precision can run comfortably in about 7 GB at 4-bit precision with less than 1% accuracy loss. This shift makes 𝐦𝐨𝐝𝐞𝐥 𝐝𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭 𝐨𝐧 𝐚 𝐬𝐢𝐧𝐠𝐥𝐞 𝐆𝐏𝐔 𝐨𝐫 𝐞𝐯𝐞𝐧 𝐚 𝐥𝐚𝐩𝐭𝐨𝐩 𝐫𝐞𝐚𝐥𝐢𝐬𝐭𝐢𝐜 𝐟𝐨𝐫 𝐝𝐞𝐯𝐞𝐥𝐨𝐩𝐞𝐫𝐬 𝐚𝐧𝐝 𝐫𝐞𝐬𝐞𝐚𝐫𝐜𝐡𝐞𝐫𝐬. What has your experience with quantization been like?
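The arithmetic behind that 13B example is worth seeing once, as a quick sanity check (weights only; the KV cache and activations add overhead on top, which is where the "about 7 GB" headroom goes).

```python
def weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = weight_gb(13e9, 16)  # 26.0 GB at 16-bit
int4 = weight_gb(13e9, 4)   # 6.5 GB at 4-bit
```

The same function reproduces the other figures commonly quoted in these posts, e.g. 70B at INT4 is 35 GB of weights.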
-
We've just revamped the @Huggingface Quantization docs! 🥳

Understand concepts better & choose the right technique for your needs with these key updates:
- Explanations of quantization fundamentals (schemes, int4, FP8): https://lnkd.in/etQG9FQw
- New selection guide: choose the right technique (bnb, AWQ, GPTQ, HQQ, etc.) for your specific needs & hardware: https://lnkd.in/eRVyQsAW
- Benchmarks: accuracy & performance data for popular quantization methods on Llama 3.1 8B & 70B: https://lnkd.in/eqSNvsTa

What's quantization? It shrinks models (like Llama 3) & speeds up inference by using lower precision (int8, int4, FP8). Think smaller footprint, faster results!

Our new concept guide covers key ideas like:
🔹 Affine vs symmetric quantization
🔹 int4 packing
🔹 FP8 (E4M3 vs E5M2)
https://lnkd.in/etQG9FQw

🔥 Benchmarks! We tested popular methods (bitsandbytes, AWQ, GPTQ, HQQ, torchao, FP8 & more) on Llama 3.1 8B & 70B. Key takeaways:
- 8-bit: matches baseline accuracy, ~2x memory saving.
- 4-bit: great balance (~4x saving); AWQ/GPTQ often lead on accuracy (need calibration), bnb/HQQ are easy on-the-fly.
- Sub-4-bit: max compression, but a bigger accuracy drop.
See the results: https://lnkd.in/eqSNvsTa

Which method is right for YOU? Our new "Selecting a Quantization Method" guide helps you decide. We compare:
- On-the-fly (easy): bitsandbytes, HQQ, torchao; no calibration needed.
- Calibration-based (high accuracy): AWQ, GPTQ; need data, potentially better results.
- Fine-tuning: QLoRA via bitsandbytes is the standard.
- Specific formats: loading FP8/sparse via compressed-tensors.
https://lnkd.in/eRVyQsAW
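The "int4 packing" concept from the guide fits in a few lines of numpy. This is a generic sketch of the idea (not Hugging Face's implementation): since no hardware type holds 4 bits, two int4 values share one byte, one in the low nibble and one in the high nibble.

```python
import numpy as np

def pack_int4(vals: np.ndarray) -> np.ndarray:
    """Pack pairs of values in [0, 15] into single uint8 bytes."""
    lo, hi = vals[0::2], vals[1::2]
    return (lo | (hi << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the original int4 values (low nibble first)."""
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=1).reshape(-1)

vals = np.array([3, 12, 0, 15], dtype=np.uint8)
packed = pack_int4(vals)        # 2 bytes instead of 4
restored = unpack_int4(packed)  # round-trips to the original values
```

Unpacking these nibbles at inference time is exactly the dequantization step that 4-bit kernels have to pay for.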
-
One of the challenges in a field full of excitement and hype is staying grounded and pursuing fundamental insights. This is what Jinendra Malekar and I tried to do in our recent paper, now available in 𝘛𝘳𝘢𝘯𝘴𝘢𝘤𝘵𝘪𝘰𝘯𝘴 𝘰𝘯 𝘔𝘢𝘤𝘩𝘪𝘯𝘦 𝘓𝘦𝘢𝘳𝘯𝘪𝘯𝘨 𝘙𝘦𝘴𝘦𝘢𝘳𝘤𝘩 (𝘛𝘔𝘓𝘙): "𝗔𝗺𝗱𝗮𝗵𝗹'𝘀 𝗟𝗮𝘄 𝗳𝗼𝗿 𝗟𝗟𝗠𝘀: 𝗔 𝗧𝗵𝗿𝗼𝘂𝗴𝗵𝗽𝘂𝘁-𝗖𝗲𝗻𝘁𝗿𝗶𝗰 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀 𝗼𝗳 𝗘𝘅𝘁𝗿𝗲𝗺𝗲 𝗟𝗟𝗠 𝗤𝘂𝗮𝗻𝘁𝗶𝘇𝗮𝘁𝗶𝗼𝗻."

Link to the paper: https://lnkd.in/eTPCewfQ

🔍 𝗧𝗵𝗲 𝗸𝗲𝘆 𝗰𝗼𝗻𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻𝘀:
• A 𝘁𝗵𝗿𝗼𝘂𝗴𝗵𝗽𝘂𝘁-𝗰𝗲𝗻𝘁𝗿𝗶𝗰 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀 of mixed-precision LLMs, where projection layers are aggressively quantized (<𝟰-𝗯𝗶𝘁) while attention heads remain at higher precision (𝗜𝗡𝗧𝟴/𝗙𝗣𝟭𝟲) to preserve accuracy.
• An adaptation of 𝗔𝗺𝗱𝗮𝗵𝗹'𝘀 𝗟𝗮𝘄 for LLMs, providing a quantitative framework to reason about throughput ceilings under extreme quantization.
• Extensive experiments across diverse LLM architectures (𝗚𝗣𝗧, 𝗢𝗣𝗧, 𝗟𝗟𝗮𝗠𝗔) and hardware backends (𝗘𝗱𝗴𝗲𝗧𝗣𝗨, 𝗖𝗹𝗼𝘂𝗱𝗧𝗣𝗨, and 𝗚𝗣𝗨).

💡 Our findings show that while extreme quantization can significantly boost LLM throughput, the gains are ultimately limited by the most constrained parts of the model, which depend heavily on both 𝗺𝗼𝗱𝗲𝗹 𝗵𝘆𝗽𝗲𝗿𝗽𝗮𝗿𝗮𝗺𝗲𝘁𝗲𝗿𝘀 (e.g., context length, embedding dimensions) and 𝗵𝗮𝗿𝗱𝘄𝗮𝗿𝗲 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲. In other words, 𝙩𝙝𝙚𝙧𝙚'𝙨 𝙣𝙤 𝙤𝙣𝙚-𝙨𝙞𝙯𝙚-𝙛𝙞𝙩𝙨-𝙖𝙡𝙡 𝙨𝙤𝙡𝙪𝙩𝙞𝙤𝙣! This can provide a roadmap for designing more holistic quantization strategies that push LLM performance further.

We'd love to hear your thoughts on where extreme LLM quantization should head next: 𝘐𝘴 𝘵𝘩𝘦 𝘣𝘪𝘨𝘨𝘦𝘳 𝘱𝘳𝘪𝘰𝘳𝘪𝘵𝘺 𝘥𝘦𝘷𝘦𝘭𝘰𝘱𝘪𝘯𝘨 𝙘𝙪𝙨𝙩𝙤𝙢 𝙝𝙖𝙧𝙙𝙬𝙖𝙧𝙚 𝘵𝘰 𝘧𝘶𝘭𝘭𝘺 𝘤𝘢𝘱𝘪𝘵𝘢𝘭𝘪𝘻𝘦 𝘰𝘯 𝘪𝘵, 𝘰𝘳 𝘢𝘥𝘷𝘢𝘯𝘤𝘪𝘯𝘨 𝙖𝙡𝙜𝙤𝙧𝙞𝙩𝙝𝙢𝙞𝙘 𝙞𝙣𝙣𝙤𝙫𝙖𝙩𝙞𝙤𝙣𝙨 𝘵𝘰 𝘮𝘢𝘬𝘦 𝘢𝘵𝘵𝘦𝘯𝘵𝘪𝘰𝘯 𝘤𝘰𝘮𝘱𝘶𝘵𝘢𝘵𝘪𝘰𝘯 𝘮𝘰𝘳𝘦 𝘦𝘧𝘧𝘪𝘤𝘪𝘦𝘯𝘵?

#LLM #Quantization #EfficientAI
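For readers who want the classic formula behind the post's framing: if quantization speeds up a fraction p of runtime by a factor s, Amdahl's Law bounds the end-to-end gain at 1/(1-p) no matter how large s becomes. A tiny sketch with made-up numbers (the paper's LLM-specific adaptation is more detailed than this):

```python
def amdahl_speedup(p: float, s: float) -> float:
    """End-to-end speedup when a fraction p of runtime is accelerated
    by a factor s, per Amdahl's Law: 1 / ((1 - p) + p / s)."""
    return 1.0 / ((1.0 - p) + p / s)

# Assume projection layers take 70% of runtime and extreme quantization
# makes them 8x faster (illustrative numbers, not from the paper):
gain = amdahl_speedup(0.7, 8.0)        # ~2.58x end-to-end, not 8x
ceiling = amdahl_speedup(0.7, 1e12)    # approaches 1/(1-p) ≈ 3.33x
```

The gap between the 8x kernel speedup and the ~2.6x end-to-end gain is exactly the "most constrained parts of the model" effect the paper quantifies.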