The researchers at Google DeepMind just introduced "Matryoshka Quantization" (MatQuant), a clever new technique that could make deploying large language models much more efficient. The key insight? Rather than creating separate models for different quantization levels (int8, int4, int2), MatQuant leverages the nested "Matryoshka" structure naturally present in integer data types. Think of it like Russian nesting dolls: the int2 representation is nested within int4, which is nested within int8.

Here are the major innovations:

1. Single Model, Multiple Precisions
>> MatQuant trains one model that can operate at multiple precision levels (int8, int4, int2)
>> You can extract lower-precision models by simply taking the most significant bits
>> No need to maintain separate models for different deployment scenarios

2. Improved Low-Precision Performance
>> Int2 models extracted from MatQuant are up to 10% more accurate than standard int2 quantization
>> This is a huge breakthrough, since int2 quantization typically degrades model quality severely
>> The researchers achieved this through co-training and co-distillation across precision levels

3. Flexible Deployment
>> MatQuant enables "Mix'n'Match": using different precisions for different layers
>> You can interpolate to intermediate bit-widths like int3 and int6
>> This allows fine-grained control over the accuracy vs. efficiency trade-off

The results are impressive. When applied to the FFN parameters of Gemma-2 9B:
>> Int8 and int4 models perform on par with individually trained baselines
>> Int2 models show significant improvements (8%+ better on downstream tasks)
>> Remarkably, an int2 FFN-quantized Gemma-2 9B outperforms an int8 FFN-quantized Gemma-2 2B

This work represents a major step forward in model quantization, making it easier to deploy LLMs across different hardware constraints while maintaining high performance.
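The bit-slicing idea is simple enough to sketch in a few lines: given unsigned int8 quantization codes, the int4 and int2 codes are just the top bits. A minimal NumPy illustration (the function name and framing are mine, not from the paper):

```python
import numpy as np

def slice_msbs(int8_codes: np.ndarray, target_bits: int) -> np.ndarray:
    """Keep only the `target_bits` most significant bits of unsigned
    int8 quantization codes (0..255), yielding a lower-precision code.
    This mirrors MatQuant's nested (Matryoshka) view of integer types."""
    shift = 8 - target_bits
    return (int8_codes >> shift).astype(np.uint8)

codes = np.array([0b11010110, 0b00101001], dtype=np.uint8)
int4_codes = slice_msbs(codes, 4)  # 0b1101, 0b0010 -> 13, 2
int2_codes = slice_msbs(codes, 2)  # 0b11,   0b00   -> 3, 0
```

Because the same stored bits serve every precision, no extra model copies are needed; the runtime just chooses how many bits to read.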
The ability to extract multiple precision levels from a single trained model is particularly valuable for real-world applications. Looking forward to seeing how this technique gets adopted by the community and what further improvements it enables in model deployment efficiency! I'm particularly fascinated by how they managed to improve int2 performance through the co-training approach. https://lnkd.in/g6mdmVjx
Quantization Techniques for Long Context LLMs
Explore top LinkedIn content from expert professionals.
Summary
Quantization techniques for long context large language models (LLMs) are methods used to shrink model sizes and speed up their processing by representing their data with fewer bits—think of it as simplifying the math without losing too much of the meaning. These innovations allow LLMs to handle longer inputs and run faster on more limited hardware, making them more accessible and practical for real-world tasks.
- Explore precision options: Choose between 8-bit, 4-bit, or even 2-bit quantization to strike a balance between memory savings and accuracy, with newer methods allowing a single model to support multiple bit-widths.
- Use adaptive techniques: Implement runtime approaches like per-token precision adaptation or dynamic halting to allocate resources only where needed, making inference more efficient for longer sequences.
- Consider specialized methods: Look into advanced solutions, such as vector post-training quantization or learned matrix rotations, to further reduce errors and maintain high performance in compressed LLMs.
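As a concrete anchor for the precision options above, here is a minimal affine (asymmetric) quantize/dequantize round trip in NumPy; the helper names are illustrative, not from any particular library:

```python
import numpy as np

def affine_quantize(x: np.ndarray, bits: int = 8):
    """Map float values onto integer codes 0 .. 2**bits - 1 using a
    scale and zero-point derived from the tensor's min/max range."""
    qmax = 2**bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.int32)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

x = np.linspace(-1.0, 1.0, 9)
q, s, z = affine_quantize(x, bits=8)
x_hat = dequantize(q, s, z)
# Reconstruction error is bounded by half a quantization step
assert np.max(np.abs(x - x_hat)) <= s / 2 + 1e-12
```

Dropping `bits` from 8 to 4 or 2 shrinks storage proportionally but widens the quantization step, which is exactly the memory/accuracy trade-off the bullet points describe.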
-
Exciting breakthrough in extreme low-bit quantization for Large Language Models! The good folks at Microsoft have developed VPTQ (Vector Post-Training Quantization), a novel approach to LLM compression. It reduces quantization perplexity over the SOTA at 2-bit by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, and 4.41-7.34 on LLaMA-3. On paper, this looks extremely interesting.

VPTQ (Vector Post-Training Quantization), GPTQ (Generative Pre-trained Transformer Quantization), and AWQ (Activation-Aware Weight Quantization) are all post-training quantization methods for large language models, but they differ in their approaches and performance characteristics. VPTQ uses Second-Order Optimization and Channel-Independent Second-Order Optimization to achieve extreme low-bit quantization (down to 2 bits) while maintaining competitive accuracy and inference speed. It outperforms GPTQ in accuracy and compression ratio, especially at very low bit-widths. GPTQ uses a one-shot weight quantization method based on approximate second-order information, achieving good results at 4 bits but struggling at lower precisions. AWQ, on the other hand, focuses on identifying and preserving critical weights during quantization, resulting in faster inference than GPTQ and sometimes better perplexity, though at the cost of slightly higher VRAM usage. Overall, VPTQ appears to offer the best balance of compression, accuracy, and speed, particularly for extreme low-bit scenarios.

Key Steps for Implementing Vector Post-Training Quantization (VPTQ) for Large Language Models:

1. Formulate the quantization problem:
- Use Second-Order Optimization to guide the quantization algorithm design.
- Employ Channel-Independent Second-Order Optimization for granular vector quantization.

2. Initialize centroids:
- Implement Hessian-Weighted Centroid Initialization.
- Solve it as a Weighted K-means Clustering problem.

3. Quantize the model weights:
- Iterate through each layer of the model.
- For each Linear operator:
  a. If outlier elimination is enabled, quantize outlier weights first.
  b. Initialize centroids for the remaining weights.
  c. Apply the VPTQ algorithm to quantize the weights.
  d. If residual quantization is enabled, quantize the residual error.

4. Implement Residual Vector Quantization (optional):
- Use multiple stages to further compress residual errors.
- Employ separate lookup tables for each stage.

5. Apply outlier elimination (optional).

6. Perform layer-wise fine-tuning:
- Fine-tune centroids and layer normalization parameters.
- Use a small calibration dataset (e.g., 128 samples from C4).

7. Optimize for inference:
- Implement efficient dequantization by reading centroids from codebooks.
- Fuse dequantization and matrix multiplication operations where possible.

VPTQ enables extreme compression of LLMs while maintaining remarkable accuracy, paving the way for more efficient deployment and inference of these powerful models.
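The centroid-initialization step (Hessian-weighted, solved as weighted k-means) can be sketched as below. This is a toy illustration of the idea, not the paper's implementation: the importance weights stand in for a diagonal Hessian estimate, and all names are mine.

```python
import numpy as np

def weighted_kmeans(vectors, importance, k, iters=25, seed=0):
    """Weighted k-means over weight sub-vectors: each vector pulls its
    centroid proportionally to its importance (e.g. a diagonal Hessian
    estimate), so error on sensitive weights is penalized more."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each vector to its nearest centroid
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # Importance-weighted centroid update
        for j in range(k):
            mask = assign == j
            if mask.any():
                w = importance[mask][:, None]
                centroids[j] = (w * vectors[mask]).sum(0) / w.sum()
    return centroids, assign

# Toy example: 2-D "weight sub-vectors" in two clusters, quantized to k=2 codes
vecs = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
cents, assign = weighted_kmeans(vecs, importance=np.ones(4), k=2)
quantized = cents[assign]  # codebook reconstruction of `vecs`
```

At inference, only `assign` (the codes) and `cents` (the codebook) are stored; dequantization is the lookup `cents[assign]`, which is why fusing that lookup with the matrix multiply is the natural optimization.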
-
We've just revamped the @Huggingface Quantization docs! 🥳 Understand concepts better & choose the right technique for your needs with these key updates:
- Explanations of quantization fundamentals (schemes, int4, FP8). https://lnkd.in/etQG9FQw
- New Selection Guide: choose the right technique (bnb, AWQ, GPTQ, HQQ, etc.) for your specific needs & hardware. https://lnkd.in/eRVyQsAW
- Benchmarks: accuracy & performance data for popular quantization methods on Llama 3.1 8B & 70B. https://lnkd.in/eqSNvsTa

What's quantization? It shrinks models (like Llama 3) & speeds up inference by using lower precision (int8, int4, FP8). Think smaller footprint, faster results! Our new concept guide covers key ideas like:
🔹 Affine vs Symmetric
🔹 int4 Packing
🔹 FP8 (E4M3 vs E5M2)
https://lnkd.in/etQG9FQw

🔥 Benchmarks! We tested popular methods (bitsandbytes, AWQ, GPTQ, HQQ, torchao, FP8 & more) on Llama 3.1 8B & 70B. Key takeaways:
- 8-bit: matches baseline accuracy, ~2x memory saving.
- 4-bit: great balance (~4x saving); AWQ/GPTQ often lead accuracy (need calibration), bnb/HQQ are easy on-the-fly.
- Sub-4-bit: max compression, but a bigger accuracy drop.
See the results: https://lnkd.in/eqSNvsTa

Which method for YOU? Our new "Selecting a Quantization Method" guide helps you decide! We compare:
- On-the-fly (easy): bitsandbytes, HQQ, torchao. No calibration needed.
- Calibration-based (high accuracy): AWQ, GPTQ. Need data, potentially better results.
- Fine-tuning: QLoRA via bitsandbytes is the standard.
- Specific formats: loading FP8/Sparse via compressed-tensors.
https://lnkd.in/eRVyQsAW
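One of the concepts the guide covers, int4 packing, is easy to demo: two 4-bit codes share one byte. A minimal sketch (low-nibble-first is just one convention; actual kernels may pack differently):

```python
import numpy as np

def pack_int4(codes: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit codes (0..15) into uint8:
    even indices go in the low nibble, odd indices in the high nibble."""
    codes = codes.astype(np.uint8)
    return (codes[0::2] & 0xF) | ((codes[1::2] & 0xF) << 4)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Invert pack_int4, restoring the original code order."""
    low = packed & 0xF
    high = packed >> 4
    return np.stack([low, high], axis=1).reshape(-1)

codes = np.array([1, 15, 7, 0], dtype=np.uint8)
packed = pack_int4(codes)  # 2 bytes instead of 4
assert np.array_equal(unpack_int4(packed), codes)
```

Packing is what turns a "4-bit" model into an actual 4x memory saving, since there is no native 4-bit storage type.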
-
𝐒𝐩𝐢𝐧𝐐𝐮𝐚𝐧𝐭: 𝐋𝐋𝐌 𝐐𝐮𝐚𝐧𝐭𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐰𝐢𝐭𝐡 𝐋𝐞𝐚𝐫𝐧𝐞𝐝 𝐑𝐨𝐭𝐚𝐭𝐢𝐨𝐧𝐬

Post-training quantization (PTQ) techniques greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures, and find that some rotations quantize far better than others, with up to a 13-point difference in downstream zero-shot reasoning performance. We propose 𝐒𝐩𝐢𝐧𝐐𝐮𝐚𝐧𝐭, which learns the rotation matrices with Cayley optimization. With 4-bit quantization of weights, activations, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points.

Paper: https://lnkd.in/gY2Umj-N

With Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Yuandong Tian, Tijmen Blankevoort
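The effect the authors exploit is easy to demonstrate: multiplying activations by an orthogonal matrix is lossless in full precision (de-rotating recovers the input) but spreads outlier energy across channels, shrinking the dynamic range an absmax quantizer must cover. A toy NumPy illustration with a random rotation (SpinQuant instead learns the rotation via Cayley optimization):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 64))
x[:, 3] *= 50.0  # one outlier channel, as often seen in LLM activations

# Random orthogonal rotation via QR decomposition of a Gaussian matrix
r, _ = np.linalg.qr(rng.normal(size=(64, 64)))
xr = x @ r

# Rotation is lossless in full precision: de-rotating recovers x
assert np.allclose(xr @ r.T, x)

def peak_to_rms(a):
    """Outlier-ness: ratio of the largest magnitude to the RMS value."""
    return np.abs(a).max() / np.sqrt((a ** 2).mean())

# After rotation the peak-to-RMS ratio drops sharply, so an absmax
# quantization grid wastes far fewer levels on a single outlier
assert peak_to_rms(xr) < peak_to_rms(x)
```

The paper's observation that rotation choice matters (up to 13 points) motivates learning `r` rather than sampling it at random.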
-
📝 Announcing QuickSilver, a runtime-only, token-level framework that accelerates LLM inference by exploiting semantic redundancy through halting, memory skipping, token fusion, and precision adaptation -- without retraining or architectural changes. 🔹 "𝐐𝐮𝐢𝐜𝐤𝐒𝐢𝐥𝐯𝐞𝐫 — 𝐒𝐩𝐞𝐞𝐝𝐢𝐧𝐠 𝐮𝐩 𝐋𝐋𝐌 𝐈𝐧𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐭𝐡𝐫𝐨𝐮𝐠𝐡 𝐃𝐲𝐧𝐚𝐦𝐢𝐜 𝐓𝐨𝐤𝐞𝐧 𝐇𝐚𝐥𝐭𝐢𝐧𝐠, 𝐊𝐕 𝐒𝐤𝐢𝐩𝐩𝐢𝐧𝐠, 𝐂𝐨𝐧𝐭𝐞𝐱𝐭𝐮𝐚𝐥 𝐓𝐨𝐤𝐞𝐧 𝐅𝐮𝐬𝐢𝐨𝐧, 𝐚𝐧𝐝 𝐀𝐝𝐚𝐩𝐭𝐢𝐯𝐞 𝐌𝐚𝐭𝐫𝐲𝐨𝐬𝐡𝐤𝐚 𝐐𝐮𝐚𝐧𝐭𝐢𝐳𝐚𝐭𝐢𝐨𝐧" 🔹 In collaboration with Manipal University Jaipur, Vellore Institute of Technology, National Institute of Technology Silchar, Harrisburg University of Science and Technology, Meta, Indian Institute of Science Education & Research (IISER), Kolkata, Birla Institute of Technology and Science, Pilani Goa. 🔹 Paper: https://lnkd.in/gpZQKMmP ➡️ 𝐊𝐞𝐲 𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬 𝐨𝐟 𝐐𝐮𝐢𝐜𝐤𝐒𝐢𝐥𝐯𝐞𝐫’𝐬 𝐑𝐮𝐧𝐭𝐢𝐦𝐞 𝐈𝐧𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐅𝐫𝐚𝐦𝐞𝐰𝐨𝐫𝐤: 🧠 𝑫𝒚𝒏𝒂𝒎𝒊𝒄 𝑻𝒐𝒌𝒆𝒏 𝑯𝒂𝒍𝒕𝒊𝒏𝒈 & 𝑲𝑽 𝑪𝒂𝒄𝒉𝒆 𝑺𝒌𝒊𝒑𝒑𝒊𝒏𝒈: Halts forward computation for converged tokens using L2 representational drift and suppresses attention KV cache updates, achieving fine-grained compute savings without architectural change. 🔗 𝑪𝒐𝒏𝒕𝒆𝒙𝒕𝒖𝒂𝒍 𝑻𝒐𝒌𝒆𝒏 𝑭𝒖𝒔𝒊𝒐𝒏: Merges semantically redundant tokens based on hidden state similarity, reducing sequence length dynamically while preserving syntax and semantics through proximity-constrained averaging. ⚙️ 𝑨𝒅𝒂𝒑𝒕𝒊𝒗𝒆 𝑴𝒂𝒕𝒓𝒚𝒐𝒔𝒉𝒌𝒂 𝑸𝒖𝒂𝒏𝒕𝒊𝒛𝒂𝒕𝒊𝒐𝒏: Allocates per-token bit-width (2/4/8-bit) based on entropy computed mid-network, scaling memory and compute to token uncertainty for efficient precision adaptation. ✍🏼 Authors: Danush Khanna, Aditya Kumar Guru, Srivarshinee S, Zidan Ahmed, Rubhav Bahirwani, Meetu Malhotra, Vinija Jain, Aman Chadha, Dr. Amitava Das, Kripabandhu Ghosh
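The halting and precision-adaptation rules above can be sketched roughly as follows. The thresholds, names, and the exact entropy-to-bit-width mapping are illustrative guesses at the mechanism, not the paper's actual settings:

```python
import numpy as np

def should_halt(h_prev: np.ndarray, h_curr: np.ndarray, tau: float = 0.05):
    """Halt tokens whose hidden states have converged: relative L2 drift
    between consecutive layers falls below tau. Halted tokens skip
    further forward computation and KV-cache updates."""
    drift = np.linalg.norm(h_curr - h_prev, axis=-1)
    drift /= np.linalg.norm(h_prev, axis=-1) + 1e-12
    return drift < tau

def pick_bits(probs: np.ndarray, low: float = 0.5, high: float = 2.0):
    """Allocate a per-token bit-width from mid-network predictive entropy:
    confident (low-entropy) tokens get 2 bits, uncertain ones up to 8."""
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return np.where(ent < low, 2, np.where(ent < high, 4, 8))

# Two tokens: the first has nearly stopped moving between layers
h_prev = np.array([[1.0, 0.0], [1.0, 0.0]])
h_curr = np.array([[1.001, 0.0], [2.0, 1.0]])
halted = should_halt(h_prev, h_curr)  # first converged, second not
```

A peaked distribution like `[0.99, 0.01]` maps to 2 bits, while a near-uniform one maps to 8, matching the idea of scaling precision with token uncertainty.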