Day 19/30 of SLMs/LLMs: Mixture-of-Experts, Efficient Transformers, and Sparse Models

As language models grow larger, two challenges dominate: cost and efficiency. Bigger models bring higher accuracy but also higher latency, energy use, and deployment complexity. The next phase of progress is about making models faster, lighter, and more intelligent per parameter.

A leading direction is the Mixture-of-Experts (MoE) architecture. Instead of activating every parameter for each input, MoE models route tokens through a few specialized “experts.” Google’s Switch Transformer and GLaM demonstrated that activating only 5 to 10 percent of weights can match the accuracy of dense models at a fraction of the compute. Open models like Mixtral 8x7B extend this idea, using eight experts per layer but activating only two for each forward pass. The result is performance comparable to a 70B dense model at roughly the compute cost of a 13B one.

Another active area of innovation is Efficient Transformers. Traditional attention scales quadratically with sequence length, which limits how much context a model can process. Variants such as FlashAttention, Longformer, and Performer, along with attention alternatives like Mamba’s state-space approach, improve memory efficiency and speed. FlashAttention in particular accelerates attention by tiling the computation so it stays in fast on-chip GPU memory (SRAM) rather than repeatedly reading and writing the full attention matrix to slower HBM, delivering two to four times higher throughput on long sequences.

Sparse Models also contribute to efficiency by reducing the number of active parameters during training or inference. Structured sparsity, combined with quantization and pruning, allows models to run on smaller devices without a major loss in quality. Advances in sparsity-aware optimizers now make it possible to deploy billion-parameter models on standard hardware with near state-of-the-art accuracy.

These techniques share a single goal: scaling intelligence without scaling cost. The focus is shifting from building larger networks to building smarter ones.
A 7B model that uses retrieval, sparse activation, and efficient attention can outperform a much larger dense model in both speed and reliability.
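As a rough illustration of the top-2-of-8 routing described above, here is a minimal numpy sketch. The gate, expert shapes, and names are illustrative stand-ins, not Mixtral's actual implementation (real experts are full FFN blocks, not single linear maps):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_forward(token, gate_w, experts, k=2):
    """Route one token through the top-k of n experts (top-2-of-8 style)."""
    logits = gate_w @ token                # (n_experts,) router scores
    top = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    weights = softmax(logits[top])         # renormalize over the selected experts
    # Only k expert networks run; the other n-k are skipped entirely.
    return sum(w * experts[i](token) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(n_experts, d))
# Each "expert" here is a single linear map standing in for a full FFN.
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: W @ x for W in expert_mats]
out = moe_forward(rng.normal(size=d), gate_w, experts)
print(out.shape)  # (16,)
```

The compute saving comes from the `sum` only visiting the two selected experts, while all eight sets of weights still have to be hosted in memory.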
Scaling Large Language Models With Optimized Activation Usage
Summary
Scaling large language models with optimized activation usage means making AI models smarter and faster by carefully managing which parts of the model are used during training and inference. Instead of always using all parts of a large model, new approaches selectively activate certain components and external memory, saving on computation and memory while still delivering great results.
- Use selective activation: Choose specialized model components to handle each task, reducing unnecessary computation and improving speed for both training and deployment.
- Prioritize memory efficiency: Apply strategies like offloading activations or compressing intermediate steps so you can train and run large models on standard hardware without sacrificing quality.
- Adopt smarter retrieval: Integrate external memory and retrieval mechanisms so the model recalls facts quickly, freeing up resources for complex reasoning and making large models more practical for real-world use.
VeLoRA: Memory-Efficient Training using Rank-1 Sub-Token Projections

Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models is still far too computationally and memory intensive. In this paper, we identify and characterise the important components needed for effective model convergence using gradient descent. In doing so we find that the intermediate activations used to implement backpropagation can be excessively compressed without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm simply divides the tokens up into smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. We confirm the effectiveness of our algorithm as being complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset.
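The compression step the abstract describes can be sketched in a few lines of numpy. The function names, dimensions, and sub-token size below are illustrative, not the paper's actual code:

```python
import numpy as np

def compress(x, v):
    """Forward pass: split a token into sub-tokens and keep only each
    sub-token's projection onto a fixed rank-1 direction v (one scalar each)."""
    s = v.shape[0]
    subs = x.reshape(-1, s)          # (n_sub, s) sub-tokens
    return subs @ v                  # (n_sub,) scalars stored for backprop

def reconstruct(coeffs, v):
    """Backward pass: coarsely rebuild the sub-tokens as coeff * v."""
    return np.outer(coeffs, v).reshape(-1)

rng = np.random.default_rng(1)
d, s = 64, 8                         # token dim, sub-token size (illustrative)
v = rng.normal(size=s)
v /= np.linalg.norm(v)               # fixed 1-D subspace direction
x = rng.normal(size=d)               # an intermediate activation
x_hat = reconstruct(compress(x, v), v)
# Storage for backprop drops from d floats to d/s floats per token.
print(x_hat.shape)  # (64,)
```

The memory win is the ratio d/s: here 64 stored floats become 8, at the price of a lossy reconstruction in the backward pass.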
-
Groundbreaking Research Alert: Making LLMs More Efficient with Smart Retrieval A fascinating paper from NAVER LABS Europe introduces a novel approach to optimize Large Language Models' retrieval mechanisms. The research shows how we can reduce retrieval operations by over 50% while maintaining or even improving performance. Key Technical Insights: - The system uses an "I Know" (IK) classifier that achieves 80% accuracy in determining when an LLM needs external knowledge - Only 32 tokens from the initial response are needed to make this determination - Training requires just 20,000 samples to achieve optimal performance - The approach works across multiple model families including Mistral, Llama, Gemma, and SOLAR Under the hood: - The system employs an LLM-as-judge architecture for training data generation - It uses adapters for fine-tuning larger models (7B+) - The IK score is computed using softmax on Yes/No token logits - Processing time is remarkably efficient: 3.7ms for IK classification, 8.3ms for generating 32 tokens Real-world Impact: - Reduces RAG processing time by up to 80% - Improves efficiency across various datasets including NQ, ASQA, HotpotQA - Particularly effective for general knowledge datasets like TriviaQA and SCIQ This research represents a significant step forward in making LLMs more efficient and practical for real-world applications. The ability to selectively activate retrieval mechanisms could be a game-changer for deployment at scale.
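The IK-score computation described above (softmax over the Yes/No token logits) reduces to a two-way softmax. A minimal sketch, with made-up logit values and a hypothetical 0.5 gating threshold:

```python
import math

def ik_score(yes_logit, no_logit):
    """Softmax restricted to the Yes/No token logits gives P('I know')."""
    m = max(yes_logit, no_logit)           # subtract max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def needs_retrieval(yes_logit, no_logit, threshold=0.5):
    """Skip the retrieval step whenever the model is confident it already knows."""
    return ik_score(yes_logit, no_logit) < threshold

print(needs_retrieval(3.2, -1.0))  # confident -> False, skip retrieval
print(needs_retrieval(-0.5, 2.0))  # unsure    -> True, retrieve
```

Because the classifier only needs the first 32 generated tokens, this cheap gate (a few milliseconds) can skip the much more expensive retrieval-augmented path for queries the model already handles.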
-
𝗔𝗜 𝗠𝗼𝗱𝗲𝗹𝘀 𝗔𝗿𝗲 𝗚𝗲𝘁𝘁𝗶𝗻𝗴 𝗗𝘂𝗺𝗯𝗲𝗿. 𝗕𝗲𝗰𝗮𝘂𝘀𝗲 𝗪𝗲’𝗿𝗲 𝗠𝗮𝗸𝗶𝗻𝗴 𝗧𝗵𝗲𝗺 𝗦𝗺𝗮𝗿𝘁𝗲𝗿 𝘁𝗵𝗲 𝗪𝗿𝗼𝗻𝗴 𝗪𝗮𝘆 Large Language Models (LLMs) like ChatGPT still use the same expensive computation to recall simple facts (“What’s the capital of France?”) as they do to solve a physics proof. This is like recalculating 2+2 from scratch every time you need it. It’s not just inefficient, it’s structurally broken. 𝗘𝗻𝗴𝗿𝗮𝗺: 𝗖𝗼𝗻𝗱𝗶𝘁𝗶𝗼𝗻𝗮𝗹 𝗠𝗲𝗺𝗼𝗿𝘆 𝗳𝗼𝗿 𝗟𝗟𝗠𝘀 Engram splits the model’s responsibilities like the human brain: - Neocortex: reasoning (Transformer backbone) - Hippocampus: memory lookup (Engram module) Instead of overloading neural layers with static knowledge, Engram lets LLMs look it up in O(1) from an external memory, just like we do. 𝗧𝗵𝗲 𝗥𝗲𝘀𝘂𝗹𝘁𝘀? 𝗦𝘂𝗿𝗽𝗿𝗶𝘀𝗶𝗻𝗴. Tested on a 27B-parameter model, Engram outperformed a same-sized MoE baseline on: - General reasoning (BBH: +5.0) - Long-context (Multi-Query NIAH: 84.2 → 97.0) - Code & math (HumanEval: +3.0) - Factual tasks (CMMLU: +4.0) - <3% latency overhead, even with 100B parameters offloaded to host memory. No compute bottlenecks. Just smarter design. 𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆: We don’t need bigger models. We need better brains. By offloading memorization to memory, Engram frees up the model’s attention for actual reasoning. It’s an architectural shift, not just an optimization. Watch this space. This is a glimpse into the next generation of sparse models: modular, memory-augmented, and efficient by design. The future of LLMs might not be in scaling up, but scaling smarter. #AI #LLM #DeepLearning #Transformers #AIarchitecture #Sparsity #MachineLearning #NLP #MoE #Engram #DeepSeekAI #AIResearch #MemoryAugmentedModels #ThoughtLeadership
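The O(1) lookup idea can be caricatured with a plain hash map standing in for the Engram module. Everything below is illustrative of the split between recall and reasoning, not Engram's actual mechanism:

```python
# "Hippocampus": an external memory with O(1) average-case hash lookup.
external_memory = {
    ("capital", "France"): "Paris",
    ("capital", "Japan"): "Tokyo",
}

def answer(query_key, reason_fn):
    """Try cheap recall first; fall back to full model compute only on a miss."""
    hit = external_memory.get(query_key)   # O(1) lookup, no neural layers touched
    if hit is not None:
        return hit                         # recalled fact, near-zero compute
    return reason_fn(query_key)            # "neocortex": expensive reasoning path

print(answer(("capital", "France"), lambda q: "<run full model>"))  # Paris
```

The point of the sketch: static knowledge lives outside the network, so the Transformer's parameters and attention budget are spent only on queries that actually require reasoning.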
-
I think that LLMs will continue to scale to trillions of parameters, so pipeline parallelism (PP) will remain a key strategy for efficient training. PipeOffload unlocks scalable pipeline parallelism through memory optimization, removing the activation-memory bottleneck observed in current PP schedules. Here’s why this matters: 🔹 Efficient Offloading: Empirical studies show that at least 50%, and sometimes 100%, of activation memory can be offloaded with negligible performance cost. 🔹 Selective Offload Strategy: When full offload isn’t feasible, prioritizing activations with longer lifespans drastically reduces peak memory, making PP more efficient. 🔹 Breakthrough in PP vs. Tensor Parallelism (TP): With PipeOffload integrated, pure PP becomes a stronger alternative to TP, delivering up to 19% acceleration with lower memory use and making distributed training more efficient at scale. 🔹 Scalability Insights: With PipeOffload, per-device activation memory scales better, keeping PP viable even as model sizes grow. The trade-offs in distributed training are shifting, making PP a first-class alternative to TP for large-scale AI workloads. The continuing theme for LLMs is more scalability, better performance, and an optimized computational and memory footprint. #genai #technology #artificialintelligence
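The selective-offload idea (move the longest-lived activations to host memory first) can be sketched as a greedy loop. The data structure and field names below are assumptions for illustration, not PipeOffload's API:

```python
def selective_offload(activations, budget):
    """Greedy sketch: offload longest-lifespan activations first until the
    on-device total fits the memory budget."""
    on_device = sorted(activations, key=lambda a: a["lifespan"], reverse=True)
    offloaded = []
    used = sum(a["size"] for a in on_device)
    while used > budget and on_device:
        a = on_device.pop(0)               # longest remaining lifespan
        offloaded.append(a["name"])        # send to host memory
        used -= a["size"]
    return offloaded, used

acts = [
    {"name": "act0", "size": 4, "lifespan": 10},  # alive across many pipeline stages
    {"name": "act1", "size": 4, "lifespan": 2},   # consumed almost immediately
    {"name": "act2", "size": 4, "lifespan": 7},
]
moved, peak = selective_offload(acts, budget=8)
print(moved, peak)  # ['act0'] 8
```

The intuition matches the post: an activation that sits idle for many pipeline stages occupies peak memory the longest, so offloading it buys the most headroom per byte transferred.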
-
[𝗖𝗼𝘂𝗻𝘁𝗱𝗼𝘄𝗻 𝘁𝗼 𝗔𝗖𝗟 - 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗗𝗲𝗲𝗽 𝗗𝗶𝘃𝗲] 🧠 How Do You Make Massive Mixture-of-Experts (MoEs) Actually Efficient? 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱 𝗧𝗵𝗲𝗻 𝗨𝗡𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱 𝗣𝗿𝘂𝗻𝗶𝗻𝗴 ⚡️ Mixture-of-Experts (MoEs) help scale language models by only activating a few experts per input. But here’s the problem: even with sparse activation, you’re still hosting all the experts — and that’s expensive. Pruning seems like a natural solution. And traditionally, unstructured pruning (removing individual weights) beats structured pruning (removing whole components like experts) because it’s more flexible. But the new STUN method flips this idea on its head: ✂️ Prune entire experts first (structured), then prune individual weights (unstructured). ✅ The result is better performance than unstructured pruning alone. Why does this work? • STUN avoids the costly brute-force evaluation of which experts to prune. • Instead, it learns a latent structure — a similarity graph between experts — and prunes in a way that mimics global optimization. • Then it applies standard unstructured pruning within the experts that remain, now a much smaller optimization space. 🚀 The payoff: high sparsity (e.g. 40%) with minimal performance drop. Are you headed to ACL? Learn more about Snowflake’s presence: https://lnkd.in/ekgZ2RGe Association for Computational Linguistics
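A toy version of the structured-then-unstructured recipe, with a greedy dissimilarity heuristic standing in for STUN's learned similarity graph (all names, shapes, and the selection rule are illustrative):

```python
import numpy as np

def stun_prune(experts, n_keep, sparsity):
    """Sketch of structured-then-unstructured pruning:
    1) drop whole experts, greedily keeping the most mutually dissimilar ones
       (a stand-in for STUN's similarity-graph step);
    2) magnitude-prune individual weights inside the survivors."""
    norm = [e.ravel() / np.linalg.norm(e) for e in experts]
    keep = [0]
    while len(keep) < n_keep:
        # add the expert least similar (max of min cosine distance) to those kept
        best = max((i for i in range(len(experts)) if i not in keep),
                   key=lambda i: min(1 - norm[i] @ norm[j] for j in keep))
        keep.append(best)
    pruned = []
    for i in keep:
        w = experts[i].copy()
        cut = np.quantile(np.abs(w), sparsity)   # zero the smallest-magnitude weights
        w[np.abs(w) < cut] = 0.0
        pruned.append(w)
    return pruned

rng = np.random.default_rng(2)
experts = [rng.normal(size=(4, 4)) for _ in range(8)]  # 8 tiny toy experts
kept = stun_prune(experts, n_keep=4, sparsity=0.4)
print(len(kept))  # 4 experts survive, each with ~40% of weights zeroed
```

The ordering matters: removing redundant experts first shrinks the search space, so the unstructured step only has to optimize over the weights that remain.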
-
🚀 Brain-Inspired AI Breakthrough: SpikingBrain and the Future Beyond Transformers Over the past few years, Transformers have powered the rise of large language models (LLMs) like GPT, Gemini, Claude, and Llama. But as these models grow larger, they hit major bottlenecks: training costs soar, energy use skyrockets, and long-context reasoning remains painfully slow. A new paper just released on arXiv introduces something potentially game-changing: SpikingBrain – a family of brain-inspired large models (https://lnkd.in/eW9cmbSy). Why this matters Instead of relying purely on the Transformer formula, SpikingBrain borrows ideas from the human brain: ✨ Spiking neurons – mimicking how biological neurons fire only when needed, cutting energy waste. ⚡ Hybrid linear attention – compressing memory and processing ultra-long contexts far more efficiently. 🧩 Multi-scale sparsity – combining neuron-level sparsity with modular network design for speed and scalability. And here’s the kicker: 👉 These models were trained without NVIDIA GPUs, instead using a Chinese-developed MetaX GPU cluster. This is the first time brain-inspired LLMs have been scaled up on a non-NVIDIA platform, proving alternatives are possible. Key highlights: 🟢 SpikingBrain-7B: efficient linear model, handles up to 4M tokens with over 100× speedup in long-context inference. 🔵 SpikingBrain-76B: hybrid MoE model, rivaling Llama-70B and Mixtral-8×7B while using only ~2% of the training data normally required. 🌱 Energy savings: their spiking scheme cuts compute energy by up to 97% compared with standard methods. Why you should care This work is not just about one research paper—it’s about the future of AI efficiency, hardware independence, and innovation inspired by the brain. If validated and adopted, approaches like SpikingBrain could change the balance of the global AI ecosystem. This is just the beginning. 
In my upcoming articles, I’ll dive deeper into: 🔬 The technical side: How SpikingBrain really works and what makes it unique. 🌍 The big picture: What this means for NVIDIA’s dominance, new hardware ecosystems, and the global AI race. 👉 Follow me to stay tuned for the next posts.
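The "fire only when needed" idea behind spiking neurons can be illustrated with a textbook leaky integrate-and-fire unit. This is a generic sketch of the principle, not SpikingBrain's actual neuron model; the threshold and leak values are arbitrary:

```python
def lif_neuron(inputs, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire sketch: the neuron accumulates input,
    leaks a little potential each step, and emits a spike (1) only when
    the threshold is crossed -- otherwise it stays silent (0)."""
    v, spikes = 0.0, []
    for x in inputs:
        v = leak * v + x          # integrate with leak
        if v >= threshold:
            spikes.append(1)      # fire
            v = 0.0               # reset after firing
        else:
            spikes.append(0)      # silent: no downstream compute triggered
    return spikes

print(lif_neuron([0.3, 0.3, 0.6, 0.1, 0.0, 1.2]))  # [0, 0, 1, 0, 0, 1]
```

The energy argument falls out directly: downstream work happens only on the sparse 1s, while the dense stretches of 0s cost almost nothing, which is the event-driven behavior the post contrasts with always-on Transformer activations.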