🚀 𝗘𝗶𝗴𝗲𝗻 𝗔𝗜 𝗶𝘀 𝘁𝗵𝗲 𝗻𝗲𝘄 #𝟭 𝗼𝗻 𝘁𝗵𝗲 𝗔𝗿𝘁𝗶𝗳𝗶𝗰𝗶𝗮𝗹 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀 𝗞𝗶𝗺𝗶 𝗞𝟮.𝟲 𝗹𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱 — 𝟮𝟲𝟱 𝘁𝗼𝗸𝗲𝗻𝘀/𝘀𝗲𝗰 — 𝐢𝐧 𝐜𝐨𝐥𝐥𝐚𝐛𝐨𝐫𝐚𝐭𝐢𝐨𝐧 𝐰𝐢𝐭𝐡 Nebius 𝐓𝐨𝐤𝐞𝐧 𝐅𝐚𝐜𝐭𝐨𝐫𝐲. We achieved it with B200 instead of GB300. Topping the leaderboard without Blackwell Ultra GPU isn't a silicon story — it's a serving stack story. Every layer of EigenInference is co-designed for trillion-parameter MoE workloads, and Kimi K2.6 is exactly where that pays off. A few of the ingredients: ⚡ A parallelism strategy tuned to K2.6's actual workload shape — not the conventional MoE playbook ⚡ Specialized attention kernels, separate paths for prefill and decode ⚡ Carefully tuned quantization strategy for weights and KV Cache ⚡ Speculative decoding with a specialized draft model When the math is this tight and the model is this big, every microsecond is earned. Super proud of what our team pulled off. More coming. Stay tuned. 🔥 https://lnkd.in/gzSsZ6dy #LLMInference #AIInfrastructure #Kimi #LLM #TokenEfficiency #NeoCloud
About us
Artificial Efficient Intelligence. AGI Tomorrow, AEI Today.
- Website: www.eigenai.com
- Industry: Technology, Information and Internet
- Company size: 11-50 employees
- Type: Privately Held
Updates
-
Today, we're announcing that Eigen AI is joining Nebius (NASDAQ: NBIS). From day one, our mission has been Artificial Efficient Intelligence — building the world's most efficient engines for generating intelligence. With Nebius, we're working toward the best AI cloud, uniting Eigen's industry-leading AI model customization and inference software, with Nebius's global hardware and infrastructure footprint, so any developer or enterprise can run the best models at the best price, with no capacity ceiling. After close, Eigen's optimization stack will be integrated directly into Nebius Token Factory. The entire Eigen AI team is joining Nebius in full, establishing Nebius's engineering and research presence in the San Francisco Bay Area. To our customers, our team, our investors at Tectonic Ventures, E14 Fund, Uncorrelated Ventures, and AGI House Ventures, our angel investors, advisors, mentors, and supporters — and to the Nebius team for the conviction and partnership — thank you. Ryan Hanrui Wang, co-founder and CEO of Eigen AI, said: “We’re proud to join Nebius and work alongside the Token Factory team to push the boundaries of inference performance. Nebius has built a world-class AI cloud with a deep engineering culture that perfectly aligns with our own. Together, we are removing the friction of AI model customization and deployment so developers can run models reliably in production without managing the underlying infrastructure.” Full announcement at: https://lnkd.in/gqpJKBvH
-
Now available on EigenInference: 𝐍𝐕𝐈𝐃𝐈𝐀 𝐍𝐞𝐦𝐨𝐭𝐫𝐨𝐧™ 3 𝐍𝐚𝐧𝐨 𝐎𝐦𝐧𝐢 is live at the 𝐄𝐢𝐠𝐞𝐧 𝐀𝐈 𝐌𝐨𝐝𝐞𝐥 𝐒𝐭𝐮𝐝𝐢𝐨! 🔥 NVIDIA's new open multimodal foundation model — 30B-A3B hybrid Transformer-Mamba MoE with a 256K context, unified vision + audio + video + text input — is built to power the perception sub-agent inside enterprise agent systems. Through close collaboration with NVIDIA, we've quantized Nemotron 3 Nano Omni to NVFP4 and deployed it on NVIDIA Blackwell, delivering 500+ output tokens per second per user with no quality loss across multiple multimodal benchmarks vs. the BF16 baseline. ⚡️ Computer-use, document intelligence, and audio-video understanding agents — all in a single multimodal loop, at production throughput, available at launch. Explore how EigenInference brings production-ready performance from day one 👉 https://lnkd.in/g7V67CRQ Try it today 👉 https://lnkd.in/g46sB-dc #EigenAI #NVIDIA #Nemotron #NVFP4 #Blackwell #MultimodalAI #AgenticAI
-
𝐐𝐮𝐚𝐧𝐭𝐢𝐳𝐚𝐭𝐢𝐨𝐧 does not just affect accuracy. It can also 𝐡𝐢𝐭 𝐭𝐡𝐫𝐨𝐮𝐠𝐡𝐩𝐮𝐭 in ways that are easy to miss. ⚠️ One place where this shows up clearly is 𝐬𝐩𝐞𝐜𝐮𝐥𝐚𝐭𝐢𝐯𝐞 𝐝𝐞𝐜𝐨𝐝𝐢𝐧𝐠. 🔄 When you use a draft model, performance depends heavily on the acceptance rate. If the draft model stays close to the target model, you accept more tokens and get the speedup. But quantization makes this much more fragile than it looks. Even small quality drops can significantly reduce the acceptance rate. Once the draft and target models stop aligning, the hit rate can collapse, and speculative decoding stops working as intended. 📉 At that point, you fall back to full decoding much more often, and throughput drops sharply. So you end up with a double penalty: 1. 𝐋𝐨𝐰𝐞𝐫 𝐚𝐜𝐜𝐮𝐫𝐚𝐜𝐲 2. 𝐋𝐨𝐰𝐞𝐫 𝐭𝐡𝐫𝐨𝐮𝐠𝐡𝐩𝐮𝐭 👉 The key point is that quantization cannot be evaluated on its own. In real systems, it directly interacts with decoding. A small quality regression can easily turn into a much larger system level slowdown. #Quantization #Throughput #SpeculativeDecoding #AI #MachineLearning #EigenAI
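To make the throughput effect concrete, here is a minimal sketch of how acceptance rate translates into tokens per verification pass. It assumes each drafted token is accepted independently with probability `alpha` and that `k` tokens are drafted per pass. Both names and the independence assumption are illustrative simplifications, not a description of any particular serving stack, and the cost of running the draft model is ignored.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha. A pass always yields at least one token, because the
    target model supplies the token at the first rejected position.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    k = 4  # hypothetical draft length
    for alpha in (0.9, 0.8, 0.6, 0.4):
        tokens = expected_tokens_per_step(alpha, k)
        print(f"alpha={alpha:.1f}: {tokens:.2f} tokens per pass")
```

Under these assumptions, a drop in acceptance rate from 0.8 to 0.6 with four drafted tokens cuts the expected yield from roughly 3.36 to roughly 2.31 tokens per pass, which is why a modest quality regression can show up as a much larger throughput loss.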
-
Here's where things get tricky. These are patterns we've run into ourselves or seen come up repeatedly: 1. 𝐁𝐥𝐚𝐦𝐢𝐧𝐠 𝐪𝐮𝐚𝐧𝐭𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐰𝐡𝐞𝐧 𝐢𝐭'𝐬 𝐚𝐜𝐭𝐮𝐚𝐥𝐥𝐲 𝐚 𝐬𝐲𝐬𝐭𝐞𝐦 𝐢𝐬𝐬𝐮𝐞. ⚠️ When accuracy drops, it’s easy to assume quantization is the cause. But in many cases, the issue comes from the serving system: batching, padding, or runtime behavior can cause numerical divergence. 2. 𝐁𝐥𝐚𝐦𝐢𝐧𝐠 𝐭𝐡𝐞 𝐦𝐨𝐝𝐞𝐥 𝐰𝐡𝐞𝐧 𝐢𝐭'𝐬 𝐚𝐜𝐭𝐮𝐚𝐥𝐥𝐲 𝐪𝐮𝐚𝐧𝐭𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐨𝐟 𝐭𝐡𝐞 𝐊𝐕 𝐜𝐚𝐜𝐡𝐞. 🔍 A common pattern is good results on short prompts, then degradation in production. The issue may stem from KV cache quantization that wasn’t adequately tested at realistic sequence lengths. The model is fine, but the setup is not. 3. 𝐈𝐠𝐧𝐨𝐫𝐢𝐧𝐠 𝐜𝐚𝐥𝐢𝐛𝐫𝐚𝐭𝐢𝐨𝐧 𝐝𝐚𝐭𝐚 𝐪𝐮𝐚𝐥𝐢𝐭𝐲. 📊 The effectiveness of your quantization hinges on the quality of your calibration dataset. If your calibration data does not match your production distribution, you can get clean eval results and still see poor real-world performance. This gets misdiagnosed as "the model just doesn't work for our use case." 4. 𝐎𝐯𝐞𝐫-𝐭𝐫𝐮𝐬𝐭𝐢𝐧𝐠 𝐭𝐚𝐬𝐤-𝐛𝐚𝐬𝐞𝐝 𝐛𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤 𝐝𝐞𝐥𝐭𝐚𝐬. 📉 Task-based benchmarks often employ all-or-nothing scoring, meaning a model that shifts from barely right to barely wrong loses a full point even though the actual distributional shift is negligible. This also works in reverse: quantized models might outscore the FP16 baseline due to noise. It's easy to panic over a 1-point MMLU drop that could simply be measurement variance rather than a genuine quality regression. If you recognize these patterns early, you can avoid chasing the wrong problems. #Quantization #AI #MachineLearning #ModelOptimization #EigenAI
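One quick way to sanity-check whether a benchmark delta is likely noise is to compare it with the binomial standard error of the score. The sketch below assumes all-or-nothing scoring on independent questions and treats the two runs as independent samples. The function names and the 1,000-question benchmark size are hypothetical, and a paired comparison on the same questions would typically be more sensitive.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def delta_z_score(acc_a: float, acc_b: float, n_questions: int) -> float:
    """z-score of the accuracy gap, treating the two runs as independent samples."""
    se_gap = math.sqrt(
        accuracy_standard_error(acc_a, n_questions) ** 2
        + accuracy_standard_error(acc_b, n_questions) ** 2
    )
    return (acc_a - acc_b) / se_gap


if __name__ == "__main__":
    n = 1000  # hypothetical benchmark size
    baseline, quantized = 0.70, 0.69  # a 1-point drop
    z = delta_z_score(baseline, quantized, n)
    print(f"standard error of baseline: {accuracy_standard_error(baseline, n):.4f}")
    print(f"z-score of the 1-point gap: {z:.2f}")
```

In this hypothetical setup the 1-point gap is well under one standard error of the difference, so it would not be distinguishable from measurement variance without more questions or repeated runs.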
-
Some patterns show up again and again. If you see these, it’s very likely a quantization issue. 1. 𝐋𝐚𝐲𝐞𝐫-𝐒𝐩𝐞𝐜𝐢𝐟𝐢𝐜 𝐒𝐞𝐧𝐬𝐢𝐭𝐢𝐯𝐢𝐭𝐲 🧩 If accuracy drops sharply after quantizing attention projections or the initial/final layers, that’s usually a sign. These layers typically exhibit higher dynamic ranges and are more sensitive to precision, so issues often show up here first. 2. 𝐋𝐨𝐧𝐠-𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐃𝐞𝐠𝐫𝐚𝐝𝐚𝐭𝐢𝐨𝐧 📚 If your model performs well with 2K tokens but begins to hallucinate or lose coherence at 32K+, this may signal accumulating KV cache quantization errors throughout the sequence. Monitoring performance across varying context lengths can help pinpoint these issues. 3. 𝐌𝐨𝐄 𝐑𝐨𝐮𝐭𝐢𝐧𝐠 𝐈𝐧𝐬𝐭𝐚𝐛𝐢𝐥𝐢𝐭𝐲 ⚙️ In Mixture-of-Experts models like DeepSeek or Qwen3, if expert selection becomes noisy or inconsistent after quantization, it’s usually a precision issue in the gating network rather than a problem with the model architecture. If you watch for these patterns early, you can usually spot quantization issues quickly and avoid wasting hours debugging the wrong thing. ⏳ #Quantization #AI #MachineLearning #ModelOptimization #EigenAI
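For the MoE routing pattern, one possible check is to compare which experts each token is routed to before and after quantization. The sketch below assumes you can dump per-token gate logits from both runs as plain lists. The function names and the toy logits are hypothetical, and real routers may apply extra steps (normalization, capacity limits) that this ignores.

```python
def top_k_experts(gate_logits, k):
    """Indices of the k largest gate logits (the experts a token is routed to)."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    return set(ranked[:k])


def routing_agreement(ref_logits, quant_logits, k=2):
    """Mean fraction of top-k experts shared between reference and quantized runs.

    Each argument is a list with one list of gate logits per token.
    """
    if len(ref_logits) != len(quant_logits) or not ref_logits:
        raise ValueError("both runs must cover the same non-empty set of tokens")
    overlaps = [
        len(top_k_experts(r, k) & top_k_experts(q, k)) / k
        for r, q in zip(ref_logits, quant_logits)
    ]
    return sum(overlaps) / len(overlaps)


if __name__ == "__main__":
    # Toy gate logits for two tokens and four experts (made-up numbers).
    reference = [[2.0, 1.0, 0.1, -1.0], [0.3, 0.2, 1.5, 1.4]]
    quantized = [[1.9, 1.1, 0.0, -0.9], [0.3, 1.45, 1.5, 1.4]]
    print(f"top-2 routing agreement: {routing_agreement(reference, quantized):.2f}")
```

An agreement score that falls as more of the gating path is quantized would point toward router precision rather than the architecture, though what counts as an acceptable score is something to calibrate per model.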
-
This is one of those questions that sounds simple, but in practice it’s where teams lose a lot of time. We’ve seen people spend days tuning quantization, only to realize later the issue wasn’t quantization at all. So when performance starts to degrade, here’s what we actually check first.👇 First, always run the FP16 or BF16 baseline on the exact same inputs. It sounds obvious, but this step gets skipped surprisingly often. Without a clean baseline, everything else becomes guesswork. From there, we usually go through a few signals in order: 1. 𝐏𝐞𝐫𝐩𝐥𝐞𝐱𝐢𝐭𝐲 𝐃𝐞𝐥𝐭𝐚 📊 We use perplexity as a quick check. If it goes up by around 0.5–1% compared to the baseline on a representative evaluation set, that is already a warning sign. But a small delta does not mean you are safe. We have seen models pass this check and still fail on real tasks. 2. 𝐓𝐚𝐬𝐤-𝐒𝐩𝐞𝐜𝐢𝐟𝐢𝐜 𝐀𝐜𝐜𝐮𝐫𝐚𝐜𝐲 𝐨𝐧 𝐂𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐢𝐧𝐠 𝐁𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤𝐬 🧠 Next, we look at more challenging tasks, which are benchmarks with multi-step reasoning or long workflows, such as HLE, T2-Bench, WorkArena, and CURE-Bench. This is usually where quantization issues first appear, since errors accumulate across steps. 3. 𝐎𝐮𝐭𝐩𝐮𝐭 𝐃𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧 𝐃𝐢𝐯𝐞𝐫𝐠𝐞𝐧𝐜𝐞 📈 Instead of merely comparing final outputs, we analyze KL divergence across the top-N token probabilities between the FP16 reference and the quantized model. This approach reveals shifts in decision-making processes that perplexity and benchmark scores might miss. We also separate prefill and decode, since different quantization choices tend to impact them differently. 4. 𝐓𝐚𝐢𝐥 𝐁𝐞𝐡𝐚𝐯𝐢𝐨𝐫 𝐚𝐧𝐝 𝐅𝐚𝐢𝐥𝐮𝐫𝐞 𝐌𝐨𝐝𝐞𝐬 🚨 We pay close attention to issues like repetitive loops, early stopping, incoherent outputs on long-context inputs, or drops on rare and domain specific tokens. These patterns usually point to quantization issues rather than problems with the model itself. The main takeaway is that no single metric is enough. You need a mix of fast checks, task evaluation, and pattern recognition. 
Otherwise it is very easy to optimize the wrong thing. ❌ #Quantization #AI #MachineLearning #ModelOptimization #EigenAI
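The perplexity and divergence checks described above can be sketched in a few lines. This is an illustrative version only: it assumes per-position token probabilities are available as dictionaries, restricts the comparison to the reference model's top-N tokens, and renormalizes both distributions over that set. The function names, the default `n`, and the toy numbers are assumptions.

```python
import math


def relative_perplexity_delta(ppl_baseline: float, ppl_quantized: float) -> float:
    """Relative perplexity increase of the quantized model over the baseline."""
    return (ppl_quantized - ppl_baseline) / ppl_baseline


def top_n_kl(ref_probs, quant_probs, n=5, eps=1e-12):
    """KL(ref || quant) restricted to the reference model's top-n tokens.

    Both inputs map token -> probability for one decoding position. The two
    distributions are renormalized over the selected tokens before comparison.
    """
    top_tokens = sorted(ref_probs, key=ref_probs.get, reverse=True)[:n]
    p = [ref_probs[t] for t in top_tokens]
    q = [max(quant_probs.get(t, 0.0), eps) for t in top_tokens]
    p_sum, q_sum = sum(p), sum(q)
    return sum(
        (pi / p_sum) * math.log((pi / p_sum) / (qi / q_sum))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


if __name__ == "__main__":
    # Made-up numbers for illustration.
    delta = relative_perplexity_delta(10.0, 10.1)
    print(f"perplexity delta: {delta:.2%}")  # compare against the 0.5-1% warning range
    ref = {"the": 0.6, "a": 0.3, "an": 0.1}
    quant = {"the": 0.5, "a": 0.35, "an": 0.15}
    print(f"top-3 KL divergence: {top_n_kl(ref, quant, n=3):.4f}")
```

Averaging the divergence separately over prefill and decode positions keeps the two phases distinguishable, in line with the point about quantization choices affecting them differently.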
-
Most teams already know that inference breaks at scale. So why do they still get it wrong? In previous posts, we discussed how AI systems fail in production — from unstable latency to scheduling bottlenecks. But here’s the real issue: 👉 It’s not that teams don’t see the problem. It’s that they optimize for the wrong layer. Here’s what we keep seeing: 1️⃣ 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐢𝐧𝐠 𝐟𝐨𝐫 𝐏𝐞𝐚𝐤 𝐏𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 — 𝐍𝐨𝐭 𝐒𝐭𝐚𝐛𝐢𝐥𝐢𝐭𝐲 Benchmarks reward max throughput, but production systems need to survive real traffic. What looks fast in isolation often becomes unstable under load. 2️⃣ 𝐓𝐫𝐞𝐚𝐭𝐢𝐧𝐠 𝐭𝐡𝐞 𝐌𝐨𝐝𝐞𝐥 𝐚𝐬 𝐭𝐡𝐞 𝐁𝐨𝐭𝐭𝐥𝐞𝐧𝐞𝐜𝐤 When things break, the instinct is to switch models, fine-tune, or optimize kernels. However, at scale, the bottleneck is often not the model, it’s the system surrounding it. 3️⃣ 𝐒𝐜𝐚𝐥𝐢𝐧𝐠 𝐇𝐚𝐫𝐝𝐰𝐚𝐫𝐞 𝐈𝐧𝐬𝐭𝐞𝐚𝐝 𝐨𝐟 𝐒𝐲𝐬𝐭𝐞𝐦𝐬 Adding more GPUs feels like progress. But without proper batching, scheduling, and memory management, more resources don’t translate into better performance. 4️⃣ 𝐀𝐬𝐬𝐮𝐦𝐢𝐧𝐠 𝐓𝐞𝐬𝐭𝐢𝐧𝐠 𝐑𝐞𝐟𝐥𝐞𝐜𝐭𝐬 𝐏𝐫𝐨𝐝𝐮𝐜𝐭𝐢𝐨𝐧 Offline evaluation ≠ real-world behavior. Systems that pass benchmarks often fail under concurrent requests, long-running workloads, and unpredictable traffic. At scale, these mistakes compound. Not because teams lack expertise — but because the system is being optimized at the wrong layer. 👉 The gap isn’t in the model, it’s in how the system is designed. That’s why these problems persist, and why they often don’t surface until systems hit real production traffic. Which is also why many AI products break after launch, not before. 💡 What strategies have you found effective in addressing these challenges? #AI #Inference #MachineLearning #Scalability #TechLeadership #SystemDesign #Performance
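A toy queueing model helps illustrate why peak throughput says little about behavior under load. The sketch below uses the textbook M/M/1 result for mean time in system, 1 / (service_rate - arrival_rate). Real inference servers batch requests and are not M/M/1 queues, so the function name and the numbers are purely illustrative assumptions.

```python
def mean_latency_mm1(service_rate: float, arrival_rate: float) -> float:
    """Mean time in system (queueing plus service) for an M/M/1 queue, in seconds.

    service_rate: requests per second the server can complete at peak.
    arrival_rate: requests per second actually arriving.
    """
    if arrival_rate >= service_rate:
        return float("inf")  # the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    peak = 100.0  # hypothetical benchmark peak, requests per second
    for load in (50.0, 90.0, 99.0):
        latency_ms = mean_latency_mm1(peak, load) * 1000.0
        print(f"load={load:.0f} req/s -> mean latency {latency_ms:.0f} ms")
```

In this simplified model, the same server that answers in about 20 ms at half load takes about a second at 99% load, which is the kind of gap between isolated benchmarks and production traffic the post describes.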