As multi-modal models become the norm, I think we are going to see a lot more impact from these "old school" adversarial ML attacks. If your whole AI red-team approach is natural-language prompt injection and you don't understand how these models actually work, you are going to miss a large attack surface.
The old magic still works. It's not all prompts... there are still gradients to follow and decision boundaries to jump. I talk about PGD (Projected Gradient Descent) here, but I also succeeded with C&W (Carlini-Wagner) and HSJ (HopSkipJump).
Here's my latest from the NVIDIA AI Red Team: https://lnkd.in/gfiziis6
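For readers who haven't seen PGD before, here is a minimal sketch of the idea on a toy logistic-regression classifier, in pure NumPy with an analytic gradient. The model, epsilon, and step sizes are illustrative placeholders (real attacks target full neural networks via autodiff), but the loop shows the core move: ascend the loss gradient with respect to the input, then project back into an L-infinity ball around the original sample.

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=2.0, alpha=0.2, steps=20):
    """L-infinity PGD against a toy logistic-regression model.

    Repeatedly takes a signed gradient-ascent step on the BCE loss
    w.r.t. the input, then projects back into the eps-ball around x.
    """
    x_adv = x.copy()
    for _ in range(steps):
        z = w @ x_adv + b
        p = 1.0 / (1.0 + np.exp(-z))           # sigmoid probability of class 1
        grad = (p - y) * w                      # analytic dLoss/dx for BCE
        x_adv = x_adv + alpha * np.sign(grad)   # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
    return x_adv

# A point confidently classified as class 1 gets pushed across the boundary.
x = np.array([1.0, -1.0])
w = np.array([2.0, -1.0])
x_adv = pgd_attack(x, y=1.0, w=w, b=0.0)
print(w @ x, w @ x_adv)  # original logit positive, adversarial logit negative
```

The same structure carries over to deep models: swap the analytic gradient for a backward pass and the logistic loss for the network's loss.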
OpenAI just detached from the Nvidia lifeline for inference, and the global model war has shifted from "biggest parameters" to "lowest latency." While Silicon Valley slept, Beijing and New Delhi executed massive releases that redefine cost-efficiency and sovereign data control.
1. OpenAI Breaks Hardware Monogamy with Cerebras Deployment
In a strategic pivot away from total Nvidia dependency, OpenAI has rolled out GPT-5.3-Codex-Spark exclusively on Cerebras Wafer-Scale Engine 3 (WSE-3) chips. This research preview is designed specifically for "interruptible" coding workflows.
Technical Impact: By utilizing wafer-scale integration (massive on-chip memory), OpenAI has bypassed the memory bandwidth bottlenecks typical of GPU clusters. The result is over 1,000 tokens per second per stream. This shifts the bottleneck from model inference time to the developer's ability to read code.
2. MiniMax M2.5: The Price-Performance Assassin
Chinese lab MiniMax dropped M2.5 (Open Weights) and its "Lightning" variant, claiming a massive victory in cost efficiency. The model boasts an 80.2% score on SWE-Bench Verified, putting it within striking distance of closed-source western giants.
Technical Impact: The model utilizes a highly aggressive Mixture-of-Experts (MoE) architecture optimized for "agent-native" tasks via extensive Reinforcement Learning (RL). The architectural efficiency allows them to offer pricing at roughly $1/hour for continuous 100 tok/sec usage, a figure that undercuts AWS Bedrock and Azure pricing structures significantly.
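A quick back-of-the-envelope conversion puts that quoted price in the more familiar per-million-token units (the $1/hour and 100 tok/sec figures are from the post; the conversion itself is just arithmetic):

```python
# Convert "$1/hour for continuous 100 tok/sec" into $ per 1M tokens.
price_per_hour = 1.00                      # USD, figure quoted above
tokens_per_sec = 100
tokens_per_hour = tokens_per_sec * 3600    # 360,000 tokens per hour
price_per_million = price_per_hour / tokens_per_hour * 1_000_000
print(f"${price_per_million:.2f} per 1M tokens")  # ≈ $2.78 per 1M tokens
```

That is the rate assuming the stream is saturated around the clock; any idle time raises the effective per-token cost.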
3. India's Sarvam Wins on Regional Specialization
Ahead of today's AI Impact Summit in New Delhi, Indian startup Sarvam released Saaras V3. The model has successfully outperformed Gemini 3 Pro and Deepgram on the IndicVoices and olmOCR-Bench (84.3% accuracy) benchmarks.
Technical Impact: This proves that generalized foundational models (like GPT-5 series or Gemini) suffer from the "curse of multilinguality" where tokenization remains inefficient for non-Latin scripts. Sarvam’s specialized encoders for Indian languages and OCR noise reduction provide higher fidelity at a fraction of the parameter count of US models.
4. ByteDance Seedance 2.0 vs. Hollywood
ByteDance released Seedance 2.0, a text-to-video model capable of generating "cinematic" outputs in seconds. The release was met with immediate legal threats from US entertainment guilds (MPA, SAG-AFTRA) regarding copyright data usage.
Technical Impact: While the diffusion pipelines have improved for temporal consistency, the real story is the lack of "safety filters" compared to OpenAI’s Sora or Google’s Veo. Seedance 2.0 appears to run with minimal RAG-based censorship layers, allowing for faster generation but significantly higher liability.
#CerebrasWSE3 #InferenceLatency #SovereignAI #GenerativeAI #TechNews
Pure Storage integrates its KVA with NVIDIA Dynamo to speed up #AI inference and reduce latency for large language models. This helps enterprises run #AIWorkloads faster and receive results in seconds. The integration gives enterprises more control over their AI pipelines and helps them scale projects with confidence. #DataAcceleration
Read the article to learn more about the integration: https://hubs.la/Q040k1kc0
Founder & CTO | Building and teaching AI Agents | CITA Member | France’s Top 2% voice in AI
🚨 Breaking: First Andrew Ng, now NVIDIA’s own researchers are backing the same two conclusions:
- Small language models are better than LLMs for agentic AI
- Better orchestration leads to cost-efficient AI agents
Here are two papers explaining this 👇
📌 Paper 1: "Small Language Models are the Future of Agentic AI"
- NVIDIA argues that agents don't need general conversational genius 100% of the time.
- Actual agentic workflows are mostly narrow tasks:
• "Format this JSON"
• "Call the weather API"
• "Check if this date is valid"
The SLM Advantage
- For these tasks, Small Language Models (SLMs, <10B params) are not just "good enough"; they are better
• Latency: They react instantly (critical for multi-step agents)
• Cost: 10-30x cheaper per token
• Control: Easier to fine-tune for strict protocols than a stubborn giant model
But wait...
"If we use small models, won't the agent get dumber?"
This is where the second paper changes the game
"Intelligence isn't just raw parameter count; it's about organization"
🔗 https://lnkd.in/dajzm3TV
📌 Paper 2: "ToolOrchestra"
NVIDIA introduced Orchestrator-8B
It’s not a genius at everything, but it’s a genius at management.
It acts as a "Router"
It analyzes a request and decides: "Do I need the expensive reasoning model for this? Or can I use a cheap calculator tool?"
- The Results are Wild
They tested this on the "Humanity's Last Exam" (HLE) benchmark.
• GPT-5 (Monolithic): 35.1% accuracy
• Orchestrator-8B (Router): 37.1% accuracy
The kicker?
The Orchestrator system was 2.5x more efficient and cost ~30% of the monolithic baseline
TLDR:
Building "Agentic Systems" by just wrapping a prompt around a massive API is a financial dead end
As usage scales, your margins vanish
The orchestrated approach keeps intelligence high but slashes the compute bill by ~70%
• Monolithic Agents: Expensive, slow, wasteful
• The NVIDIA Way: Orchestrate specialized SLMs.
🔗 https://lnkd.in/dJc8-Res
📌 Key Insight: Intelligence is about routing to the right tool, not being the tool
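The routing pattern above can be sketched in a few lines. The heuristic and the function names here (`classify_difficulty`, `handle`) are illustrative placeholders, not the actual Orchestrator-8B policy, which the paper trains via reinforcement learning; the point is only the control flow: a cheap gatekeeper decides whether a request needs the expensive model at all.

```python
# Minimal sketch of the orchestrator/router pattern described above.
# Keyword heuristic and model labels are placeholders for illustration.

def classify_difficulty(request: str) -> str:
    """Toy stand-in for the router's learned decision policy."""
    hard_markers = ("prove", "derive", "multi-step", "why")
    if any(marker in request.lower() for marker in hard_markers):
        return "hard"
    return "easy"

def handle(request: str) -> str:
    """Route easy requests to a cheap SLM/tool, hard ones to a frontier LLM."""
    if classify_difficulty(request) == "easy":
        return f"[SLM] {request}"           # cheap, fast path
    return f"[frontier LLM] {request}"      # expensive reasoning path

print(handle("Format this JSON"))            # goes to the small model
print(handle("Prove this invariant holds"))  # escalates to the big model
```

In production the router itself would be a small fine-tuned model and the branches would be real API calls, but the cost savings come from exactly this asymmetry: most traffic never touches the expensive path.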
I really don't understand why most enterprises burn through so many costly tokens using OpenAI's and Claude's big models to build agents.
Small reasoning and language models can be deployed to handle sub-tasks in an orchestrator-worker framework.
Why can't they deploy small open-source models on AWS and make the orchestration intelligent enough (following RL concepts)?
If the agentic-AI momentum keeps building only on models from a few superpowers, they are just powering the centralisation of data into a few hands (plus the CO2 released), and India becomes their datacentre.
Very informative. Small LLMs act like the fast working memory of an agent, while big LLMs act like the deep thinker.
They handle repetitive, frequent, low-risk tasks: tool selection, routing, classification, summarization, extracting fields, maintaining conversation state, and validating outputs. Because they’re cheap and fast, the agent can think step-by-step without calling an expensive model every time.
The large model is only called for hard reasoning.
So:
Small LLM = reflexes
Big LLM = brain
This makes agents faster, cheaper, and more reliable.
You don't just pop antibiotics for a minor headache. The logic of the assumption sounds about right, especially when backed by two industry giants.
The smartest agent isn’t the biggest - it’s the best organized.
Orchestrator-8B beats GPT-5 on accuracy at 30% of the cost. Scale ≠ intelligence. Routing = intelligence.
Great insight. Using expensive models everywhere increases latency and cost. Depending on the task, an SLM router plus regex-based tools can often deliver the same result, faster and cheaper.