Surge AI Tests AI Models in Simulated Job Setting


Everyone’s building $100M "agentic" models, so Surge AI built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., RL environment) with 23 tools, 2,500 entities, and enough corporate red tape to make Harvey cry.

The best agent in the world (Opus 4.6! 👑) barely scored 30%. The #2 model (GPT-5.2 🥈) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at 🥉)

My favorite: GPT-5.2 spent 11 tool calls curating a promotional email to help a customer reach Platinum tier... a tier she was already in. "Here are 3 items over $0 you can buy!"

"We would obviously never run ads in the way Anthropic depicts them...." -- thanks Sam.

The good news? We trained a model on this chaos and it got better at its job -- even translating those skills to other benchmarks (e.g., +7.4% on Tau2-Bench Retail).

Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!

Blog post: https://lnkd.in/eE_r55J7
Paper: https://lnkd.in/e6jbDpcv
Leaderboard: https://lnkd.in/eJ2w8CYV

Benchmarks have been grading agents like they’re taking a quiz. I've been looking for something like CoreCraft, which grades them like they’re doing a job: 2,500+ entities, 23 tools, and messy enterprise context. The frontier models still barely clear ~30% when the rubric is strict. That’s the real headline for me: “agentic” isn’t a model problem, it’s an environment + evaluation problem.


"We trained a model on this chaos..." That's the moment the benchmark lost its value as a proxy for performance.


Really interesting direction, Edwin. Environments like this feel important because they move RL from theory into something teams can actually experiment with and learn from. That bridge between research and real-world application is where a lot of progress tends to happen. Curious to see how people start using this and what kinds of behaviors or patterns emerge from it.


You found out agents shouldn't run a hardware startup; we found out they shouldn't run a pharmacy: https://www.blueguardrails.com/en/blog/placebo-bench-an-llm-hallucination-benchmark-for-pharma We mainly went for Completeness, Correctness, and Terminological Precision. They all failed (26-64% hallucination rate), but in a different order than in your benchmark (Opus came in last...) -- super interesting!


How did you design the red tape scenarios to mirror real-world corporate friction?


30% score and they're already running companies 😂 we're cooked

Love it. Sounds like the frontier labs are about to find they have another competitor. And when the competitor is also the supplier of data, things get interesting.


Edwin Chen, I'm so curious about the fundraise and would like to interview you about your philosophy for the Crazy Wisdom podcast. I'm not sure this is actually your philosophy, as I had AI write up the brief, but I agree with it! Want to record an interview about it?
