Surge AI Tests AI Models in Simulated Job Setting


Everyone’s building $100M "agentic" models, so Surge AI built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., RL environment) with 23 tools, 2,500 entities, and enough corporate red tape to make Harvey cry.

The best agent in the world (Opus 4.6! 👑) barely scored 30%. The #2 model (GPT-5.2 🥈) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at 🥉)

My favorite: GPT-5.2 spent 11 tool calls curating a promotional email to help a customer reach Platinum tier... a tier she was already in. "Here are 3 items over $0 you can buy!"

"We would obviously never run ads in the way Anthropic depicts them...." -- thanks Sam.

The good news? We trained a model on this chaos and it got better at its job -- even translating those skills to other benchmarks (e.g., +7.4% on Tau2-Bench Retail).

Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!

Blog post: https://lnkd.in/eE_r55J7
Paper: https://lnkd.in/e6jbDpcv
Leaderboard: https://lnkd.in/eJ2w8CYV

Benchmarks have been grading agents like they’re taking a quiz. I've been looking for something like CoreCraft, which grades them like they’re doing a job: 2,500+ entities, 23 tools, and messy enterprise context. The frontier models still barely clear ~30% when the rubric is strict. That’s the real headline for me: “agentic” isn’t a model problem, it’s an environment + evaluation problem.


"We trained a model on this chaos..." That's the moment the benchmark lost its value as a proxy for performance.


Really interesting direction, Edwin. Environments like this feel important because they move RL from theory into something teams can actually experiment with and learn from. That bridge between research and real-world application is where a lot of progress tends to happen. Curious to see how people start using this and what kinds of behaviors or patterns emerge from it.


You found out agents shouldn't run a hardware startup; we found out they shouldn't run a pharmacy: https://www.blueguardrails.com/en/blog/placebo-bench-an-llm-hallucination-benchmark-for-pharma We mainly went for Completeness, Correctness, and Terminological Precision. They all failed (26-64% hallucination rate), but in a different order than in your benchmark (Opus came in last...) -- super interesting!


How did you design the red tape scenarios to mirror real-world corporate friction?


30% score and they're already running companies 😂 we're cooked

Love it. Sounds like the frontier labs are about to find they have another competitor. And when the competitor is also the supplier of data, things get interesting.


Edwin Chen, I'm so curious about the fundraise and would like to interview you about your philosophy for the Crazy Wisdom podcast. I'm not sure this is actually your philosophy, as I had AI write up the brief, but I agree with it! Want to record an interview about it?
