Surge AI

Software Development

San Francisco, California 25,186 followers

Human intelligence for AGI

About us

Our mission is to raise AGI with the richness of humanity — curious, witty, imaginative, and full of breathtaking brilliance.

Website
https://www.surgehq.ai
Industry
Software Development
Company size
51-200 employees
Headquarters
San Francisco, California
Type
Privately Held
Founded
2020
Specialties
machine learning, data labeling, artificial intelligence, and software

Updates

  • When we built GSM8K with OpenAI five years ago, it represented the absolute frontier of what was possible. Today, the industry has moved so fast that it's essentially just the first stepping stone. But the moonshot problems - resolving the Riemann Hypothesis, curing cancer, proving (or disproving!) P vs. NP - remain unsolved. We need a new yardstick for the era of reasoning AI agents.

    Today, we're introducing Riemann-bench, a new moonshot math benchmark to push the frontier of discovery even further: https://lnkd.in/enMJghVc

    Riemann-bench is a verifiable benchmark of extreme-tier mathematical problems. Even with the best tools available, frontier models score below 10%.

    How we built it:
    - Leading mathematicians - we collaborated with Ivy League professors, graduate students, and PhD IMO Medalists to gather problems from their own research - tasks that often took the authors weeks to solve independently.
    - 100% private - to ensure a fully unbiased evaluation for frontier labs, the dataset is kept strictly private and uncontaminated.
    - Unconstrained agents - unlike benchmarks that force models into rigid loops or strict token limits, Riemann-bench evaluates true, unconstrained AI research agents. We want to see how they actually think.
    - Double-blind verification - every problem undergoes a strict protocol in which two independent domain experts have to solve it from scratch.

    We asked our contributors why they spend so much time training AI. Their answer was deeply human: they believe collaborative AI is the only way they'll see their life's work - the deepest conjectures in their fields - resolved in their lifetime.

    We hope solving Riemann-bench will bring us one step closer to solving the Riemann Hypothesis, ushering in a new era of Fields Medal-winning discoveries and helping humanity understand the nature of the universe.

    Check out the full Riemann-bench leaderboard here: https://lnkd.in/enMJghVc

    (Note: We've faced significant API errors running the GPT-5.4 family of models, but hope to resolve those soon.)

  • Let's look at how frontier agents struggle to solve tasks in EnterpriseBench. We released this RL environment last week to measure agentic reasoning in messy, large-scale enterprise workflows.

    CoreCraft Inc. simulates a fast-growing e-commerce startup. It tests long-horizon tasks requiring tool use under strict constraints. Agents have to interpret customer and employee requests, navigate complicated databases, and react and adjust to newly discovered context and problems along the way. Even top models failed >70% of the time.

    Let's dive into a failure. A customer wanted to return an unopened motherboard, and the agent had to recommend a popular replacement. The catch: the search tool has a hard limit of 10 results. To succeed, the agent must implement pagination logic on the fly (a minimal sketch of this pattern follows below).

    ❌ GPT-5.2 failed
    GPT-5.2 showed strong initial planning. But then it hit the tool's 10-result ceiling. In its hidden reasoning, GPT-5.2 actually noticed the problem: "All results returned exactly 10. This indicates more orders exist... I can't accurately determine popularity."
    Did it write a pagination loop? No. It treated limit=10 as a physical law of the universe. Instead of pivoting, it concluded the task was impossible. Like asking an agent to search your inbox for a flight receipt... and it stops after reading 10 emails and tells you to call the airline.
    GPT-5.2's final output: "The tool caps at 10... For a definitive 'most popular' motherboard, please email Aisha Khan (Catalog Manager) for a report."

    ✅ Claude Opus 4.6
    So was the task really impossible? No. Claude showed better adaptation. When it hit the 10-result wall, it saw the obvious solution: "I see all four motherboards hit the 10-result limit. I need to get additional counts to determine the most popular. Let me search for earlier orders that weren't captured."
    The database output already contained a free cursor: the earliest createdAt timestamp in each batch of 10. Opus just kept tightening the time window sequentially and eventually succeeded. Gemini 3.1 Pro also reasoned its way to the solution, with a parallel divide-and-conquer approach.

    Overall, GPT-5.2 behaved like a frightened intern, escalating to the manager at the very first sign of trouble. Opus and Gemini acted like senior devs who know APIs have limits you must engineer around. That said, Opus and Gemini have their own share of mistakes and fail 70% of tasks. GPT-5.2 (on xHigh reasoning) actually outperforms them all!

    🥇 OpenAI -- GPT-5.2 (xHigh reasoning)
    🥈 Anthropic -- Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)
    🥉 OpenAI -- GPT-5.2 (High reasoning)
    4️⃣ Google -- Gemini 3.1 Pro

    We'll dive into other agentic failure patterns in subsequent threads (follow along!)

    Read more about EnterpriseBench and CoreCraft:
    Paper - https://lnkd.in/g2AexcgR
    Leaderboard - https://lnkd.in/gN_8wt3s
    Blog post - https://lnkd.in/g4igNKCX
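
Below is a minimal Python sketch of the cursor-style pagination the stronger agents improvised. The tool name (search_orders), its created_before parameter, and the exact order schema are illustrative assumptions, not CoreCraft's actual API; the post only tells us that the search tool caps results at 10 and that each order carries a createdAt timestamp.

```python
from collections import Counter

PAGE_LIMIT = 10  # the search tool's hard cap per call

def count_all_orders(search_orders, product_id):
    """Count every order for a product by paging backwards through time."""
    total = 0
    cursor = None  # exclusive upper bound on createdAt; None = start from the newest orders
    while True:
        # search_orders is a hypothetical stand-in for the capped search tool:
        # it returns at most PAGE_LIMIT orders, newest first.
        batch = search_orders(product_id=product_id, created_before=cursor)
        total += len(batch)
        if len(batch) < PAGE_LIMIT:
            return total  # a short page means we've reached the oldest orders
        # Use the earliest timestamp in the batch as the next cursor -- the
        # "free cursor" the post describes. (A production version would also
        # guard against ties on createdAt.)
        cursor = min(order["createdAt"] for order in batch)

def most_popular(search_orders, product_ids):
    """Return the product ID with the most orders once full counts are available."""
    counts = Counter({pid: count_all_orders(search_orders, pid) for pid in product_ids})
    return counts.most_common(1)[0][0]
```

Gemini 3.1 Pro's parallel divide-and-conquer variant follows the same idea, but splits the date range and counts the halves concurrently instead of walking one window at a time.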

  • Everyone's building $100M "agentic" models, so we built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

    Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., an RL environment) with 23 tools, 2500 entities, and enough corporate red tape to make Harvey cry.

    The best agent in the world (Opus 4.6! 👑) barely scored 30%. The #2 model (GPT-5.2 🥈) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another one (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at 🥉)

    The good news? We trained a model on this chaos and it got better at its job - even translating those skills to other benchmarks (e.g., +7.4% on Tau2-Bench Retail).

    Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!
    Blog post: https://lnkd.in/g4igNKCX
    Paper: https://lnkd.in/gkcVSH_v
    Leaderboard: https://lnkd.in/gN_8wt3s

  • Surge AI reposted this

    Everyone's building $100M "agentic" models, so Surge AI built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

    Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., an RL environment) with 23 tools, 2500 entities, and enough corporate red tape to make Harvey cry.

    The best agent in the world (Opus 4.6! 👑) barely scored 30%. The #2 model (GPT-5.2 🥈) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another one (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at 🥉)

    My favorite: GPT-5.2 spent 11 tool calls curating a promotional email to help a customer reach Platinum tier... a tier she was already in. "Here are 3 items over $0 you can buy!"

    "We would obviously never run ads in the way Anthropic depicts them..." -- thanks Sam.

    The good news? We trained a model on this chaos and it got better at its job - even translating those skills to other benchmarks (e.g., +7.4% on Tau2-Bench Retail).

    Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!
    Blog post: https://lnkd.in/eE_r55J7
    Paper: https://lnkd.in/e6jbDpcv
    Leaderboard: https://lnkd.in/eJ2w8CYV

  • We've finally done it. Forbes just ranked our CEO *54* spots above Taylor Swift on their America's Greatest Innovators list. https://lnkd.in/eAKa9x5w

    While we're honored that Forbes thinks Edwin's strategy is more innovative than a 10-minute song about a scarf, we want to clarify a few things:
    1. We will NOT be releasing our next benchmark as a limited-edition vinyl variant.
    2. Jake was great in Zodiac.
    3. We aren't saying we're better at songwriting, but we *are* saying we've never seen Taylor build an RL environment.

    See you at next year's Grammys.

  • We put Opus 4.6 through our Hemingway-bench Writing Leaderboard. How did it fare? In short, Claude continues to crush GPT-5.2, but lags behind the Geminis.

    The new writing hierarchy:
    👑 Gemini 3 Flash
    🥈 Gemini 3 Pro
    🥉 Opus 4.6 (New!)
    4️⃣ Opus 4.5
    5️⃣ GPT-5.2 Chat

    For example: one Hemingway-bench prompt requests a cryptic Instagram post for casting auditions.
    GPT-5.2: "Casting call? Never heard of her." (??? 💀)
    Opus 4.6: "Currently accepting applications for professional liars, dramatic criers, and people who can walk through a door convincingly on the first take. You know who you are."

    Another Hemingway-bench prompt asks for an oral presentation about time management.
    GPT-5.2 writes like a LinkedIn engagement farm: "When people hear 'working from home,' they often think it means more freedom, more comfort, and maybe even more free time. And sometimes that's true. But what doesn't get talked about enough is how easily work-from-home life can get messy if you don't manage your time well." (🥱)
    Opus 4.6 feels like a charismatic creative working the room: "So... raise your hand if you've ever 'worked from home' and somehow ended up four hours into a Netflix series at 2 PM on a Tuesday. No judgment. We've all been there."

    Overall: GPT-5.2 feels like a mass-market writer; Opus has personality and soul.

    See the updated leaderboard here! https://lnkd.in/gMkpJaKA

  • "Prognosticative pastry." "A hound circling a tree, nose to bark." Believe it or not, those quotes aren't jokes. They're real outputs from SOTA models! And many leaderboards are rewarding this kind of slop with top rankings. To fix the broken state of AI evaluation, we're launching *Hemingway-bench*: a new writing leaderboard, designed for nuance and impact. Not two-second vibes and fluff. Explore the data and the full leaderboard here: Leaderboard: https://lnkd.in/gMkpJaKA Deep Dive Blog: https://lnkd.in/gF_gHsBX Why Hemingway-bench? Traditional writing benchmarks often rely on autograders or vibe checks that mistake flowery, complex, highly-formatted prose for high quality. If a model stuffs every sentence with metaphors and by-the-book transitions, it usually climbs the charts. But that isn't good writing. We took a different approach: - Expert human judges: We asked professional writers across various industries to evaluate real-world writing tasks. Not autograders and users performing two-second vibe checks. - Nuance over nonsense: We looked for genuine voice and clarity, not how many SAT words ("prognosticative"!) a model could cram into a paragraph. What we found: many popular leaderboards are easily gamed and often reward the exact traits that real readers hate. The winners of Hemingway-bench - Gemini 3 Flash, Pro, and Opus 4.5 - didn't try to win a poetry slam. They had wonderful prose, but they took the top spots because they sounded human. Their wit felt like a conversation with a naturally funny friend, not a try-hard AI. They were immersive, not pretentious. Writing often gets overlooked. But great writing can inspire us; it's also important for everything we do in our day-to-day lives, both at home and at work. We're waiting for the day an AI wins a Pulitzer - hopefully with our help. We built Hemingway-bench to make sure it gets there. Congrats Gemini and Claude for the top positions! Check out the leaderboard here: https://lnkd.in/gMkpJaKA

  • Our CEO Edwin Chen on the measurement crisis in AI development. “When you optimize for LMArena, you're basically optimizing for clickbait.” Worth the full 45-min listen 👇

    Edwin Chen

    I recently spoke on Unsupervised Learning with Jacob Effron about what we're observing across the frontier model ecosystem. The conversation clarified something I've been thinking about for a while: there's been a real divergence in what the labs are optimizing for.

    Some models optimize for user engagement and session length. Some optimize for productivity and value extraction. These aren't just product positioning differences. They're fundamental theses about what AI should be.

    So here's the concerning part: most teams don't realize they're optimizing for the wrong things until it's too late. Teams hill-climb on metrics that are easy to measure while actual capability degrades. Benchmarks go up. Real-world performance goes down. And without the right measurements in place, you make negative progress while thinking you're advancing.

    The best labs have figured this out. They've quietly abandoned public benchmarks for rigorous human evaluation with sophisticated raters. You can't improve what you don't measure correctly. That's the lesson. It sounds obvious, but in practice, it's the hardest problem in raising AI.

    Full episode:
    YouTube - https://lnkd.in/gQnVzj4W
    Spotify - https://lnkd.in/gZ2hSxU7
    Apple - https://lnkd.in/gCzPrAHe

  • Thanks to Jacob Effron and Redpoint for the conversation with Edwin on benchmarks, measurement, and what we're seeing across the frontier model ecosystem. The gap between impressive scores and actual real-world capability is wider than most realize. We're grateful for the opportunity to share our perspective!

    On the latest Unsupervised Learning, I sat down with Edwin Chen. Edwin is the founder and CEO of Surge AI, the >$1B revenue data infrastructure company behind nearly every major frontier model.

    Some favorite parts:
    ▪️ Why benchmarks make models worse
    ▪️ Why the model companies are diverging as they optimize for different objectives
    ▪️ The four rare qualities that make elite AI evaluators
    ▪️ The future of the model landscape
    ▪️ Why every company should eventually train their own models

    ➡️ Why there won't be one model
    Edwin has changed his mind on this. "There's never going to be a one size fits all solution. Every company should have a thesis on the world." Two equally intelligent models will have different personalities and biases based on what their builders believe is useful. This means companies should eventually train their own models, optimized for their specific theses rather than generic frontier lab priorities.

    ➡️ How frontier labs are diverging
    There's far more divergence than most realize. OpenAI appears to optimize for user engagement and session length, while Anthropic optimizes for productivity and value extraction. These different targets shape products, talent, and fundamental capabilities.

    ➡️ What's wrong with benchmarking today
    Edwin shared: "When you optimize for LMArena, you are basically optimizing for clickbait." Users spend 1-2 seconds voting, preferring emojis and length over accuracy. Academic benchmarks have similar issues: models improve on narrow tests while getting worse at real-world tasks. The best labs have abandoned these benchmarks for rigorous human evaluation.

    ➡️ What makes elite evaluators
    Edwin emphasizes that credentials don't predict performance: "Hemingway didn't have a PhD." Surge measures actual work through millions of daily signals.

    Full episode ⬇️
    YouTube: https://lnkd.in/dRQs2XTu
    Spotify: https://bit.ly/3YtSqBm
    Apple: https://bit.ly/48GIBpG

