Evaluating AI agents isn’t just about whether they complete a task. It’s about understanding how they fail.

In this case study, Turing built a new approach to evaluating computer-use agents using:
• 900+ structured tasks across real workflows
• Paired task design to compare correct vs. failure scenarios
• A taxonomy of failure modes to pinpoint where things break

The result is a more realistic and actionable evaluation framework. Instead of surface-level success metrics, this method:
• Distinguishes between true failures, side effects, and misunderstandings
• Captures full interaction telemetry for debugging and iteration
• Enables measurable progress on long-horizon agent performance

Better evals lead to better agents. And ultimately, more reliable AI in production.

https://lnkd.in/gjQ3BCiC
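To make the failure-mode distinction concrete, here is a minimal sketch of how a single agent run might be sorted into the three categories the post names. The post does not describe the actual taxonomy or telemetry schema, so the `RunRecord` fields and the ordering of the checks are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    TRUE_FAILURE = "true_failure"
    SIDE_EFFECT = "side_effect"
    MISUNDERSTANDING = "misunderstanding"


@dataclass
class RunRecord:
    """Hypothetical summary of one agent run (field names are assumptions)."""
    attempted_intended_task: bool   # did the agent pursue the task as specified?
    goal_state_reached: bool        # did the environment end in the target state?
    unintended_changes: list[str] = field(default_factory=list)


def classify(run: RunRecord) -> Outcome:
    """Assign one outcome label; the check order is an illustrative choice."""
    if not run.attempted_intended_task:
        return Outcome.MISUNDERSTANDING
    if not run.goal_state_reached:
        return Outcome.TRUE_FAILURE
    if run.unintended_changes:
        return Outcome.SIDE_EFFECT
    return Outcome.SUCCESS
```

For example, `classify(RunRecord(True, True, ["deleted unrelated file"]))` yields `Outcome.SIDE_EFFECT`: the goal was reached, but the run changed something it should not have.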
Turing
Technology, Information and Internet
San Francisco, California 1,803,104 followers
Accelerating Superintelligence
About us
Turing is one of the world’s fastest-growing AI companies, accelerating the advancement and deployment of powerful AI systems. Turing helps customers in two ways: working with the world’s leading AI labs to advance frontier model capabilities in thinking, reasoning, coding, agentic behavior, multimodality, multilinguality, STEM, and frontier knowledge; and leveraging that work to build real-world AI systems that solve mission-critical priorities for companies. Powering this growth is Turing’s talent cloud, an AI-vetted pool of 4M+ software engineers, data scientists, and STEM experts who can train models and build AI applications. All of this is orchestrated by ALAN, our AI-powered platform for matching and managing talent, and generating high-quality human and synthetic data to improve model performance. ALAN also accelerates workflows for model and agent evals, supervised fine-tuning, reinforcement learning, reinforcement learning from human feedback, preference-pair generation, benchmarking, data capture for pre-training, post-training, and building AI applications. Turing, based in San Francisco, California, was named #1 on The Information’s annual list of “Top 50 Most Promising B2B Companies” and has been profiled by Fast Company, TechCrunch, Reuters, Semafor, VentureBeat, Entrepreneur, CNBC, Forbes, and many others. Turing’s leadership team includes AI technologists from Meta, Google, Microsoft, Apple, Amazon, X, Stanford, Caltech, and MIT.
- Website: http://turing.com/s/wY0xCJ
- Industry: Technology, Information and Internet
- Company size: 1,001-5,000 employees
- Headquarters: San Francisco, California
- Type: Privately Held
- Founded: 2018
- Specialties: B2B, AI, Machine Learning, Hire Developers, AI Services, Tech Services, LLM Trainer Services, AGI Infrastructure, and AI Agents
Locations
- Primary: 548 Market St, San Francisco, California 94104, US
Updates
-
In our latest case study we evaluated 1,600+ AI-generated videos. The biggest problem was not model quality. It was evaluation.

Key results:
- 90% annotator agreement
- 100% first-pass acceptance
- 80% success separating near-identical videos

Most teams are still judging video outputs as a single “looks good” decision. That breaks at scale. We fixed it by separating what actually matters:
- Caption alignment scored at the element level
- Fidelity measured through physics and motion realism
- Visual quality evaluated independently

No mixing signals. No subjective shortcuts. This is what made consistent evaluation possible and removed annotator drift.

If your evaluation framework cannot distinguish between alignment, realism, and quality, your model improvements are guesswork. Evaluation design is now a competitive advantage.

To learn more and access additional case studies: https://lnkd.in/gMRrnbZZ
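One way to see why separating the signals matters is to compute agreement per dimension instead of on a single overall verdict. The sketch below uses simple pairwise percent agreement. The post does not say which agreement statistic was used, so the metric, the dimension names, and the label values are all assumptions.

```python
from itertools import combinations


def pairwise_agreement(labels: list[str]) -> float:
    """Fraction of annotator pairs that gave the same label to one item."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need labels from at least two annotators")
    return sum(a == b for a, b in pairs) / len(pairs)


def agreement_by_dimension(ratings: dict[str, list[str]]) -> dict[str, float]:
    """Score each evaluation dimension separately (dimension names are hypothetical)."""
    return {dim: pairwise_agreement(labels) for dim, labels in ratings.items()}


# Hypothetical ratings for one video from three annotators.
example = {
    "caption_alignment": ["pass", "pass", "pass"],
    "fidelity": ["pass", "pass", "fail"],
    "visual_quality": ["fail", "fail", "fail"],
}
per_dimension = agreement_by_dimension(example)
```

Here the annotators agree fully on caption alignment and visual quality and disagree only on fidelity, which a single overall rating would not reveal.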
-
-
A major step toward the agentic era of AI. Turing and Saudi-based HUMAIN have partnered to launch the first marketplace for enterprise-grade AI agents, announced yesterday at the FII Institute Priority Summit in Miami. Built on Humain One, the platform will enable businesses to discover, deploy, and scale AI agents across functions like HR, finance, legal, and operations. Not just software that supports work, but systems that execute it. Backed by the Public Investment Fund (PIF), Humain brings AI infrastructure and orchestration. Turing contributes deep expertise in model evaluation, fine-tuning, and reasoning systems. The vision goes further. A shared marketplace where developers can publish and monetize AI agents, unlocking a new layer of the AI economy. As Turing becomes the first US-based customer of Humain One, this partnership signals something bigger: Saudi Arabia’s emergence as a global builder of AI, not just a consumer. It also reflects the country’s broader push to position itself as a global AI hub, with the Public Investment Fund investing heavily in AI infrastructure and platforms as part of Vision 2030. The next generation of enterprise software is autonomous, scalable, and already taking shape. Read more in Fast Company Middle East. https://lnkd.in/gNYa5gRh
-
Turing reposted this
Six months ago, this Claude Code workshop program didn't exist. What started as a conversation with Anthropic last November has grown into a full program. Producing these sessions with our tech leads and marketing team has been one of the most rewarding things I've done at Turing - a moment to work through real enterprise adoption challenges and feel the AI industry moving forward. I'm proud to continue our partnership with Anthropic and bring agentic coding to our clients across North America. More cities ahead. Reach out if your team wants to be in the room.
We spent yesterday in NYC with a room full of engineers and AI leaders from enterprise teams, getting hands-on with Claude Code and what agentic development actually looks like in practice.

A few themes that stood out:
- The shift from copilots to agents is real. Developers are moving from writing code line by line to delegating multi-step tasks across entire codebases.
- Workflows matter more than prompts. Teams that succeed aren’t just “prompting better,” they’re designing structured workflows with context, memory, and reusable capabilities.
- Skills > prompts. One of the biggest unlocks: turning one-off prompts into persistent, reusable workflows that scale across teams.
- Governance is the bottleneck. AI adoption is no longer the question. Operationalizing it with the right guardrails, review loops, and SDLC integration is where most teams struggle.
- Spec-driven development is emerging as the path forward. Moving from “prompt, paste, pray” to structured, spec-first workflows is what enables speed without accumulating technical debt.

We also spent time building live, going from idea → implementation → deployment using Claude Code in real dev environments. Great energy in the room and strong participation from teams actively figuring out how to move from experimentation to production.

Thanks to everyone who joined us in NYC. More to come.

Explore how we’re working with Anthropic → https://lnkd.in/gnKNcftK
-
-
We are excited to announce a strategic partnership between Turing and HUMAIN to build the world’s first enterprise-scale AI Agent Marketplace on HUMAIN ONE. This collaboration brings together HUMAIN’s AI operating system and infrastructure with Turing’s expertise in frontier AI systems, evaluation, and deployment to unlock a new era of enterprise intelligence.

The HUMAIN ONE AI Agent Marketplace will enable organizations to:
• Discover and deploy AI agents across every business function
• Scale intelligent workflows across HR, finance, legal, operations, and beyond
• Build and monetize enterprise-ready AI agents in a secure, governed environment

“Superintelligence should not remain abstract. It should deliver productivity, increase ease of use, and unleash humanity’s untapped potential.”
- Jonathan Siddharth, CEO and Co-Founder of Turing

Together, we are accelerating the shift from traditional software to agent-driven organizations, where AI not only supports work but executes it. This partnership also marks an important milestone in advancing superintelligence from concept to real-world impact, across the Kingdom of Saudi Arabia and globally. By combining advanced AI systems with human judgment and expertise, Turing and HUMAIN aim to unlock new levels of productivity, accelerate innovation, and drive long-term economic growth.

Learn more about how we are shaping the next generation of AI infrastructure and innovation: https://lnkd.in/gWuCueHD

CC Tareq Amin, Nitin Sathawane, Saejong Lee
-
-
CASE STUDY: 2,000+ scientific coding tasks. Verified answers. Built for real-world science, not toy benchmarks.

Turing delivered a research-grade STEM dataset spanning physics, chemistry, mathematics, and biology, designed specifically for frontier model training on benchmarks like SciCode.

What makes this different:
- 2,000+ tasks requiring Python-based problem solving
- Problems intentionally impractical to solve by hand in under a day
- Ground truth answers with strict precision and tolerance handling
- Closed-ended, self-contained questions with stable grading

Built for real scientific workflows, not just coding exercises.

To ensure quality at scale, we implemented a 5-stage pipeline:
- Requirement-anchored dataset design to eliminate ambiguity
- Expert-authored problems requiring computational workflows
- Taxonomy-driven diversity controls across domains and subdomains
- Multi-stage QA with agentic checks, L1 prompt review, and dual-validator L2 scientific validation
- Client-platform trialing with pass@k filtering to enforce meaningful difficulty bands

The result:
✔ Production-ready dataset for vertical AI models
✔ Verified, consensus-backed scientific accuracy
✔ Robust evaluation with reduced grading noise
✔ Tasks optimized to avoid “always pass” and “always fail” outcomes

If you want models that can reason through real scientific workflows, the training data has to match that complexity.

See the full case study: https://lnkd.in/g9v_wbgE
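Two of the ideas above, tolerance-aware grading and pass@k difficulty filtering, can be sketched in a few lines. The post gives no tolerances, sample counts, or band limits, so every number below is a placeholder. The pass@k formula is the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), which may differ from what the case study actually used.

```python
import math


def is_correct(predicted: float, truth: float,
               rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """Grade a numeric answer against ground truth within a tolerance.

    The tolerance values are placeholders, not the case study's settings.
    """
    return math.isclose(predicted, truth, rel_tol=rel_tol, abs_tol=abs_tol)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which were correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every possible draw of k attempts contains at least one correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def in_difficulty_band(n: int, c: int, k: int,
                       low: float = 0.05, high: float = 0.95) -> bool:
    """Keep a task only if it is neither always-fail nor always-pass.

    The band limits are hypothetical.
    """
    return low <= pass_at_k(n, c, k) <= high
```

With 10 attempts and `k=1`, a task solved 0 or 10 times falls outside the band and would be filtered out, while one solved 5 times stays in.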
-
-
AI is moving fast. The gap between research and production is closing even faster.

Join us in San Francisco with a curated group of AI researchers, product leaders, and enterprise builders for an evening of conversation, drinks, and real-world insights. Spots are limited.

When: Tuesday, April 7, 6:00 PM - 9:00 PM PDT
Where: San Francisco, California
Register now: https://luma.com/49emqfjw
-
-
Turing has been named one of Fast Company's Most Innovative Artificial Intelligence Companies of 2026! The recognition comes at a defining moment for AI. Bigger models. More data. Greater compute. Now paired with AI coding tools that are helping build the next generation of systems. Proud to be shaping what comes next! See the complete list: https://lnkd.in/e_cM4yBC
-
Enterprises are deploying AI in workflows where mistakes are costly, visible, and regulated. The constraint is not model capability. It is accountability.

Autonomous-first systems break down when decisions cannot be explained, traced, or reproduced. Hallucinations slip through, drift introduces risk, and audit gaps surface under scrutiny. “The model said so” is not defensible.

The difference between pilot and production is architectural. Leading teams are adopting human-guided AI by design:
- Confidence-based gating and risk-tiered decisions
- Structured human review for low-confidence or high-risk cases
- Deterministic validation before execution
- Full traceability across every step

This is not about slowing automation. It is about applying human judgment where risk is highest.

The result is partial autonomy: Automation scales routine work. Humans resolve ambiguity and edge cases. Every outcome is explainable and auditable.

Governance must be built in from the start. That's how AI systems hold up in real-world, regulated environments.

Frontier lab insight. Enterprise-ready AI. Talk to a Turing specialist below. https://lnkd.in/gHTpmHZc
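The gating pattern in the list above can be sketched as a small routing function. This is an illustration under stated assumptions, not Turing's implementation: the tier names, the threshold values, and the `gate` function itself are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO = "auto_execute"
    REVIEW = "human_review"


# Hypothetical minimum confidence per risk tier; higher risk demands more confidence.
THRESHOLDS = {"low": 0.70, "medium": 0.85, "high": 0.95}


@dataclass(frozen=True)
class Decision:
    route: Route
    reason: str  # recorded so every outcome can be traced and audited


def gate(confidence: float, risk_tier: str, passed_validation: bool) -> Decision:
    """Route one model output to automatic execution or human review."""
    if risk_tier not in THRESHOLDS:
        raise ValueError(f"unknown risk tier: {risk_tier!r}")
    # Deterministic validation runs before any confidence check.
    if not passed_validation:
        return Decision(Route.REVIEW, "failed deterministic validation")
    threshold = THRESHOLDS[risk_tier]
    if confidence < threshold:
        return Decision(
            Route.REVIEW,
            f"confidence {confidence:.2f} below {threshold:.2f} for {risk_tier} risk",
        )
    return Decision(
        Route.AUTO,
        f"confidence {confidence:.2f} meets {threshold:.2f} for {risk_tier} risk",
    )
```

The same confidence of 0.90 is executed automatically in the low-risk tier but sent to a reviewer in the high-risk tier, and the `reason` string gives each decision an audit trail.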