AI Language Model Benchmarks

Explore top LinkedIn content from expert professionals.

Summary

AI language model benchmarks are standardized tests used to measure and compare how well artificial intelligence models perform tasks like writing code, planning actions, and understanding complex information. These benchmarks help researchers and businesses evaluate AI capabilities and decide which models are best suited for specific real-world needs.

  • Review multiple benchmarks: Compare models across diverse tests to understand their strengths in areas such as coding, planning, and general reasoning.
  • Check real-world task scores: Look for models that perform well on benchmarks designed to mimic practical tasks, which can indicate readiness for deployment.
  • Consider cost-accuracy tradeoffs: Assess both the accuracy and price of AI models, since the most accurate model may not always be the best value for your project.
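The cost-accuracy advice above can be made concrete with a small sketch: given hypothetical (accuracy, cost) figures, the most accurate model and the best-value model need not coincide. All model names and numbers below are made up for illustration:

```python
# Hypothetical figures, for illustration only: accuracy in %, cost in $ per run.
MODELS = {
    "model_a": (62.0, 3.40),
    "model_b": (58.0, 0.90),
    "model_c": (41.0, 0.30),
}

def most_accurate(models):
    """Return the model with the highest raw accuracy."""
    return max(models, key=lambda name: models[name][0])

def best_value(models):
    """Return the model with the highest accuracy per dollar spent."""
    return max(models, key=lambda name: models[name][0] / models[name][1])
```

Here `model_a` wins on accuracy, but `model_c` delivers far more accuracy per dollar, which is exactly the tradeoff the bullet above warns about.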
Summarized by AI based on LinkedIn member posts
  • View profile for Sayash Kapoor

    CS Ph.D. Candidate at Princeton University and Senior Fellow at Mozilla

    11,574 followers

    How does GPT-5 compare against Claude Opus 4.1 on agentic tasks? Since their release, we have been evaluating these models on challenging tasks. Headline result: while cost-effective, GPT-5 has so far never topped an agentic leaderboard. We have evaluated 100+ agents across 8 benchmarks. The code is open source, and all logs are available online: hal.cs.princeton.edu.

    Highlights:

    1) CORE-Bench (scientific reproducibility) gives agents two hours to reproduce the results of a scientific paper, given access to its code and data. Opus 4.1 is the first model to break the 50% barrier on CORE-Bench. GPT-5 is far behind, trailing even Sonnet 3.7 and GPT-4.1. We've heard that AI will soon automate all of science. Reproducing results is a small part of science, yet even the best models are far from scoring well. Still, if AI agents can reproduce existing work, we suspect it would save millions of researcher-hours of effort, so even a 50% CORE-Bench Hard score is exciting.

    2) SciCode (scientific coding) consists of 65 challenging coding problems from math, physics, chemistry, biology, and materials science. This is the benchmark in HAL with the poorest overall accuracy: o3 is the best model, ahead of both Opus 4.1 and GPT-5, but it scores less than 10%.

    3) AssistantBench (web) consists of 214 web assistance tasks, of which 33 are in a public validation set, which we use for HAL. Claude Opus 4.1 performs surprisingly poorly, coming in below Sonnet 3.7 and o4-mini. o3 narrowly edges out GPT-5 Medium at almost twice the cost.

    4) TauBench consists of customer service tasks. Anthropic models occupy the top three spots on the leaderboard. As on many other benchmarks, GPT-5 is outperformed by other OpenAI models, in this case o4-mini and GPT-4.1.

    Many of these results surprised us, and we plan to investigate them more closely. But the trends across these benchmarks confirm that GPT-5 is not a step change and does not improve upon OpenAI's other models. It does shine in cost-accuracy tradeoffs, though, often coming in much cheaper than comparable models.

    The website (hal.cs.princeton.edu) has a lot more analysis. We are actively improving HAL, building automated agent monitoring, and expanding the benchmarks we evaluate on. HAL is a project with Arvind Narayanan, Benedikt Stroebl, Peter Kirgis, Franck S Ndzomga, Boyi Wei, Zachary S. Siegel, Tianci Xue, Huan Sun, Yu Su, Harsh Trivedi, and many others. We are grateful to many people for feedback, including Rishi Bommasani, Yifan Mai, and Percy Liang. We're also evaluating many other challenging agent benchmarks, including AppWorld, Online Mind2Web, ScienceAgentBench, SWE-Bench, GAIA, ColBench, and USACO, across the slate of models. Watch this space or follow along at hal.cs.princeton.edu

  • How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science? In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.

    Why is this benchmark important? Right now it is unclear how effective AI is at accelerating or automating real-world work. We hear statements like:
    > AI is overhyped, doesn't reason, and doesn't generalize to new tasks
    > AGI will automate all human work in the next few years
    This question has implications for:
    - Companies: to understand where to incorporate AI in workflows
    - Workers: to get a grounded sense of what AI can and cannot do
    - Policymakers: to understand the effects of AI on the labor market

    How can we begin to answer it? In TheAgentCompany, we created a simulated software company with tasks inspired by real-world work. We created baseline agents and evaluated their ability to solve these tasks. This benchmark is the first of its kind in the versatility, practicality, and realism of its tasks.

    TheAgentCompany features four internal web sites:
    - GitLab: for storing source code (like GitHub)
    - Plane: for task management (like Jira)
    - OwnCloud: for storing company docs (like Google Drive)
    - RocketChat: for chatting with co-workers (like Slack)

    Based on these sites, we created 175 tasks in the domains of administration, data science, software development, human resources, project management, and finance. We implemented a baseline agent that can browse the web and write/execute code to solve these tasks, built on the open-source OpenHands framework for full reproducibility (https://lnkd.in/g4VhSi9a). With this agent, we evaluated many LMs: Claude, Gemini, GPT-4o, Nova, Llama, and Qwen, measuring both success and cost.

    The results are striking: the most successful agent, using Claude, solved 24% of the diverse real-world tasks it was given. Gemini-2.0-flash is strong at a competitive price point, and the open Llama-3.3-70B model is remarkably competent. This paints a nuanced picture of the role of current AI agents in task automation:
    - Yes, they are powerful, and can perform 24% of tasks similar to those in real-world work
    - No, they cannot yet solve all tasks or replace any jobs entirely
    Further, there are many caveats to our evaluation:
    - It all runs on simulated data
    - We focused on concrete, easily evaluable tasks
    - We focused only on tasks from one corner of the digital economy

    If TheAgentCompany interests you, please:
    - Read the paper: https://lnkd.in/gyQE-xZG
    - Visit the site to see the leaderboard or run your own eval: https://lnkd.in/gtBcmq87
    And huge thanks to Fangzheng (Frank) Xu, Yufan S., and Boxuan Li for leading the project, and the many co-authors for their tireless efforts over many months to make this happen.
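An evaluation like this boils down to aggregating per-task outcomes into a success rate and a total cost per model. A minimal sketch of that bookkeeping; the records below are made up, not TheAgentCompany's data:

```python
from collections import defaultdict

# Hypothetical per-task records: (model name, task solved?, cost in $).
RESULTS = [
    ("claude", True, 0.50), ("claude", False, 0.40), ("claude", True, 0.60),
    ("llama-3.3-70b", True, 0.05), ("llama-3.3-70b", False, 0.04),
]

def summarize(results):
    """Aggregate per-model success rate and total cost from task records."""
    solved = defaultdict(int)
    total = defaultdict(int)
    cost = defaultdict(float)
    for model, ok, c in results:
        total[model] += 1
        solved[model] += ok          # True counts as 1
        cost[model] += c
    return {m: {"success_rate": solved[m] / total[m],
                "total_cost": round(cost[m], 2)}
            for m in total}
```

Reporting both numbers side by side is what lets a cheaper, slightly-less-accurate model stand out, as the post notes for Gemini-2.0-flash.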

  • View profile for Rahul Pandey
    Rahul Pandey is an Influencer

    GM of Coding, Handshake. Founder at Taro. Prev Meta, Stanford, Pinterest

    138,301 followers

    I spent 10 hours understanding LLM benchmarks for software engineering. Here's what I learned:

    - Oct 2023: SWE-Bench is released by researchers from Princeton and Stanford. It evaluates how LLMs perform on 2,300 real-world issues from GitHub repositories, shifting away from interview or contest problems, which are contrived and easy to solve.
    - Aug 2024: SWE-Bench Verified is introduced by OpenAI. This is a human-reviewed subset of 500 SWE-Bench issues that are actually solvable; many issues in the original SWE-Bench were impossible without additional context.
    - Dec 2024: LMSYS WebDev Arena is launched by researchers at UC Berkeley. This is a platform for human-preference evals, where thousands of users vote on which LLMs perform best in web-dev challenges through pairwise comparisons.
    - Feb 2025: SWE-Lancer is introduced by OpenAI: a benchmark of 1,400 freelance SWE tasks from Upwork, with a total value of $1 million 💰 This captures the effectiveness of AI at economically valuable work.
    - May 2025: SWE-Bench Multilingual is introduced to address an obvious deficiency in the original SWE-Bench: it only used Python! This benchmark has 300 tasks across 9 programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust.

    We still have a long way to go before LLMs can match the performance of the best human software engineers. For example, the best models are only hitting a 70% pass rate on SWE-Bench Verified, and AI still can't resolve a meaningful percentage of bugs/features in large repositories. Moreover, LLM evaluation is heavily biased toward Python and web development (HTML, CSS, and JavaScript); performance in other languages (like Kotlin and Swift for all of us mobile devs 📱) is much worse. Crazy how fast this space moves ⏳ but I also realized the disconnect between what the benchmarks measure and what most developers do every day.
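A SWE-Bench-style score is, at its core, the fraction of issues whose generated patch makes the repository's tests pass. A minimal sketch with invented results (the issue IDs below are hypothetical, not real SWE-Bench instances):

```python
# Hypothetical evaluation records: issue id -> did the model's patch pass the repo's tests?
VERIFIED_RESULTS = {
    "django-1001": True,
    "sympy-2002": False,
    "flask-3003": True,
    "requests-4004": True,
}

def resolved_rate(results):
    """SWE-Bench-style score: fraction of issues the model's patch resolves."""
    return sum(results.values()) / len(results)
```

The hard part of the real benchmark is not this arithmetic but reliably running each repo's test suite against each generated patch in isolation.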

  • View profile for Terezija Semenski, MSc

    Helping 300,000+ people master AI and Math fundamentals faster | LinkedIn [in]structor 15 courses | Author @ Math Mindset newsletter

    30,708 followers

    Everyone's playing with AI agents, from coding to customer service. Meanwhile, 31 researchers at Princeton have been working hard this year on infrastructure for fair agent evaluations on challenging benchmarks. This paper, "Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation", summarizes insights from 20,000+ agent rollouts on 9 challenging benchmarks spanning web, coding, science, and customer service tasks. The team evaluated 9 models across 4 areas (9 benchmarks), with 1-2 scaffolds per benchmark, totaling over 20,000 rollouts. This includes:
    1️⃣ coding (USACO, SWE-Bench Verified Mini)
    2️⃣ web (Online Mind2Web, AssistantBench, GAIA)
    3️⃣ science (CORE-Bench, ScienceAgentBench, SciCode)
    4️⃣ customer service tasks (TauBench)

    This analysis uncovered many interesting insights:
    1) Higher reasoning effort does not lead to better accuracy in the majority of cases: where the authors ran the same model at different reasoning efforts (o4-mini, Claude 3.7, Claude 4.1), higher reasoning did not improve accuracy in 21 of 36 cases.
    2) Agents often take shortcuts rather than solving the task correctly: for example, to solve scientific reproduction tasks, agents would grep the Jupyter notebook and hard-code their guesses rather than reproducing the work; on web tasks, web agents would look up the benchmark on Hugging Face.
    3) Agents take actions that are extremely costly in deployment: on flight-booking tasks in TauBench, agents booked flights from the wrong airport, refunded users more than necessary, and charged the wrong credit card.
    4) The researchers analyzed the tradeoffs between cost and accuracy: the red dotted line in their plot is the Pareto frontier, capturing the models with the best accuracy at a given budget. The 3 models most commonly on the frontier are Gemini 2.0 Flash (7 of 9 benchmarks), GPT-5 (4 of 9), and o4-mini Low (4 of 9). Surprisingly, the most expensive model (Opus 4.1) makes the frontier on only 1 of 9 benchmarks.
    5) The most token-efficient models are not the cheapest: comparing token count vs. accuracy, Opus 4.1 is on the Pareto frontier for 3 benchmarks. This is important because providers change model prices frequently.
    6) They used TransluceAI Docent to log all the agent behaviors and analyze them: it uses LLMs to uncover specific actions the agent took. The team then conducted a systematic analysis of agent logs on 3 benchmarks: AssistantBench, SciCode, and CORE-Bench. This analysis allowed the research team to spot agents taking shortcuts and costly reliability failures.

    Forget the hype. Read this paper before building. Paper link in comments. What's stopping you from building agents that actually ship?
    ♻️ Repost to help someone skip the expensive mistakes
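The cost-accuracy Pareto frontier described above can be computed in a few lines: a model is on the frontier if no other model is both at least as cheap and at least as accurate. The figures below are made up for illustration, not HAL results:

```python
def pareto_frontier(models):
    """Return the models not dominated by any other model.
    `models` maps name -> (cost, accuracy); lower cost and higher accuracy are better."""
    frontier = []
    for name, (cost, acc) in models.items():
        dominated = any(
            o_cost <= cost and o_acc >= acc and (o_cost, o_acc) != (cost, acc)
            for other, (o_cost, o_acc) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Hypothetical figures: cost per run in $, accuracy in %.
MODELS = {
    "flash": (0.10, 45.0),
    "mini":  (0.50, 55.0),
    "opus":  (5.00, 54.0),   # pricier yet less accurate than "mini": dominated
}
```

In this toy example the expensive model drops off the frontier, mirroring the paper's observation about Opus 4.1 on dollar cost.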

  • View profile for Jean Ng 🟢

    AI Changemaker | Global Top 20 Creator in AI Safety & Tech Ethics | Corporate Trainer | The AI Collective Leader, Kuala Lumpur Chapter

    42,011 followers

    Ever wondered how the smartest AI models, like the ones you see online, get their grades? They don't just get one score: they get three main indices that tell us exactly what they excel at. Think of these indices as a high-tech report card:

    1. Artificial Analysis Intelligence Index (the overall GPA). This is the grand measure of general AI intelligence. It's incredibly comprehensive, incorporating 10 distinct evaluations. It includes benchmarks like MMLU-Pro, GPQA Diamond, and Humanity's Last Exam, covering a wide range of knowledge and complex reasoning.

    2. Artificial Analysis Coding Index (the programming grade). This index zeros in on the AI's programming talents. It represents the average of specific coding benchmarks that test how well the AI can write and solve code problems. The scores are calculated from three key tests: LiveCodeBench, SciCode, and Terminal-Bench Hard.

    3. Artificial Analysis Agentic Index (the planning and action grade). This score measures the AI's ability to act or plan to achieve goals: its "agentic capabilities". It is calculated by averaging the results of crucial benchmarks like τ²-Bench Telecom and Terminal-Bench Hard.

    These three indices give experts a clear, defined way to measure an AI's brainpower, coding skills, and ability to execute tasks! Knowing these metrics helps us understand exactly where AI is advancing fastest. The Intelligence, Coding, and Agentic Indices represent a standardised methodology for benchmarking AI performance, often published in technical analyses and research papers. They offer a consolidated, objective view of model capabilities, moving beyond simple, single-task metrics.

    Is this one of the factors to consider when choosing an AI platform? Absolutely. If you are choosing an AI platform for serious deployment or specialised work, these indices provide the granular detail that helps predict real-world performance. You wouldn't choose a car solely based on horsepower; similarly, you shouldn't choose an AI solely based on a single chat demo. Agree?

    Source: Artificial Analysis
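Per the description above, each index is an average of its component benchmark scores. A minimal sketch of the Coding Index computation with made-up scores; the exact aggregation Artificial Analysis uses (e.g. any normalization) may differ:

```python
from statistics import mean

# Hypothetical benchmark scores for one model, on a 0-100 scale. The numbers are invented.
SCORES = {
    "LiveCodeBench": 60.0,
    "SciCode": 30.0,
    "Terminal-Bench Hard": 45.0,
}

CODING_BENCHMARKS = ("LiveCodeBench", "SciCode", "Terminal-Bench Hard")

def coding_index(scores):
    """Coding index as the unweighted mean of the coding benchmark scores."""
    return mean(scores[b] for b in CODING_BENCHMARKS)
```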

  • View profile for José Manuel de la Chica
    José Manuel de la Chica is an Influencer

    Head of Global AI Lab at Santander | AI Research Leader

    15,660 followers

    Traditional AI benchmarks often fail to capture how language models actually perform in the real world. Now, the Inclusion Arena project, introduced by Inclusion AI with backing from Ant Group, takes a new approach: ranking LLMs and MLLMs through real user preferences collected in live applications. By applying the Bradley-Terry statistical model to millions of paired comparisons, it generates more reliable, production-oriented insights. For enterprises and developers, this matters: choosing the right model is no longer about excelling in academic benchmarks, but about delivering value in real interactions. https://lnkd.in/dS8z59MH
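The Bradley-Terry model mentioned above ranks models from pairwise outcomes by fitting a latent "strength" per model. A minimal sketch using the standard MM (minorization-maximization) update, on a made-up win matrix rather than Inclusion Arena data:

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.
    wins[i][j] = number of times model i beat model j.
    Uses the classic MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])                       # total wins of model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x * n / s for x in new_p]               # normalize (strengths are scale-free)
    return p

# Toy data: model 0 usually beats model 1, and both usually beat model 2.
WINS = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
```

With millions of real user comparisons instead of this toy matrix, the fitted strengths give the production-oriented ranking the post describes.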

  • View profile for Samuel G. Rodriques

    Building an AI Scientist | Time 100 in AI 2025

    13,397 followers

    Yesterday, we released a major update to LAB-Bench, our benchmark for language agents in science. Here are the results, including Opus 4.6. Overall, OpenAI is in the lead right now. This appears mostly attributable to better tool use and retrieval, rather than reasoning. Gemini and Opus 4.6 match GPT 5.2 on reasoning about biological protocols, for example, but GPT 5.2 beats both Gemini and Opus by 40 points or more on answering questions about patents with tool use. Opus 4.6 shows its largest improvement over Opus 4.5 on our paper-retrieval task, though, suggesting that Anthropic may be making a push on that front.

    There is still a lot of room for improvement. None of the models can reliably access supplementary information or external datasets right now with their standard tool-use harnesses, although Gemini is the best on dataset access. They also all struggle in a big way on FigQA2, which measures the ability to reason about figures in the context of a paper. The new benchmark, LAB-Bench2, evaluates agents in more realistic settings and on a broader diversity of challenges. See the benchmark here: https://lab-bench.ai/ and the paper: paper.lab-bench.ai

  • View profile for Evan Benjamin

    I create AI Nuggets and teach AI safety.

    7,867 followers

    Everyone loves MCP, but has anyone benchmarked AI agents and evaluated LLMs in real-world scenarios to highlight how challenging real-world MCP server interactions are? Salesforce AI Research just did. Meet MCP-Universe: the first comprehensive benchmark specifically designed to evaluate LLMs on realistic, hard tasks through interaction with real-world MCP servers. Even state-of-the-art models show significant limitations in real-world MCP interactions:
    🥇 GPT-5: 43.72% success rate
    🥈 Grok-4: 33.33% success rate
    🥉 Claude-4.0-Sonnet: 29.44% success rate

    Key findings you need to know:
    1️⃣ In the MCP-Universe benchmark, long-context handling poses a significant challenge for LLM agents, particularly in the Location Navigation, Browser Automation, and Financial Analysis domains. These domains frequently require agents to process and reason over lengthy sequences of observations that often exceed the context window limits of many models.
    2️⃣ LLMs often struggle to correctly use the tools provided by MCP servers, indicating a lack of familiarity with their interfaces and constraints. For example, the Yahoo Finance MCP server requires a start and end date that differ when retrieving a stock price, yet LLMs frequently set them to be identical, leading to execution errors.
    3️⃣ You need optimal agent-model pairing to maximize performance on complex tasks. You might like enterprise-level agents like Cursor for specific domains, but Cursor doesn't outperform simpler frameworks like ReAct across the board: it demonstrates superior performance in Browser Automation but underperforms in Web Searching. How would we know that without benchmarking?

    Don't just fall for MCP hype. Start benchmarking and evaluating. Thank you, Salesforce AI Research, for this valuable MCP benchmarking tool.
    ✅ Start here: https://lnkd.in/ekZEBads
    ✅ Then go here: https://lnkd.in/ejPWmVbi
    ✅ Read the Paper: https://lnkd.in/ea9z32au
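The Yahoo Finance failure mode above (identical start and end dates) is the kind of error a small argument guard in the agent scaffold can catch before the tool call ever fires. A hypothetical sketch; the real server's interface and requirements may differ:

```python
from datetime import date, timedelta

def validate_price_range(start: date, end: date):
    """Guard for a (hypothetical) stock-price tool that requires start < end.
    Widens a degenerate range instead of letting the tool call fail."""
    if end <= start:
        end = start + timedelta(days=1)   # give the range a non-zero span
    return start, end
```

Validating arguments locally like this is cheaper than burning an agent step on a failed tool call and its error message.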

  • View profile for Heena Purohit

    Director, AI Startups @ Microsoft | Top AI Voice | Keynote Speaker | Helping Enterprise Teams Navigate AI Innovation | EB1A “Einstein Visa” Recipient | Responsible AI Advocate

    24,704 followers

    🚨 Agent Leaderboard v2 is here! General-purpose benchmarks only take you so far. And that's why Microsoft for Startups Pegasus startup Galileo introduced a domain-specific evaluation benchmark for AI agents. It's designed to simulate real enterprise tasks across banking, healthcare, insurance, telecom, and investments. Each scenario includes up to 8 interdependent goals, user personas, ambiguous or missing tools, and dynamic user intent, mimicking real-world complexity.

    📈 Key metrics:
    - Action Completion (AC): did the agent actually complete the user's goals?
    - Tool Selection Quality (TSQ): did it select and use the right tools at the right time?

    ⚡ Results:
    - GPT-4.1 leads with the highest AC
    - GPT-4.1-mini offers the best cost-efficiency ($0.014/session vs. GPT-4.1's $0.068)
    - Kimi K2 is the top performer among open-source models
    - Gemini 2.5 Flash tops TSQ (94%) but lags on AC (38%)
    - Grok 4 did not top any metric
    - Reasoning models generally underperform non-reasoning counterparts in action completion
    - No single model dominated all domains

    🔗 More details in the comments. Share this with someone building AI agents: they need to see this. 👇 Great job, Team Galileo! It's great to see benchmarks that are actually useful, and continually updated 👏 #AgenticAI #EnterpriseAI #AIforBusiness
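Both metrics described above reduce to simple fractions over a session log. A minimal sketch with a made-up session record; Galileo's actual log format and scoring rubric will differ:

```python
# Hypothetical session log: which user goals were completed, and which tool
# selections were correct. Both lists are invented for illustration.
SESSION = {
    "goals": [True, True, False, True],
    "tool_calls": [True, True, True, False, True],
}

def action_completion(session):
    """AC: fraction of the user's goals the agent actually completed."""
    goals = session["goals"]
    return sum(goals) / len(goals)

def tool_selection_quality(session):
    """TSQ: fraction of tool calls where the right tool was chosen."""
    calls = session["tool_calls"]
    return sum(calls) / len(calls)
```

Keeping the two scores separate is the point: an agent can pick the right tools (high TSQ) and still fail to finish the user's goals (low AC), as the Gemini 2.5 Flash result illustrates.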
