🤖 𝗪𝗲 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝗲𝗱 𝗘𝗹𝗲𝘃𝗲𝗻𝗔𝗴𝗲𝗻𝘁𝘀. 𝗛𝗲𝗿𝗲'𝘀 𝘄𝗵𝗮𝘁 𝘄𝗲 𝗳𝗼𝘂𝗻𝗱.

Building a great voice agent isn't just about a great model; it's about every layer of the stack working together under real conditions. That's why we recently released EVA-Bench, our open-source framework for evaluating voice agents end-to-end across 𝗘𝗩𝗔-𝗔 (Accuracy) and 𝗘𝗩𝗔-𝗫 (Experience).

This week, we evaluated ElevenLabs’ ElevenAgents, and here are the results:

• 𝗦𝗰𝗿𝗶𝗯𝗲 𝘃𝟮.𝟮 𝗥𝗲𝗮𝗹𝘁𝗶𝗺𝗲 𝗶𝘀 𝘁𝗵𝗲 𝘀𝘁𝗿𝗼𝗻𝗴𝗲𝘀𝘁 𝗦𝗧𝗧 𝗺𝗼𝗱𝗲𝗹 𝘄𝗲 𝘁𝗲𝘀𝘁𝗲𝗱. Transcription accuracy on key entities above 95%, holding above 93% under French accent and coffee shop background noise.
• 𝗘𝗹𝗲𝘃𝗲𝗻 𝗙𝗹𝗮𝘀𝗵 𝘃𝟮 𝘀𝗰𝗼𝗿𝗲𝗱 𝗮𝗯𝗼𝘃𝗲 𝟵𝟳% 𝗼𝗻 𝗦𝗽𝗲𝗲𝗰𝗵 𝗙𝗶𝗱𝗲𝗹𝗶𝘁𝘆 - the only metric in any end-to-end voice agent benchmark that evaluates what the agent actually says out loud.
• Of all 16 systems we benchmarked, 𝗯𝗼𝘁𝗵 𝗘𝗹𝗲𝘃𝗲𝗻𝗔𝗴𝗲𝗻𝘁𝘀 𝗰𝗼𝗻𝗳𝗶𝗴𝘂𝗿𝗮𝘁𝗶𝗼𝗻𝘀 𝗹𝗮𝗻𝗱𝗲𝗱 𝗼𝗻 𝘁𝗵𝗲 𝗣𝗮𝗿𝗲𝘁𝗼 𝗳𝗿𝗼𝗻𝘁𝗶𝗲𝗿. With Claude Haiku, the highest EVA-X score. With GPT-5.4, the highest EVA-A.

Shaheen Lavie-Rouse, FDE at ElevenLabs, said it best: "𝘌𝘝𝘈-𝘉𝘦𝘯𝘤𝘩 𝘪𝘴 𝘵𝘩𝘦 𝘧𝘪𝘳𝘴𝘵 𝘣𝘦𝘯𝘤𝘩𝘮𝘢𝘳𝘬 𝘵𝘩𝘢𝘵 𝘢𝘤𝘵𝘶𝘢𝘭𝘭𝘺 𝘮𝘦𝘢𝘴𝘶𝘳𝘦𝘴 𝘷𝘰𝘪𝘤𝘦 𝘢𝘨𝘦𝘯𝘵 𝘲𝘶𝘢𝘭𝘪𝘵𝘺 𝘦𝘯𝘥-𝘵𝘰-𝘦𝘯𝘥 - 𝘧𝘳𝘰𝘮 𝘵𝘳𝘢𝘯𝘴𝘤𝘳𝘪𝘱𝘵𝘪𝘰𝘯 𝘵𝘰 𝘵𝘢𝘴𝘬 𝘤𝘰𝘮𝘱𝘭𝘦𝘵𝘪𝘰𝘯 𝘵𝘰 𝘴𝘱𝘰𝘬𝘦𝘯 𝘰𝘶𝘵𝘱𝘶𝘵 - 𝘪𝘯 𝘢 𝘸𝘢𝘺 𝘵𝘩𝘢𝘵'𝘴 𝘢𝘤𝘵𝘪𝘰𝘯𝘢𝘣𝘭𝘦 𝘧𝘰𝘳 𝘱𝘦𝘰𝘱𝘭𝘦 𝘣𝘶𝘪𝘭𝘥𝘪𝘯𝘨 𝘢 𝘷𝘰𝘪𝘤𝘦 𝘢𝘨𝘦𝘯𝘵. 𝘞𝘦'𝘳𝘦 𝘢𝘭𝘴𝘰 𝘨𝘭𝘢𝘥 𝘵𝘰 𝘴𝘦𝘦 𝘵𝘸𝘰 𝘥𝘪𝘧𝘧𝘦𝘳𝘦𝘯𝘵 𝘌𝘭𝘦𝘷𝘦𝘯𝘈𝘨𝘦𝘯𝘵𝘴 𝘤𝘰𝘯𝘧𝘪𝘨𝘶𝘳𝘢𝘵𝘪𝘰𝘯𝘴 𝘰𝘯 𝘵𝘩𝘦 𝘗𝘢𝘳𝘦𝘵𝘰 𝘧𝘳𝘰𝘯𝘵𝘪𝘦𝘳."

This is one of many evaluations we'll be releasing. If you're building voice agents, or building the models that power them, run EVA-Bench on your stack and show us where you land.

🔎 Want to know more?
🌐 𝗪𝗲𝗯𝘀𝗶𝘁𝗲: https://lnkd.in/eaRnvm7G
💻 𝗖𝗼𝗱𝗲: https://lnkd.in/e53m3GYe
🗂️ 𝗗𝗮𝘁𝗮𝘀𝗲𝘁: https://lnkd.in/erCqPkc6
📄 𝗣𝗮𝗽𝗲𝗿: https://lnkd.in/e_ShvWDw

Technical Contributors: Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang Nguyen, Raghav Mehndiratta, Lindsay Brin, Joseph Marinier, Hari Subramani

Leadership & Product: Sridhar Krishna Nemala, Anil Kumar Madamala, Srinivas Sunkara, Joyce Li, Nitin Aggarwal

#VoiceAI #VoiceAgents #AIResearch #ConversationalAI #ServiceNowResearch #ElevenLabs
Very interesting, Jack Smith - thanks for reposting! Ira Simon - why not work together again?
End-to-end evaluation is especially important for voice agents because each layer can fail differently. Transcription accuracy, spoken response quality, latency, and task completion all affect the user experience. Benchmarks that measure the full stack are more useful than isolated model scores for production decisions.
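
For readers unfamiliar with the term, here is a minimal sketch of what "landing on the Pareto frontier" means for two scores such as EVA-A and EVA-X. It assumes higher is better on both axes. The `Score` type, the `pareto_frontier` helper, the system names, and every number are hypothetical placeholders for illustration; none of this comes from EVA-Bench's code or results.

```python
from typing import NamedTuple


class Score(NamedTuple):
    name: str
    eva_a: float  # accuracy axis; assumed higher is better
    eva_x: float  # experience axis; assumed higher is better


def pareto_frontier(scores: list[Score]) -> list[Score]:
    """Return the systems that no other system dominates."""

    def dominates(p: Score, q: Score) -> bool:
        # p dominates q if it is at least as good on both axes
        # and strictly better on at least one of them.
        return (
            p.eva_a >= q.eva_a
            and p.eva_x >= q.eva_x
            and (p.eva_a > q.eva_a or p.eva_x > q.eva_x)
        )

    return [q for q in scores if not any(dominates(p, q) for p in scores)]


# Placeholder numbers for hypothetical systems, NOT EVA-Bench results.
systems = [
    Score("system_a", 0.80, 0.60),
    Score("system_b", 0.70, 0.75),
    Score("system_c", 0.65, 0.55),
    Score("system_d", 0.75, 0.70),
]

frontier_names = [s.name for s in pareto_frontier(systems)]
print(frontier_names)
```

In this toy data, `system_c` is left off the frontier because `system_a` is better on both axes. The other three each represent a different trade-off between the two scores, which is why one configuration can lead on one axis and another on the other while both stay on the frontier.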