Evaluating Voice Agents with EVA-Bench

ServiceNow AI Research

🤖 We benchmarked ElevenAgents. Here's what we found.

Building a great voice agent isn't just about a great model; it's about every layer of the stack working together under real conditions. That's why we recently released EVA-Bench, our open-source framework for evaluating voice agents end-to-end across EVA-A (Accuracy) and EVA-X (Experience).

This week, we evaluated ElevenLabs' ElevenAgents, and here are the results:

• Scribe v2.2 Realtime is the strongest STT model we tested. Transcription accuracy on key entities above 95%, holding above 93% under French accent and coffee shop background noise.
• Eleven Flash v2 scored above 97% on Speech Fidelity - the only metric in any end-to-end voice agent benchmark that evaluates what the agent actually says out loud.
• Of all 16 systems we benchmarked, both ElevenAgents configurations landed on the Pareto frontier. With Claude Haiku, the highest EVA-X score. With GPT-5.4, the highest EVA-A.

Shaheen Lavie-Rouse, FDE at ElevenLabs, said it best: "EVA-Bench is the first benchmark that actually measures voice agent quality end-to-end - from transcription to task completion to spoken output - in a way that's actionable for people building a voice agent. We're also glad to see two different ElevenAgents configurations on the Pareto frontier."

This is one of many evaluations we'll be releasing. If you're building voice agents, or building the models that power them, run EVA-Bench on your stack and show us where you land.

🔎 Want to know more?
🌐 Website: https://lnkd.in/eaRnvm7G
💻 Code: https://lnkd.in/e53m3GYe
🗂️ Dataset: https://lnkd.in/erCqPkc6
📄 Paper: https://lnkd.in/e_ShvWDw

Technical Contributors: Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang Nguyen, Raghav Mehndiratta, Lindsay Brin, Joseph Marinier, Hari Subramani

Leadership & Product: Sridhar Krishna Nemala, Anil Kumar Madamala, Srinivas Sunkara, Joyce Li, Nitin Aggarwal

#VoiceAI #VoiceAgents #AIResearch #ConversationalAI #ServiceNowResearch #ElevenLabs
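
For readers unfamiliar with the term, here is a minimal sketch of what "landing on the Pareto frontier" means for two scores such as EVA-A and EVA-X: a system is on the frontier if no other system is at least as good on both scores and strictly better on one. This is not EVA-Bench code. The `Score` type, the `pareto_frontier` function, the system names, and every number below are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Score(NamedTuple):
    name: str
    eva_a: float  # accuracy-style score, higher is better (illustrative values)
    eva_x: float  # experience-style score, higher is better (illustrative values)


def dominates(p: Score, q: Score) -> bool:
    """True if p is at least as good as q on both axes and strictly better on one."""
    return (
        p.eva_a >= q.eva_a
        and p.eva_x >= q.eva_x
        and (p.eva_a > q.eva_a or p.eva_x > q.eva_x)
    )


def pareto_frontier(scores: list[Score]) -> list[Score]:
    """Keep only the systems that no other system dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


# Made-up scores, not results from the benchmark.
systems = [
    Score("system-a", 0.80, 0.60),
    Score("system-b", 0.70, 0.75),
    Score("system-c", 0.65, 0.70),  # dominated by system-b on both axes
    Score("system-d", 0.60, 0.80),
]

print([s.name for s in pareto_frontier(systems)])
# prints ['system-a', 'system-b', 'system-d']
```

In this toy data, system-a has the highest accuracy-style score and system-d the highest experience-style score, yet both sit on the frontier together with system-b. That is the same shape as the claim in the post, where two configurations each lead on a different axis.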

4 Comments

Steven W. 1d

End-to-end evaluation is especially important for voice agents because each layer can fail differently. Transcription accuracy, spoken response quality, latency, and task completion all affect the user experience. Benchmarks that measure the full stack are more useful than isolated model scores for production decisions.

Nukri Kvaratskhelia 2d

https://livevisionhub.com/elevenlabs-review-2026-is-it-the-best-ai-voice-generator-for-creators/

Michal Filip Kowalik, PhD 1d

Very interesting, Jack Smith - thanks for reposting! Ira Simon - why not work together again?

Sabrina Delale 2d

Antoine Conan


More Relevant Posts

Diego Beltrame
3d
This benchmark is really important for many businesses that are reinventing themselves around AI-Voice technology and the power of agentic workflows.
Pravin Deshmukh
2w
I found 100 prompt commands that completely change how Claude responds. Not official features. Just the exact words that trigger a different mode. Not hacks. Not tricks. Just single words you drop at the start of your prompt.

Instead of typing "rewrite this so it doesn't sound like AI"
You type: /ghost

Instead of "find every weakness in my idea"
You type: /redteam

One word. Completely different output. Here are 2 of the 10 categories:

─────────────────────────
1. WRITING & STYLE
─────────────────────────
/ghost - rewrites text so it's impossible to tell AI wrote it
/mirror - matches your exact writing style from a sample
/punch - makes every sentence hit harder and more direct
/hook - rewrites the opening line to grab attention instantly
/trim - cuts the fluff without losing any meaning
/flow - restructures text so it reads smoother start to finish
/polish - makes rough text sound professional without changing meaning
/voice - locks in a specific tone for the entire conversation
/rephrase - says the same thing in a completely different way
/raw - strips all formatting and gives a clean plain response

─────────────────────────
2. THINKING & REASONING
─────────────────────────
/deepthink - forces Claude to reason through every layer first
/blindspots - finds what you're missing that you didn't think to ask
/unpack - breaks a complex idea into every piece that makes it work
XRAY - sees through the obvious answer to find what's really going on
CHAINLOGIC - walks through each reasoning step so you can follow it
INVERT - solves the problem by thinking about it completely backwards
/layered - gives the answer at surface, mid, and expert level
OVERTHINK - over-analyzes to catch details everyone else misses
L99 - pushes the response to the highest possible level of depth
OODA - analyzes through observe, orient, decide, and act

─────────────────────────
8 more categories exist. Comment "FULL LIST" and I'll DM you all 100 in one clean doc 👇

I share extra AI prompts, DevOps roadmaps, and hidden resources on Telegram: https://lnkd.in/gbj9iV4M

#ChatGPT #PromptEngineering #AITools #ClaudeAI #ProductivityHacks
Craig Watkins
3w
Two LLMs got it wrong. The same way. Repeatedly.

I asked them to analyze a debate I had in the comments of a LinkedIn post. The other party cited a recognized framework and formal-sounding terminology. I pushed back; it didn't apply to the situation. Civil. Substantive. Standard disagreement.

When I countered their analyses with facts, both pivoted. Not concession, a softer version of the same critique. Concede the specific point, find new ground for the same posture. One model did this six times before it admitted I had done everything right. Six retreating positions, each costumed differently, all protecting the same impulse: find somewhere to plant a flaw.

Both models treated the framework-citing voice as the legitimate baseline. My counter-argument was framed as "the position that needs to justify itself." Both AIs admitted they had hallucinated authority and persona details onto the other party.

The pattern was consistent across both AIs:
Anchor on the more "credible" voice.
Soften real disagreement into "both right at different layers," a phrase that sounds like nuance but actually erases it.
Code civil pushback as "combative" without textual support.
Project authority based on cues, not knowledge.
Retreat into audience-perception arguments when substantive ones get knocked down.

Re-reading didn't fix it. "Read carefully" produced more confidence in the same wrong parse. What worked was specific, factual pushback. Repeatedly. Both LLMs eventually conceded that their analyses were flawed in the ways described above.

The question bouncing around in my head: how many people fight their AI like this? How many users will push six rounds against a system that sounds confident, formats cleanly, and frames its output as objective? And honestly, how often do I catch authority it can't verify? Notice when critiques shift from substance to perception?

Most AI-mediated analysis amplifies the already-dominant voice or framework. Dissent doesn't have to be silenced. It just has to be made less likely to surface.

I wrote a post last year (AI, History and the Tyranny of the Likely Sentence) about AI as a stabilizer of dominant narratives rather than a tool for interrogating them. Last week was the case study.

What I come back to: AI didn't fail me in that conversation. I would have failed it if I accepted its first answer.

These tools are extraordinary. The bar for using them well isn't technical literacy. It's the willingness to disagree with them, specifically and repeatedly, when something doesn't sit right. If you read AI output and accept it because it sounds right, you've outsourced your judgment to a system trained to sound right. That's not the tool failing. That's us.

Vidyod Palakeel
2w
This is a very important direction from Eric Evans.

Context Mapping with an AI-based Component

What stands out to me is that he's treating AI not as a "feature," but as a distinct bounded context with its own behavior model, constraints, and integration patterns. That's a major architectural shift.

The key insight for me:
👉 AI components are probabilistic
👉 Enterprise systems are expected to behave deterministically

So architecture becomes the discipline that manages that tension.

Eric Evans
2w

If we are going to have AI-based components within our system, we need to understand what bounded contexts they form or are a part of, and how those relate to the others, conventional and otherwise. Here's an article in my blog walking through a simple example to show how I might think this out: https://lnkd.in/eiVJ6ERp

Context Mapping with an AI-based Component - Domain Language https://www.domainlanguage.com
James Barrow
2w
And having these context maps in a short form that allows for easy reasoning can help with cognitive overload in humans and with token usage of AI agents - no need to parse a codebase to gain the context if a Mermaid diagram with a brief glossary suffices.

Shaik Gouse Pasha
3w
Someone fine-tuned Qwen3-ASR to transcribe laughter, sighs, and breaths inline.

Qwen3-ASR-Enhanced-v0.1 by mrfakename. Apache-2.0. Built on Qwen3-ASR-1.7B.

Most ASR models throw nonverbal sounds away. This one keeps them as signal.

━━━━━━━━━━━━━━━━━━━━
What it actually does

Standard ASR output:
"Yeah I mean that was honestly the best part of the whole thing"

Qwen3-ASR-Enhanced output:
"Yeah I mean [laughs] that was honestly the best part [sigh] of the whole thing"

It tags laughter, sighs, coughs, breaths, throat clearing, and crying inline with the transcript. Drop-in replacement for the base Qwen3-ASR-1.7B - same vLLM serving, streaming, and forced alignment pipelines.

━━━━━━━━━━━━━━━━━━━━
Why this matters more than it sounds

For most transcription, nonverbal sounds are noise. For these workloads, they're the entire point:
→ Expressive TTS training data - you can't train a model to laugh if your dataset deletes the laughs
→ Conversational AI evaluation - knowing when users sighed at your bot is a real signal
→ Affective computing - emotion recognition models need paralinguistic anchors
→ Podcast and interview transcription - comedic timing and reactions are content
→ Accessibility captions - descriptive captions make media usable for more people
→ Content moderation - tone often matters more than words

━━━━━━━━━━━━━━━━━━━━
The pattern worth noticing

This is what a healthy open-source ASR ecosystem looks like.
→ Alibaba ships Qwen3-ASR-1.7B
→ Independent developer fine-tunes it for a specific gap
→ Released under Apache-2.0, drop-in compatible with the base
→ Available on HuggingFace within weeks

A year ago, "ASR with paralinguistic tags" was a $50K research project. Today it's a community fine-tune you can pull in two commands.

━━━━━━━━━━━━━━━━━━━━
The honest caveats

→ Alpha release - diarization and nonverbal tag stability are coming in v0.2
→ Languages tested: English, German, French (multilingual base, but tags primarily verified on EN/DE/FR)
→ Single-speaker focus for now

If you need production diarization today, wait for the next version or pair this with a separate diarization stage.

━━━━━━━━━━━━━━━━━━━━
🤗 https://lnkd.in/gnrkfcqn
🤗 Base: https://lnkd.in/g_mZs5SW

━━━━━━━━━━━━━━━━━━━━
At Zingaro Ai, we build enterprise voice AI agents: full-duplex, on-premise, 30+ languages.
At LiteCompute AI, we build transcription and audio intelligence pipelines on Qwen3-ASR, Granite Speech, and Whisper, deployed on your own infrastructure.

Need rich transcription that captures more than just words? → DM me.

━━━━━━━━━━━━━━━━━━━━
♻️ Repost if useful.
👉 Follow for daily open-source AI breakdowns.

#ASR #SpeechRecognition #OpenSourceAI #AffectiveComputing #VoiceAI
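
Downstream code that consumes a tagged transcript like the one quoted above would typically need to separate the spoken words from the nonverbal events. The sketch below shows one way to do that with a regular expression. It does not call the model, and it assumes tags appear as lowercase words in square brackets, as in the post's example. The full tag vocabulary and the function name `split_transcript` are assumptions.

```python
import re

# Assumed tag shape: lowercase words in square brackets, e.g. [laughs], [throat clearing].
TAG_PATTERN = re.compile(r"\[([a-z ]+)\]")
# A token is either a whole bracketed tag or a run of non-whitespace characters.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|\S+")


def split_transcript(text: str) -> tuple[str, list[tuple[str, int]]]:
    """Separate inline nonverbal tags from the spoken words.

    Returns the plain transcript plus a list of (tag, word_index) pairs,
    where word_index is the number of spoken words that precede the tag.
    """
    words: list[str] = []
    events: list[tuple[str, int]] = []
    for token in TOKEN_PATTERN.findall(text):
        match = TAG_PATTERN.fullmatch(token)
        if match:
            events.append((match.group(1), len(words)))
        else:
            words.append(token)
    return " ".join(words), events


tagged = "Yeah I mean [laughs] that was honestly the best part [sigh] of the whole thing"
plain, events = split_transcript(tagged)
print(plain)   # prints: Yeah I mean that was honestly the best part of the whole thing
print(events)  # prints: [('laughs', 3), ('sigh', 9)]
```

Recording each tag's position as a word index makes it easy to line events up with word-level timestamps from a forced-alignment step, if one is available.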
EVA Online AI

1w
We ran the same client email brief through 4 models at once. Claude, GPT, Gemini, and Grok. Same instructions, same context, same stakes. Here's what came back: Gemini went straight to the fix. Clean, practical, no fluff. It quietly suggested rethinking meeting frequency, which wasn't even in the brief. Smart move. GPT was the most polished. Also the most forgettable. Every sentence was correct. None of them had teeth. Claude was the only one that explained why check-ins matter. "They protect your results and our timelines." That line does more relationship work than soft language ever could. Grok opened with zero warmup. Too blunt to send as-is, but that first line was the best hook in the room. We moved it to paragraph two. No single model nailed it. But running all four at once meant we had the best parts of each, in under 30 seconds. That's Compare Mode. One prompt. Four models. Side by side. Free tier at evaonline.ai.
孙宇石
3w
Excited to share our two new preprints tackling a fundamental challenge in AI agents: How should agents remember? 🧐

As LLM agents move from single-turn assistants to long-horizon companions, memory becomes their bottleneck, not just what to retrieve, but how to structure it and when to let it go.

📄 GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory
https://lnkd.in/gYWBFyMV

Most memory systems dump retrieved fragments as flat text into the prompt. GRAVITY takes a different approach: it extracts three complementary structures from raw conversations:
• Entity profiles grounded in relational graphs
• Temporal event tuples linked into causal traces
• Cross-session topic summaries

These are injected as structured anchoring contexts at generation time, requiring zero architectural changes to the host model. Across 5 diverse memory systems on LongMemEval and LoCoMo, GRAVITY improves LLM-judge accuracy by 7.5–10.1% on average, and even the strongest baseline gains 3.8–5.7%.

Key insight: structured context anchoring is broadly effective regardless of the underlying memory architecture.

📄 STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
https://lnkd.in/gN6CAH6w

We identify a critical blind spot: Implicit Conflict, when a later observation silently invalidates an earlier memory without explicit negation. Think: a user mentions moving to a new city, but the agent still plans around their old commute.

We introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 queries, 100+ everyday topics, up to 150K-token contexts) with a three-dimensional probing framework:
• State Resolution: detecting outdated beliefs
• Premise Resistance: rejecting queries that presuppose stale states
• Implicit Policy Adaptation: proactively applying updated states

The result? Even the best frontier LLM achieves only 55.2% overall accuracy. Models can often retrieve the updated evidence but fail to act on it. We further propose CUPMem, showing that explicit state adjudication is a promising path forward.

Together, these two works address complementary aspects of the agent memory problem:
🔹 GRAVITY asks: How do we make retrieved memories more useful? Structure them.
🔹 STALE asks: How do we know when memories are no longer valid? Detect implicit conflicts.

Building agents that truly remember, and know when to forget, remains wide open. Happy to discuss and collaborate!

#AI #LLM #AgentMemory #NLP #Research #Preprint

STALE: Can LLM Agents Know When Their Memories Are No Longer Valid? arxiv.org
Kevin Luddy
2w
Two TDS pieces this year framed the same gap in different vocabularies.

Mahe Jabeen Abdul's June 2025 piece "Data Drift Is Not the Actual Problem" laid out a three-layer monitoring framework for ML prediction systems -- statistical, contextual, behavioral. Shafeeq Ur Rahaman's piece last week ("The Next AI Bottleneck Isn't the Model: It's the Inference System") made the case that the same engineering rigor needs to move into LLM-inference design.

Both are right. What neither piece fully drew is the structural translation -- Mahe's three layers applied to retrieval and agent traces. The version I keep using in production audits:

STATISTICAL LAYER. Prediction-model PSI / KL divergence on feature distributions becomes percentile structure of retrieval scores, rank ordering, redundancy index across the trace. The questions are different. What you actually want: where in the percentile distribution did the retrieval anomaly land, and against what training-time baseline.

CONTEXTUAL LAYER. Prediction-model business-KPI slicing becomes trace slicing by task type (read-only Q&A vs multi-step write) and by retrieval source (RAG vs tool-call vs memory). The same statistical anomaly can be normal in one slice and pathological in another. Most stacks aggregate and miss this.

BEHAVIORAL LAYER. Prediction-model outputs-vs-outcomes becomes the gap between the model's self-eval confidence and the downstream-evaluator score (human or LLM-judge). Stable gap = calibrated. Drifting gap across prompt revisions = you're shipping confident-wrong agents that pass every per-step check.

The pathology I see most often is the LLM-agent version of what Mahe called "silent drift most systems miss": users phrase queries in ways that flip retrieval rank inversion or push response style off-policy. All the per-step monitors say green. The composite interaction silently goes off-brand. The behavioral layer would catch it. Most production agent systems aren't tracking that layer because nobody set up the outcome-vs-self-eval gap as a tracked metric.

Question for anyone running LLM systems in production: which layer fails first for you, and what's the canary you actually trust?
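
Two of the quantities named above are easy to pin down in code. The sketch below computes a Population Stability Index (PSI) between a baseline and a current sample, using the standard formula: the sum over bins of (current share - baseline share) × ln(current share / baseline share). It also computes the mean gap between self-eval confidence and evaluator score. Applying PSI to retrieval scores, the bin count, the `eps` floor, the function names, and all sample data are assumptions made for illustration, not the author's implementation.

```python
import math


def psi(baseline: list[float], current: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two samples, binned on the baseline's range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def calibration_gap(self_confidence: list[float], evaluator_score: list[float]) -> float:
    """Mean of (self-eval confidence - evaluator score); positive means overconfident."""
    pairs = list(zip(self_confidence, evaluator_score))
    return sum(s - e for s, e in pairs) / len(pairs)


# Synthetic retrieval scores: a uniform baseline and a copy shifted upward.
baseline_scores = [i / 100 for i in range(100)]
shifted_scores = [min(s + 0.3, 1.0) for s in baseline_scores]

print(psi(baseline_scores, baseline_scores))   # identical samples give 0.0
print(psi(baseline_scores, shifted_scores) > 0)  # a shifted sample gives a positive PSI
print(calibration_gap([0.9, 0.8], [0.5, 0.6]))   # roughly 0.3: confident but scored lower
```

Tracking `calibration_gap` per prompt revision and watching whether it drifts is one simple way to make the behavioral layer a tracked metric. Computing `psi` separately per task type or retrieval source would correspond to the contextual slicing.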