LLM Evaluation Methods Beyond String Matching

building AI systems @meta

207,068 followers 11mo

Explaining the evaluation method LLM-as-a-Judge (LLMaaJ).

Token-based metrics like BLEU or ROUGE are still useful for structured tasks like translation or summarization. But for open-ended answers, RAG copilots, or complex enterprise prompts, they often miss the bigger picture. That’s where LLMaaJ changes the game.

𝗪𝗵𝗮𝘁 𝗶𝘀 𝗶𝘁?
You use a powerful LLM as an evaluator, not a generator. It’s given:
- The original question
- The generated answer
- The retrieved context or gold answer

𝗧𝗵𝗲𝗻 𝗶𝘁 𝗮𝘀𝘀𝗲𝘀𝘀𝗲𝘀:
✅ Faithfulness to the source
✅ Factual accuracy
✅ Semantic alignment, even if phrased differently

𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀:
LLMaaJ captures what traditional metrics can’t. It understands paraphrasing. It flags hallucinations. It mirrors human judgment, which is critical when deploying GenAI systems in the enterprise.

𝗖𝗼𝗺𝗺𝗼𝗻 𝗟𝗟𝗠𝗮𝗮𝗝-𝗯𝗮𝘀𝗲𝗱 𝗺𝗲𝘁𝗿𝗶𝗰𝘀:
- Answer correctness
- Answer faithfulness
- Coherence, tone, and even reasoning quality

📌 If you’re building enterprise-grade copilots or RAG workflows, LLMaaJ is how you scale QA beyond manual reviews.

To put LLMaaJ into practice, check out EvalAssist, a new tool from IBM Research. It offers a web-based UI to streamline LLM evaluations:
- Refine your criteria iteratively using Unitxt
- Generate structured evaluations
- Export as Jupyter notebooks to scale effortlessly

A powerful way to bring LLM-as-a-Judge into your QA stack.
- Get Started guide: https://lnkd.in/g4QP3-Ue
- Demo Site: https://lnkd.in/gUSrV65s
- Github Repo: https://lnkd.in/gPVEQRtv
- Whitepapers: https://lnkd.in/gnHi6SeW
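To make the setup in the post above concrete, here is a minimal sketch of a judge call, assuming the judge is reachable through some function that takes a prompt and returns text. The prompt wording, the JSON verdict shape, and the names `JUDGE_TEMPLATE`, `judge_answer`, and `fake_judge` are all illustrative assumptions, not part of EvalAssist or Unitxt.

```python
import json
from typing import Callable

# Hypothetical prompt; real criteria should be refined for your own application.
JUDGE_TEMPLATE = """You are an impartial evaluator.
Question: {question}
Retrieved context: {context}
Generated answer: {answer}

Rate the answer's faithfulness to the context and its factual accuracy.
Reply with JSON only: {{"faithful": true or false, "score": 1-5, "reason": "..."}}"""


def judge_answer(question: str, answer: str, context: str,
                 call_judge: Callable[[str], str]) -> dict:
    """Build the judge prompt, call the judge model, and parse its verdict."""
    prompt = JUDGE_TEMPLATE.format(question=question, context=context, answer=answer)
    raw = call_judge(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Judges sometimes return malformed output; surface it instead of guessing.
        return {"faithful": None, "score": None, "reason": f"unparseable: {raw!r}"}


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return '{"faithful": true, "score": 4, "reason": "Answer restates the context."}'


result = judge_answer(
    question="When was the library released?",
    answer="It came out in 2021.",
    context="The library's first release was in 2021.",
    call_judge=fake_judge,
)
print(result["score"], result["reason"])
```

Passing the model call in as a function keeps the sketch independent of any particular provider; swapping `fake_judge` for a real client is the only change needed.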

Kuldeep Singh Sidhu

Senior Data Scientist @ Walmart | BITS Pilani

16,493 followers 10mo

Evaluating Retrieval-Augmented Generation (RAG) systems has long been a challenge, given the complexity and subjectivity of long-form responses. A recent collaborative research paper from institutions including the University of Waterloo, Microsoft, and Snowflake presents a promising solution: the AutoNuggetizer framework.

This approach leverages Large Language Models (LLMs) to automate the "nugget evaluation methodology," initially proposed by TREC in 2003 for assessing responses to complex questions. Here's a technical breakdown of how it works under the hood:

1. Nugget Creation:
- Initially, LLMs automatically extract "nuggets," or atomic pieces of essential information, from a set of related documents.
- Nuggets are classified as "vital" (must-have) or "okay" (nice-to-have) based on their importance in a comprehensive response.
- An iterative prompt-based approach using GPT-4o ensures the nuggets are diverse and cover different informational facets.

2. Nugget Assignment:
- LLMs then automatically evaluate each system-generated response, assigning nuggets as "support," "partial support," or "no support."
- This semantic evaluation allows the model to recognize supported facts even without direct lexical matching.

3. Evaluation and Correlation:
- Automated evaluation scores strongly correlated with manual evaluations, particularly at the system-run level, suggesting this methodology could scale efficiently for broad usage.
- Interestingly, the automation of nugget assignment alone significantly increased alignment with manual evaluations, highlighting its potential as a cost-effective evaluation approach.

Through rigorous validation against human annotations, the AutoNuggetizer framework demonstrates a practical balance between automation and evaluation quality, providing a scalable, accurate method to advance RAG system evaluation.

The research not only underscores the potential of automating complex evaluations but also opens avenues for future improvements in RAG systems.
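Once nuggets have been created and assigned, turning the labels into a number is simple arithmetic. The sketch below is one way to do it, not the paper's exact formula: the numeric credit per label (1.0, 0.5, 0.0) and the option to score vital nuggets only are assumptions made for illustration, and the `Nugget` class is hypothetical.

```python
from dataclasses import dataclass

# Assumed credit per assignment label; the post does not give numeric weights.
CREDIT = {"support": 1.0, "partial_support": 0.5, "no_support": 0.0}


@dataclass
class Nugget:
    text: str
    importance: str   # "vital" or "okay"
    assignment: str   # one of the keys in CREDIT


def nugget_score(nuggets: list[Nugget], vital_only: bool = True) -> float:
    """Average credit over the selected nuggets (0.0 if none are selected)."""
    selected = [n for n in nuggets if n.importance == "vital" or not vital_only]
    if not selected:
        return 0.0
    return sum(CREDIT[n.assignment] for n in selected) / len(selected)


response_nuggets = [
    Nugget("Framework automates nugget evaluation", "vital", "support"),
    Nugget("Nuggets are labeled vital or okay", "vital", "partial_support"),
    Nugget("Methodology dates back to TREC 2003", "okay", "no_support"),
]
print(nugget_score(response_nuggets))                    # vital nuggets only
print(nugget_score(response_nuggets, vital_only=False))  # all nuggets
```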

Sohrab Rahimi

Director, AI/ML Lead @ Google

23,844 followers 11mo

Evaluating LLMs is hard. Evaluating agents is even harder.

This is one of the most common challenges I see when teams move from using LLMs in isolation to deploying agents that act over time, use tools, interact with APIs, and coordinate across roles. These systems make a series of decisions, not just a single prediction. As a result, success or failure depends on more than whether the final answer is correct.

Despite this, many teams still rely on basic task success metrics or manual reviews. Some build internal evaluation dashboards, but most of these efforts are narrowly scoped and miss the bigger picture.

Observability tools exist, but they are not enough on their own. Google’s ADK telemetry provides traces of tool use and reasoning chains. LangSmith gives structured logging for LangChain-based workflows. Frameworks like CrewAI, AutoGen, and OpenAgents expose role-specific actions and memory updates. These are helpful for debugging, but they do not tell you how well the agent performed across dimensions like coordination, learning, or adaptability.

Two recent research directions offer much-needed structure. One proposes breaking down agent evaluation into behavioral components like plan quality, adaptability, and inter-agent coordination. Another argues for longitudinal tracking, focusing on how agents evolve over time, whether they drift or stabilize, and whether they generalize or forget.

If you are evaluating agents today, here are the most important criteria to measure:
• 𝗧𝗮𝘀𝗸 𝘀𝘂𝗰𝗰𝗲𝘀𝘀: Did the agent complete the task, and was the outcome verifiable?
• 𝗣𝗹𝗮𝗻 𝗾𝘂𝗮𝗹𝗶𝘁𝘆: Was the initial strategy reasonable and efficient?
• 𝗔𝗱𝗮𝗽𝘁𝗮𝘁𝗶𝗼𝗻: Did the agent handle tool failures, retry intelligently, or escalate when needed?
• 𝗠𝗲𝗺𝗼𝗿𝘆 𝘂𝘀𝗮𝗴𝗲: Was memory referenced meaningfully, or ignored?
• 𝗖𝗼𝗼𝗿𝗱𝗶𝗻𝗮𝘁𝗶𝗼𝗻 (𝗳𝗼𝗿 𝗺𝘂𝗹𝘁𝗶-𝗮𝗴𝗲𝗻𝘁 𝘀𝘆𝘀𝘁𝗲𝗺𝘀): Did agents delegate, share information, and avoid redundancy?
• 𝗦𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗼𝘃𝗲𝗿 𝘁𝗶𝗺𝗲: Did behavior remain consistent across runs or drift unpredictably?

For adaptive agents or those in production, this becomes even more critical. Evaluation systems should be time-aware, tracking changes in behavior, error rates, and success patterns over time. Static accuracy alone will not explain why an agent performs well one day and fails the next.

Structured evaluation is not just about dashboards. It is the foundation for improving agent design. Without clear signals, you cannot diagnose whether failure came from the LLM, the plan, the tool, or the orchestration logic.

If your agents are planning, adapting, or coordinating across steps or roles, now is the time to move past simple correctness checks and build a robust, multi-dimensional evaluation framework. It is the only way to scale intelligent behavior with confidence.
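The criteria above can be recorded per run and compared over time. The following is a rough sketch under stated assumptions: each dimension is scored on a 0 to 1 scale by some upstream grader, and drift is flagged when the mean of the most recent runs moves away from the earlier baseline by more than a tolerance. The names `RunScore`, `summarize`, and `has_drifted`, and the default window and tolerance, are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Dimension names follow the criteria in the post; the 0-1 scale is an assumption.
DIMENSIONS = ["task_success", "plan_quality", "adaptation", "memory_usage", "coordination"]


@dataclass
class RunScore:
    run_id: str
    task_success: float
    plan_quality: float
    adaptation: float
    memory_usage: float
    coordination: float


def summarize(runs: list[RunScore]) -> dict[str, dict[str, float]]:
    """Per-dimension mean and spread across runs; a large spread hints at instability."""
    summary = {}
    for dim in DIMENSIONS:
        values = [getattr(r, dim) for r in runs]
        summary[dim] = {"mean": mean(values), "stdev": pstdev(values)}
    return summary


def has_drifted(runs: list[RunScore], dimension: str,
                window: int = 2, tolerance: float = 0.2) -> bool:
    """Compare the last `window` runs against all earlier runs on one dimension."""
    if len(runs) <= window:
        return False  # not enough history to form a baseline
    values = [getattr(r, dimension) for r in runs]
    baseline, recent = values[:-window], values[-window:]
    return abs(mean(recent) - mean(baseline)) > tolerance


runs = [
    RunScore("r1", 0.9, 0.8, 0.7, 0.6, 0.8),
    RunScore("r2", 0.9, 0.8, 0.7, 0.6, 0.8),
    RunScore("r3", 0.5, 0.8, 0.7, 0.6, 0.8),
    RunScore("r4", 0.5, 0.8, 0.7, 0.6, 0.8),
]
print(summarize(runs)["task_success"])
print(has_drifted(runs, "task_success"), has_drifted(runs, "plan_quality"))
```

Keeping one record per run, rather than one aggregate number, is what makes it possible to ask later whether a failure came from the plan, the tools, or coordination.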

Cameron R. Wolfe, Ph.D.

Research @ Netflix

24,103 followers 10mo

LLM-as-a-Judge (LaaJ) and reward models (RMs) are similar concepts, but understanding their nuanced differences is important for applying them correctly in practice…

LLM-as-a-Judge is a reference-free evaluation metric that assesses model outputs by simply prompting a powerful language model to perform the evaluation for us. In the standard setup, we ask the model to either:
- Provide a direct assessment score (e.g., binary or Likert score) of a model’s output.
- Compare the relative quality of multiple outputs (i.e., pairwise scoring).

There are many choices for the LLM judge we use. For example, we can use an off-the-shelf foundation model, fine-tune our own model, or form a "jury" of several LLM judges.

Reward models are specialized LLMs (usually derived from the LLM we are currently training) that are trained to predict a human preference score given a prompt and a candidate completion as input. A higher score from the RM indicates higher human preference.

Similarities between LaaJ and RMs: Both LaaJ and RMs can provide direct assessment and pairwise (preference) scores. Therefore, both techniques can be used for evaluation. Given these similarities, recent research has explored combining RMs and LaaJ into a single model with both capabilities.

Differences between LaaJ and RMs: Despite their surface similarities, these two techniques have many fundamental differences:
- RMs are fine-tuned using a preference learning or ranking objective, whereas fine-tuned LaaJ models usually learn via standard language modeling objectives.
- LaaJ models are often based on off-the-shelf or foundation LLMs, whereas RMs are always fine-tuned.
- LaaJ is based on a standard LLM architecture, while RMs typically add an additional classification head to predict a preference score.
- RMs only score single model outputs (though we can derive a preference score by plugging multiple RM scores into a preference model like Bradley-Terry), whereas LaaJ can support arbitrary scoring setups (i.e., is more flexible).

Where should we use each technique? Given these differences, recent research has provided insights into where LaaJ and RMs are most effective.

LaaJ should be used for evaluation purposes (both direct assessment and pairwise). This is an incredibly powerful evaluation technique that is used almost universally. When we compare the evaluation accuracy of LaaJ (assuming correct setup and tuning) to RMs, LaaJ models tend to have superior scoring accuracy; for example, in RewardBench2, LaaJ models achieve the highest accuracy on pairwise preference scoring.

Despite LaaJ’s strengths, RMs are still more useful for RL-based training with LLMs (e.g., PPO-based RLHF). Interestingly, even though LaaJ models provide more accurate preference scores, they cannot be directly used as RMs for RL training. It is important that the RM is derived from the policy currently being trained, meaning we must train a custom RM based on our current policy for RLHF to work properly.
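The Bradley-Terry step mentioned above is a one-line formula: given two scalar RM scores, the probability that output A is preferred over output B is the logistic sigmoid of their difference. The sketch below assumes the scores are on the log-strength scale that this formulation expects; the function name and the example scores are made up for illustration.

```python
import math


def bradley_terry_preference(score_a: float, score_b: float) -> float:
    """P(A preferred over B) = exp(a) / (exp(a) + exp(b)) = sigmoid(a - b)."""
    diff = score_a - score_b
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Two hypothetical reward-model scores for completions of the same prompt.
p = bradley_terry_preference(2.0, 0.0)
print(round(p, 4))
```

Only the difference between the two scores matters, which is why a reward model's absolute scale carries no meaning on its own.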

Mayank A.

Follow for Your Daily Dose of AI, Software Development & System Design Tips | Exploring AI SaaS - Tinkering, Testing, Learning | Everything I write reflects my personal thoughts and has nothing to do with my employer. 👍

176,185 followers 7mo

We've all shipped an LLM feature that "felt right" in dev, only to watch it break in production. Why? Because human "eyeballing" isn't a scalable evaluation strategy.

The real challenge in building robust AI isn't just getting an LLM to generate an output. It’s ensuring the output is 𝐫𝐢𝐠𝐡𝐭, 𝐬𝐚𝐟𝐞, 𝐟𝐨𝐫𝐦𝐚𝐭𝐭𝐞𝐝, 𝐚𝐧𝐝 𝐮𝐬𝐞𝐟𝐮𝐥, consistently, across thousands of diverse user inputs.

This is where 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐌𝐞𝐭𝐫𝐢𝐜𝐬 become non-negotiable. Think of them as the sophisticated unit tests and integration tests for your LLM's brain. You need to move beyond "does it work?" to "how well does it work, and why?"

This is precisely what Comet's 𝐎𝐩𝐢𝐤 is designed for. It provides the framework to rigorously grade your LLM's performance, turning subjective feelings into objective data. Here's how we approach it, as shown in the cheat sheet below:

1./ Heuristic Metrics => the 'Linters' & 'Unit Tests'
- These are your non-negotiable, deterministic sanity checks.
- They are low-cost, fast, and catch objective failures.
- Your pipeline should fail here first.
▫️Is it valid? → IsJson, RegexMatch
▫️Is it faithful? → Contains, Equals
▫️Is it close? → Levenshtein

2./ LLM-as-a-Judge => the 'Peer Review'
- This is for everything that "looks right" but might be subtly wrong.
- These metrics evaluate quality and nuance where statistical rules fail.
- They answer the hard, subjective questions.
▫️Is it true? → Hallucination
▫️Is it relevant? → AnswerRelevance
▫️Is it helpful? → Usefulness

3./ G-Eval => the dynamic 'Judge-Builder'
- G-Eval is a task-agnostic LLM-as-a-Judge.
- You define custom evaluation criteria in plain English (e.g., "Is the tone professional but not robotic?").
- It then uses Chain-of-Thought reasoning internally to analyze the output and produce a human-aligned score for those criteria.
- This allows you to test specific business logic without writing new code.

4./ Custom Metrics
- For everything else.
- This is where you write your own Python code to create a metric.
- It’s for when you need to check an output against a live internal API, a proprietary database, or any other logic that only your system knows.

Take a look at the cheat sheet for a quick breakdown. Which metric are you implementing first for your current LLM project?

♻️ Don't forget to repost.
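The heuristic tier is easy to picture in code. Below are plain-Python stand-ins for the checks named in the post, written from scratch with the standard library; they are not Opik's API, and the function names and signatures are assumptions made for this sketch.

```python
import json
import re


def is_json(output: str) -> bool:
    """Is it valid? True when the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def regex_match(output: str, pattern: str) -> bool:
    """Is it valid? True when the pattern occurs anywhere in the output."""
    return re.search(pattern, output) is not None


def contains(output: str, expected: str) -> bool:
    """Is it faithful? Case-insensitive substring check."""
    return expected.lower() in output.lower()


def levenshtein(a: str, b: str) -> int:
    """Is it close? Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


output = '{"order_id": "#12345", "status": "shipped"}'
print(is_json(output), regex_match(output, r"#\d{5}"), contains(output, "SHIPPED"))
print(levenshtein("kitten", "sitting"))
```

Because these checks are deterministic and cheap, they can run on every output before any judge model is called.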

Aishwarya Srinivasan

633,658 followers 11mo

Most people still think of LLMs as “just a model.” But if you’ve ever shipped one in production, you know it’s not that simple.

Behind every performant LLM system, there’s a stack of decisions about pretraining, fine-tuning, inference, evaluation, and application-specific tradeoffs. This diagram captures it well: LLMs aren’t one-dimensional. They’re systems. And each dimension introduces new failure points or optimization levers. Let’s break it down:

🧠 Pre-Training
Start with modality.
→ Text-only models like LLaMA, UL2, PaLM have predictable inductive biases.
→ Multimodal ones like GPT-4, Gemini, and LaVIN introduce more complex token fusion, grounding challenges, and cross-modal alignment issues.
Understanding the data diet matters just as much as parameter count.

🛠 Fine-Tuning
This is where most teams underestimate complexity:
→ PEFT strategies like LoRA and Prefix Tuning help with parameter efficiency, but can behave differently under distribution shift.
→ Alignment techniques (RLHF, DPO, RAFT) aren’t interchangeable. They encode different human preference priors.
→ Quantization and pruning decisions will directly impact latency, memory usage, and downstream behavior.

⚡️ Efficiency
Inference optimization is still underexplored. Techniques like dynamic prompt caching, paged attention, speculative decoding, and batch streaming make the difference between real-time and unusable. The infra layer is where GenAI products often break.

📏 Evaluation
One benchmark doesn’t cut it. You need a full matrix:
→ NLG (summarization, completion), NLU (classification, reasoning),
→ alignment tests (honesty, helpfulness, safety),
→ dataset quality, and
→ cost breakdowns across training + inference + memory.
Evaluation isn’t just a model task; it’s a systems-level concern.

🧾 Inference & Prompting
Multi-turn prompts, CoT, ToT, and ICL all behave differently under different sampling strategies and context lengths. Prompting isn’t trivial anymore. It’s an orchestration layer in itself.

Whether you’re building for legal, education, robotics, or finance, the “general-purpose” tag doesn’t hold. Every domain has its own retrieval, grounding, and reasoning constraints.

-------
Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg

Andriy Burkov

PhD in AI, author of 📖 The Hundred-Page Language Models Book and 📖 The Hundred-Page Machine Learning Book

488,040 followers 1mo

For the past few years, the standard recipe for fine-tuning LLMs on tasks like math reasoning has been reinforcement learning (RL): you let the model generate answers, score them, and use the scores to nudge the model's parameters via gradients. RL has known weaknesses here: it struggles when rewards only arrive at the end of long answers, it often "hacks" the reward by finding degenerate shortcuts, and two runs with identical settings can end up with very different final performance.

This paper shows that an old and much simpler family of methods, called evolution strategies, works well on models with billions of parameters, which most researchers had assumed was impossible. The method is straightforward: take the model, make thirty copies with small random noise added to every parameter, score each copy on the task, then shift the original parameters slightly toward the copies that scored higher. No gradients, no backpropagation, no value networks, no penalty terms to tune.

Using this approach, the authors fine-tune models from the Qwen and Llama families on a symbolic arithmetic puzzle, several math benchmarks, and Sudoku, and match or beat well-tuned RL baselines while keeping the same hyperparameters across every experiment.

Read with an AI tutor: https://lnkd.in/eYHcMrmG
PDF: https://lnkd.in/ek5kXmww
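The update described in the post fits in a few lines. This is a toy sketch on a two-parameter problem, not the paper's implementation: the population of thirty comes from the post, while the noise scale `sigma`, the learning rate `lr`, the score standardization, and the toy objective are all assumptions chosen so the example runs quickly.

```python
import random


def es_step(params, score_fn, rng, population=30, sigma=0.1, lr=0.01):
    """One evolution-strategies update: perturb, score, move toward better copies."""
    noises, scores = [], []
    for _ in range(population):
        noise = [rng.gauss(0.0, 1.0) for _ in params]
        candidate = [p + sigma * n for p, n in zip(params, noise)]
        noises.append(noise)
        scores.append(score_fn(candidate))

    # Standardize scores so the step size does not depend on the reward scale
    # (a common convention, assumed here rather than taken from the paper).
    avg = sum(scores) / population
    std = (sum((s - avg) ** 2 for s in scores) / population) ** 0.5
    if std == 0.0:
        return list(params)  # every copy scored the same: no direction to move in
    weights = [(s - avg) / std for s in scores]

    # Weighted sum of the noise vectors: better-scoring copies pull harder.
    step = lr / (population * sigma)
    return [
        p + step * sum(w * noise[i] for w, noise in zip(weights, noises))
        for i, p in enumerate(params)
    ]


def toy_score(params):
    """Toy task: higher is better, with the maximum at (1.0, -1.0)."""
    target = [1.0, -1.0]
    return -sum((p - t) ** 2 for p, t in zip(params, target))


rng = random.Random(0)
params = [0.0, 0.0]
initial_score = toy_score(params)
for _ in range(200):
    params = es_step(params, toy_score, rng)
final_score = toy_score(params)
print(initial_score, final_score)
```

Note that `score_fn` is only ever called, never differentiated, which is the property that lets the method work with rewards that arrive only at the end of a long answer.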

Vaibhava Lakshmi Ravideshik

Research Lead @ Massachusetts Institute of Technology - Kellis Lab | LinkedIn Learning Instructor | Author - “Charting the Cosmos: AI’s expedition beyond Earth” | TSI Astronaut Candidate

20,557 followers 1y

🤖 The Agent-as-a-Judge evaluation framework for AI systems 🤖

What is it?
Agent-as-a-Judge is a novel framework that uses AI agents to evaluate other AI systems. Unlike traditional methods, it goes beyond just looking at final outcomes and delves into how these systems actually make decisions and solve problems.

Why is it needed?
Most current evaluations only look at the final product, missing the vital steps in the middle. This is like grading a student's final exam but never checking their homework or class participation. Moreover, having humans do the evaluations can be expensive, time-consuming, and sometimes inconsistent due to subjective opinions.

How does it work?
At its core, Agent-as-a-Judge integrates several specialized skills such as graph building, locating files, retrieving information, and checking requirements. It uses these skills to evaluate tasks from start to finish with the help of the DevAI benchmarking dataset, which consists of 55 real-world AI tasks. This approach gives a full picture of how an AI system works through every step, offering insights often ignored by conventional methods.

Why "Agent-as-a-judge"?
LLM-as-a-Judge vs. Agent-as-a-Judge: The traditional LLM-as-a-Judge approach evaluates AI systems mainly by looking at their final outputs, much like an exam result. Agent-as-a-Judge not only looks at these outputs but also evaluates how the AI got there, providing feedback on every stage of the process. This means it's like monitoring both the journey and the destination.
Intermediate Feedback: Agent-as-a-Judge provides rich, ongoing feedback during the task-solving process, much like a teacher guiding a student through each step of a math problem, not just checking the final answer.
System Complexity: While LLM-as-a-Judge focuses on static inputs and outputs, Agent-as-a-Judge uses multiple tools to get a holistic view, assessing not just what the AI does but how it does it.

Challenges and opportunities:
Although Agent-as-a-Judge is promising, it's important to note some challenges like optimizing its components and testing its adaptability beyond just coding tasks. Also, combining its strengths with other methods (like enhancing LLMs with retrieval skills) could create a powerful hybrid approach to AI evaluation.

What’s next?
Agent-as-a-Judge opens up exciting new possibilities for AI evaluation. As we refine this method, we pave the way for potentially phasing out human evaluations entirely.

Link to the paper -> https://lnkd.in/gfYrXpHt

#AI #Innovation #AgentAsAJudge #DevAI #AIDevelopment #MachineLearning

Asankhaya Sharma

7,310 followers 1y

🚀 Introducing the Generate README Eval

Today we unveil a new evaluation method that challenges LLMs to summarize entire GitHub repositories into comprehensive README files, a task that demands deep understanding of complex codebases and the ability to synthesize information effectively.

What sets this benchmark apart is its holistic approach to evaluation. We've gone beyond traditional NLP metrics like BLEU and ROUGE scores, incorporating critical dimensions such as structural similarity, code consistency, readability (using the Flesch Reading Ease Score), and information retrieval. This multifaceted evaluation provides a more nuanced and practical assessment of an LLM's capabilities in real-world scenarios.

Our initial findings are interesting:

The current state-of-the-art performer is Gemini-1.5-Flash-Exp-0827, showcasing impressive capabilities across various metrics.

We have uncovered an interesting trade-off: as we increase the context length to include more examples (for few-shot evaluation), we see a decline in information retrieval and readability scores. This suggests that LLMs struggle with perfect recall in larger contexts, potentially missing crucial information, a critical insight for those working on improving model performance at scale.

The benchmark is designed to handle repositories up to 100k tokens in size, allowing for comprehensive evaluation while remaining within the context limits of most frontier LLMs. This design choice enables us to test models on real-world, substantial codebases, providing insights that are directly applicable to practical scenarios.

Explore the benchmark here: https://lnkd.in/gMn5dehf

#AI #MachineLearning #Benchmarks #NLP #AIEvaluation #LargeLanguageModels

patched-codes/generate-readme-eval · Datasets at Hugging Face huggingface.co
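Of the dimensions listed in the post above, the Flesch Reading Ease Score is the one with a closed-form definition: 206.835 minus 1.015 times the average words per sentence, minus 84.6 times the average syllables per word. The sketch below applies that formula with a deliberately crude syllable counter based on vowel groups, so its scores are approximate; the tokenization rules and function names are assumptions and not the benchmark's own code.

```python
import re


def count_syllables(word: str) -> int:
    """Rough estimate: count vowel groups, with a crude silent-e adjustment."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))


simple = "The cat sat. The dog ran."
dense = "Comprehensive documentation necessitates extraordinary organizational deliberation."
print(flesch_reading_ease(simple), flesch_reading_ease(dense))
```

Higher scores mean easier reading, so short sentences with short words score well above long, polysyllabic ones.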

Jeremy Arancio

ML Engineer | Document AI Specialist | Turn enterprise-scale documents into profitable data products

13,830 followers 8mo

LLM-as-a-Judge is a dangerous way to do evaluation. In most cases, it just doesn't work.

While it gives a false impression of having a grasp on your system's performance, it lures you with general metrics such as correctness, faithfulness, or completeness. They hide several complexities:
- What does "completeness" mean for your application? In the case of a marketing AI assistant, what characterizes a complete post from an incomplete one? If the score goes higher, does it mean that the post is better?
- Often, these metrics are scores between 1-5. But what is the difference between a 3 and a 4? Does a 5 mean the output is perfect? What is perfect, then?
- If you "calibrate" the LLM-as-a-judge with scores given by users during a test session, how do you ensure the LLM scoring matches user expectations? If I arbitrarily set all scores to 4, will I perform better than your model?

However, if LLM-as-a-judge is limited, it doesn't mean it's impossible to evaluate your AI system. Here are some strategies you can adopt:

- Online evaluation is the new king in the GenAI era
Log and trace LLM outputs, retrieved chunks, routing… Each step of the process. Link it to user feedback as binary classification: was the final output good or bad? Then take a look at the data. Yourself, no LLM. Take notes and try to find patterns among good and bad traces. At this stage, you can use an LLM to help you find clusters within your notes, not the data. After taking this time, you'll already have some clues about how to improve the system.

- Evaluate deterministic steps that come before the final output
Was the retrieval system performant? Did the router correctly categorize the initial query? Those steps in the agentic system are deterministic, meaning you can evaluate them precisely:
Retriever: Hit Rate@k, Mean Reciprocal Rank, Precision, Recall
Router: Precision, Recall, F1-Score
Create a small benchmark, synthetically or not, to evaluate those steps offline. It enables you to improve them later on individually (hybrid search instead of vector search, fine-tune a small classifier instead of relying on LLMs…)

- Don't use tools that promise to externalize evaluation
Your evaluation system is your moat. If you externalize it, not only will you have no control over your AI application, but you also won't be able to improve it. This is your problem, your users, your revenue. Evaluate your system. Not a generic one. All problems are different. Yours is unique as well.

...

Those are some unequivocal ideas proposed by the AI community. Yet, I still see AI projects relying on LLM-as-a-judge and generic metrics among companies. Being able to evaluate your system gives you the power to improve it. So take the time to create the perfect evaluation for your use case.
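The retriever metrics named above are short enough to write by hand for a small offline benchmark. A minimal sketch, assuming each query has a ranked list of retrieved document IDs and a set of IDs labeled relevant; the function names and the sample data are illustrative.

```python
def hit_rate_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not results:
        raise ValueError("need at least one query")
    hits = sum(
        1 for ranked, rel in zip(results, relevant)
        if any(doc in rel for doc in ranked[:k])
    )
    return hits / len(results)


def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1 / rank of the first relevant document (0 when none is retrieved)."""
    if not results:
        raise ValueError("need at least one query")
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(results)


# Three benchmark queries: ranked retriever output and the labeled relevant IDs.
results = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d1"}, {"d6"}, {"d0"}]
print(hit_rate_at_k(results, relevant, k=2))
print(hit_rate_at_k(results, relevant, k=3))
print(mean_reciprocal_rank(results, relevant))
```

Rerunning the same benchmark after a change such as switching from vector search to hybrid search gives a like-for-like comparison of that one step.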
