It is only rarely that, after reading a research paper, I feel like giving the authors a standing ovation. But I felt that way after finishing Direct Preference Optimization (DPO) by Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher Manning, and Chelsea Finn. This beautiful paper proposes a much simpler alternative to RLHF (reinforcement learning from human feedback) for aligning language models to human preferences. RLHF has been a key technique for training LLMs. In brief, RLHF (i) gets humans to specify their preferences by ranking LLM outputs, (ii) trains a reward model (used to score LLM outputs) -- typically represented using a transformer network -- to be consistent with the human preferences, and (iii) uses reinforcement learning to tune an LLM, also represented as a transformer, to maximize rewards. This requires two transformer networks, and RLHF is also sensitive to the choice of hyperparameters. DPO simplifies the whole thing. Via a clever mathematical insight, the authors show that given an LLM, there is a specific reward function for which that LLM is optimal. DPO then trains the LLM directly to make the reward function (that’s now implicitly defined by the LLM) consistent with the human rankings. So you no longer need to deal with a separately represented reward function -- you just need the LLM transformer -- and you can train the LLM directly and more efficiently to optimize the same objective as RLHF. Although it’s still too early to be sure, I am cautiously optimistic that DPO will have a huge impact on LLMs and beyond in the next few years. I write more about this in The Batch (linked to below). https://lnkd.in/gteaE2z8 You can also read the paper here: https://lnkd.in/gJ-hx7wm
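To make the "implicit reward" idea concrete, here is a minimal sketch of the per-example DPO loss as given in the DPO paper. The post itself does not spell out the formula; the frozen reference model and the `beta` temperature come from the paper, and the function and argument names here are illustrative. Inputs are summed token log-probabilities of each response under the policy being trained and under the reference model.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss (sketch; beta=0.1 is just an illustrative default).

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x)).
    The loss is the negative log-probability, under a Bradley-Terry model,
    that the human-preferred response has the higher implicit reward.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    return -log_sigmoid(chosen_reward - rejected_reward)

# When the policy equals the reference, both implicit rewards are 0
# and the loss is log(2); it drops as the policy favours the chosen response.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No separate reward network appears anywhere in this computation, which is the simplification the post describes; in practice the reference model is typically a frozen copy of the starting LLM.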
AI Evaluation Methods
-
How do we know if we’re actually becoming an AI-first company? That’s the question two customers asked me this week—and it’s a really fair one. AI buzz is everywhere, but how do you know if you’re making real progress? Here are 5 metrics every company should track to measure whether they’re truly on the path to becoming AI-first: 1. Revenue per Employee (Lagging Indicator) The ultimate test of success with AI: are you generating more value for every employee you hire? AI should amplify output, not just automate tasks. When each person drives more revenue, you know productivity is compounding. 👉 It's the north star, but it takes time to move. 2. Customer Satisfaction (CSAT) (Lagging Indicator) AI-driven productivity is meaningless if customer experience suffers. CSAT should hold steady—or better yet, improve—as AI delivers faster, smarter, more personalized service. 👉 If it drops, your AI strategy is likely misaligned with customer needs. 3. % of Teams with Access to AI Tools (Leading Indicator) You can’t be AI-first if your teams aren’t equipped. Measure how many employees have easy access to approved AI tools and whether those tools are embedded in their daily workflow. 👉 Access is the foundation. No access, no adoption. 4. Active AI Usage (Daily/Weekly) by Team (Leading Indicator) This is where the rubber meets the road. Track actual usage. Who’s using AI every day or week? What teams are lagging behind? 👉 To be AI-first, every team should be using AI every week—if not every day. 5. % of Work Carried Out by Agents (by Function) (Leading Indicator) This is the most transformational shift. What % of your team’s output is now driven by agents or AI copilots? In marketing, it could be content drafting. In sales, meeting booking. In support, ticket resolution. 👉 When agents do the work, your people focus on higher-leverage thinking—and the flywheel starts turning. Bottom line: Becoming AI-first isn't about buying tools, it’s about changing how work gets done. 
When you combine these 5 metrics, you get a clear picture of progress—and the compounding path toward higher productivity, better outcomes, and real transformation. What would you add to the list?
-
You're a #CTO. Your board asks: "What's our ROI on AI coding tools?" Your answer: "40% of our code is AI-generated!" They respond: "So what? Are we shipping faster? Are customers happier?" Most CTOs are measuring AI impact completely wrong.
Here's what some are tracking:
- Percentage of AI-generated code
- Developer hours saved per week
- Lines of code produced
- AI tool adoption rates
These metrics are like measuring how fast your assembly line workers attach parts while ignoring whether your cars actually start.
Here's what you SHOULD measure instead:
1. Delivered business value
2. Customer cycle time
3. Development throughput
4. Quality and reliability
5. Total cost of delivery (not just development)
6. Team satisfaction
Software development isn't a typing competition; it's a complex system. If AI makes your developers 30% faster but your deployment takes 2 weeks and QA adds another week, your customer delivery improves by maybe 7%. You've sped up the wrong part. The solution: A/B test your teams. Give half your teams AI tools, measure business outcomes over 2-3 release cycles. Track what customers actually experience, not how much developers produce. Companies that measure business impact from AI will pull ahead. Those measuring vanity metrics will wonder why their expensive tools aren't moving the needle. Stop measuring how much code AI generates. Start measuring how much faster you deliver value to customers. What are you actually measuring? And is it moving your business forward? -> Follow me for more about building great tech organizations at scale. More insights in my book "All Hands on Tech"
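The "maybe 7%" figure can be checked with a few lines of arithmetic. The post does not say how long development itself takes, so this sketch assumes about one week of development and reads "30% faster" as 30% less development time; both are assumptions, and the function name is illustrative.

```python
def delivery_improvement(dev_weeks: float, other_weeks: float,
                         dev_time_reduction: float) -> float:
    """Fractional reduction in end-to-end delivery time when only the
    development stage gets faster (an Amdahl's-law style calculation)."""
    before = dev_weeks + other_weeks
    after = dev_weeks * (1.0 - dev_time_reduction) + other_weeks
    return (before - after) / before

# Assumed: 1 week of development; from the post: 2 weeks deployment + 1 week QA.
improvement = delivery_improvement(dev_weeks=1.0, other_weeks=3.0,
                                   dev_time_reduction=0.30)
print(f"End-to-end delivery improves by {improvement:.1%}")
```

Under these assumptions the improvement comes out at about 7.5%, in line with the post's estimate. The longer the non-development stages are relative to development, the smaller the end-to-end gain.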
-
LLM hallucinations aren't bugs, they're compression artefacts. And we just figured out how to predict them before they happen. 400 stars in one week, the reception has been unreal. Our toolkit is open source and anyone can use it. https://lnkd.in/e4s3X8GK When your LLM confidently states that "Napoleon won the Battle of Waterloo," it's not broken. It's doing exactly what it was trained to do: compress the entire internet into model weights, then decompress on demand. Sometimes, there isn't enough information to perfectly reconstruct rare facts, so it fills gaps with statistically plausible but wrong content. Think of it like a ZIP file corrupted during compression. The decompression algorithm still runs, but outputs garbage where data was lost. The breakthrough: We proved hallucinations occur when information budgets fall below mathematical thresholds. Using our Expectation-level Decompression Law (EDFL), we can calculate exactly how many bits of information are needed to prevent any specific hallucination, before generation even starts. This resolves a fundamental paradox: LLMs achieve near-perfect Bayesian performance on average, yet systematically fail on specific inputs. We proved they're "Bayesian in expectation, not in realisation", optimising average-case compression rather than worst-case reliability.
Why does this change everything? Instead of treating hallucinations as inevitable, we can now:
- Calculate risk scores before generating any text
- Set guaranteed error bounds (e.g. 95%)
- Know precisely when to gather more context vs. abstain
The full preprint is being released on arXiv this week. Until then, read the preprint PDF we uploaded here: https://lnkd.in/eRf_ecu3 The toolkit works with any OpenAI-compatible API. Zero retraining required. Provides mathematical SLA guarantees for compliance. Perfect for healthcare, finance, legal, anywhere errors aren't acceptable. The era of "trust me, bro" AI is ending. Welcome to bounded, predictable AI reliability.
Big thanks to Ahmed K. and Maggie C. for all the help putting this + the repo together! #AI #MachineLearning #ResponsibleAI #OpenSource #LLM #Innovation
-
IBM Research 𝗮𝗻𝗱 Yale University 𝗷𝘂𝘀𝘁 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝗱 𝗮 𝗳𝘂𝗹𝗹 360° 𝗿𝗲𝘃𝗶𝗲𝘄 𝗼𝗳 𝗵𝗼𝘄 𝘄𝗲 𝘁𝗲𝘀𝘁 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀. ⬇️ They looked at 120+ evaluation methods and mapped out what’s working and what’s missing. Currently everyone’s building AI agents. Almost no one agrees on how to properly evaluate them. This is critical, because without rigorous evaluation, we can’t trust these systems to be reliable, safe, or ready for real-world use. 𝗛𝗲𝗿𝗲’𝘀 𝘄𝗵𝗮𝘁 𝘀𝘁𝗮𝗻𝗱𝘀 𝗼𝘂𝘁: ⬇️
1. 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝗶𝗻𝗴 𝗮𝗴𝗲𝗻𝘁𝘀 ≠ 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝗶𝗻𝗴 𝗺𝗼𝗱𝗲𝗹𝘀 ➜ Agents aren’t static LLMs. They act, adapt, and evolve. Old-school metrics can’t keep up with real-world autonomy.
2. 𝗥𝗲𝗳𝗹𝗲𝗰𝘁𝗶𝗼𝗻 𝗶𝘀 𝗺𝗲𝗮𝘀𝘂𝗿𝗮𝗯𝗹𝗲 𝗻𝗼𝘄 ➜ Benchmarks like LLF-Bench evaluate how agents process feedback and course-correct (which is crucial for evaluation quality). Without this, agents just repeat their mistakes.
3. 𝗖𝗼𝘀𝘁-𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆 𝗶𝘀 𝗯𝗲𝗶𝗻𝗴 𝗶𝗴𝗻𝗼𝗿𝗲𝗱, 𝗱𝗮𝗻𝗴𝗲𝗿𝗼𝘂𝘀𝗹𝘆 ➜ Top agents burn insane tokens and API calls. We need benchmarks that track performance and price. Otherwise no one can afford to deploy them.
4. 𝗙𝗼𝘂𝗿 𝘀𝗸𝗶𝗹𝗹𝘀 𝗱𝗲𝗳𝗶𝗻𝗲 𝘁𝗼𝗽-𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗶𝗻𝗴 𝗮𝗴𝗲𝗻𝘁𝘀 ➜ It's critical to evaluate each individual component; otherwise, key weaknesses can go unnoticed and compromise the overall performance:
* Breaking down complex tasks (planning)
* Using tools and APIs (tool use)
* Learning from feedback (reflection)
* Remembering previous steps (memory)
5. 𝗧𝗲𝘀𝘁𝗶𝗻𝗴 𝗶𝘀 𝗯𝗲𝗰𝗼𝗺𝗶𝗻𝗴 𝗺𝗼𝗿𝗲 𝗿𝗲𝗮𝗹𝗶𝘀𝘁𝗶𝗰 ➜ New benchmarks simulate actual jobs:
* Online shopping (WebArena)
* Debugging code (SWE-Bench)
* Helping customers (τ-bench)
* Research tasks (PaperBench)
* Multi-step workflows (OSWorld, CRMWorld)
𝗪𝗮𝗻𝘁 𝗺𝗼𝗿𝗲 𝗯𝗿𝗲𝗮𝗸𝗱𝗼𝘄𝗻𝘀 𝗹𝗶𝗸𝗲 𝘁𝗵𝗶𝘀? Subscribe to Human in the Loop, my new weekly deep dive on AI agents, real-world tools, and strategic insights: https://lnkd.in/dbf74Y9E
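One way to report "performance and price" together, as point 3 asks, is to keep only the agents that no other agent beats on both axes (a Pareto frontier). This is a sketch of that idea, not something taken from the survey; the agent names and numbers below are made up purely for illustration.

```python
from typing import List, NamedTuple

class AgentResult(NamedTuple):
    name: str
    success_rate: float   # fraction of benchmark tasks solved
    cost_per_task: float  # e.g. USD of tokens and API calls per task

def pareto_frontier(results: List[AgentResult]) -> List[AgentResult]:
    """Return the agents that are not dominated, cheapest first.

    Agent b dominates agent a when b is at least as good on both axes
    (higher-or-equal success, lower-or-equal cost) and strictly better on one.
    """
    frontier = []
    for a in results:
        dominated = any(
            b.success_rate >= a.success_rate
            and b.cost_per_task <= a.cost_per_task
            and (b.success_rate > a.success_rate
                 or b.cost_per_task < a.cost_per_task)
            for b in results
        )
        if not dominated:
            frontier.append(a)
    return sorted(frontier, key=lambda r: r.cost_per_task)

# Hypothetical benchmark results, for illustration only.
results = [
    AgentResult("agent_a", 0.62, 4.10),
    AgentResult("agent_b", 0.60, 0.90),
    AgentResult("agent_c", 0.45, 1.20),
    AgentResult("agent_d", 0.30, 0.15),
]
frontier_names = [r.name for r in pareto_frontier(results)]
```

Here `agent_c` drops out because `agent_b` is both cheaper and more successful, while the most accurate agent stays on the frontier despite costing several times more per task.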
-
Reliability, evaluation, and “hallucination anxiety” are where most AI programmes quietly stall. Not because the model is weak. Because the system around it is not built to scale trust. When companies move beyond demos, three hard questions appear: →Can we rely on this output? →Do we know what “good” actually looks like? →How much human oversight is enough? The fix is not better prompting. It is a strategy and operating discipline. 𝐅𝐢𝐫𝐬𝐭: Define reliability like a product, not a vibe. Every serious AI use case should have a one-page SLO sheet with measurable targets across: →Task success ↳Right-first-time rate and rubric-based acceptance →Factual grounding ↳Evidence coverage and unsupported-claim tracking →Safety and compliance ↳Policy violations and PII leakage →Operational quality ↳Latency, cost per task, escalation to humans Now “good” is no longer opinion. It is observable. 𝐒𝐞𝐜𝐨𝐧𝐝: evaluation must be continuous, not a one-off demo test. Use a simple loop: 𝐏lan: Define rubrics, datasets, and risk tiers 𝐃o: Run offline evaluations and limited pilots 𝐂heck: Monitor drift and regressions weekly 𝐀ct: Update prompts, data, guardrails, and workflows Support this with an AI test pyramid: →Unit checks for prompts and tool behaviour →Scenario tests for real edge failures →Regression benchmarks to prevent backsliding →Live monitoring in production Add statistical control charts, and you can detect silent degradation before users do. 𝐓𝐡𝐢𝐫𝐝: reduce hallucinations by design. →Run a short failure-mode workshop and engineer controls: →Require retrieval or evidence before answering →Allow safe abstention instead of confident guessing →Add claim checking and tool validation →Use structured intake and clarifying flows You are not asking the model to behave. You are designing a system that expects failure and contains it. 𝐅𝐨𝐮𝐫𝐭𝐡: make human-in-the-loop affordable. 
Tier risk: →Low risk: Light sampling →Medium risk: Triggered review →High risk: Mandatory approval Escalate only when signals demand it: low confidence, missing evidence, policy flags, or novelty spikes. Review becomes targeted, fast, and a source of improvement data. 𝐅𝐢𝐧𝐚𝐥𝐥𝐲: Operate it like a capability. Track outcomes, risk, delivery speed, and cost on a single dashboard. Hold a short weekly reliability stand-up focused on regressions, failure modes, and ownership. What you end up with is simple: ↳Use case catalogue with risk tiers ↳Clear SLOs and error budgets ↳Continuous evaluation harness ↳Built-in controls ↳Targeted human review ↳Reliability cadence AI does not scale on intelligence alone. It scales on measurable trust. ♻️ Share if you found this useful. ➕ Follow (Jyothish Nair) for reflections on AI, change, and human-centred AI #AI #AIReliability #TrustAtScale #OperationalExcellence
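The tiered review policy in the post can be sketched as a small routing function. This is one possible reading, not the author's implementation: the thresholds (`min_confidence`, `sample_rate`), the field names, and the choice to apply escalation signals only to the medium tier are all assumptions. The random draw is passed in so the behaviour is reproducible.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    confidence: float     # model or verifier confidence in [0, 1]
    has_evidence: bool    # was supporting evidence retrieved?
    policy_flag: bool     # did a policy or PII check fire?
    novelty_spike: bool   # is the input unlike anything seen before?

def needs_human_review(risk_tier: str, signals: Signals, sample_draw: float,
                       min_confidence: float = 0.7,
                       sample_rate: float = 0.05) -> bool:
    """Decide whether an output goes to a human reviewer.

    high   -> mandatory approval
    medium -> review only when an escalation signal triggers
    low    -> light random sampling (sample_draw is a uniform draw in [0, 1))
    """
    if risk_tier == "high":
        return True
    if risk_tier == "medium":
        return (signals.confidence < min_confidence
                or not signals.has_evidence
                or signals.policy_flag
                or signals.novelty_spike)
    if risk_tier == "low":
        return sample_draw < sample_rate
    raise ValueError(f"unknown risk tier: {risk_tier!r}")

clean = Signals(confidence=0.95, has_evidence=True,
                policy_flag=False, novelty_spike=False)
shaky = Signals(confidence=0.40, has_evidence=True,
                policy_flag=False, novelty_spike=False)
```

In production the caller would pass `random.random()` as `sample_draw`; logging each decision alongside its signals gives the "improvement data" the post mentions.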
-
“A Survey on LLM-as-a-Judge” outlines what could become a foundational shift in how we evaluate AI systems, and the paper is very insightful. The idea is simple, but profound: use LLMs not just to generate content, but to judge it across tasks like summarization, reasoning, classification, and beyond. Why does this matter? Because traditional evaluation methods no longer scale: - Human reviews are expensive, inconsistent, and hard to reproduce. - Automatic metrics like BLEU and ROUGE fail to capture meaning, nuance, or utility. LLM-as-a-Judge offers a compelling alternative: scalable, nuanced, and surprisingly aligned with expert judgment when done right. What makes this paper stand out is the depth and structure it brings to a chaotic space. It: 1. Defines a clear taxonomy of evaluation methods (scoring, pairwise, yes/no, multi-choice) 2. Details the full pipeline from prompt design to model selection to post-processing 3. Surfaces real risks (biases, hallucinations, format brittleness) and proposes mitigation strategies 4. Introduces benchmarks and best practices for evaluating the evaluators themselves In short, it turns a loose idea into a playbook. In the enterprise, “LLM-as-a-Judge” could soon underpin everything from agentic workflows to data labeling, model selection, and QA. It’s a new infrastructure layer, and it demands as much rigor as the models it oversees. Highly recommend reading the full paper if you’re building or deploying GenAI at scale. Link to paper: https://lnkd.in/gsVf6_Zh
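As an illustration of the pairwise mode and of bias mitigation, here is a minimal sketch that asks a judge twice with the answer order swapped and only accepts a verdict when both orderings agree. Order swapping is a commonly used guard against position bias; the post does not say which mitigations the survey recommends, so treat this as an assumption. The `judge` callable is hypothetical: in practice it would wrap an LLM call, and here two toy stand-ins are used so the code runs on its own.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns
# "first", "second", or "tie". Hypothetical interface for this sketch.
Judge = Callable[[str, str, str], str]

def pairwise_verdict(judge: Judge, question: str,
                     answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie", requiring agreement across both orderings."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    forward_pick = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_pick = {"first": "B", "second": "A", "tie": "tie"}[backward]
    return forward_pick if forward_pick == backward_pick else "tie"

def position_biased_judge(question: str, first: str, second: str) -> str:
    """Toy judge that always prefers whichever answer is shown first."""
    return "first"

def length_judge(question: str, first: str, second: str) -> str:
    """Toy judge that prefers the longer answer regardless of position."""
    if len(first) > len(second):
        return "first"
    if len(first) < len(second):
        return "second"
    return "tie"

q = "What is the capital of France?"
biased = pairwise_verdict(position_biased_judge, q, "Paris.", "Lyon.")
stable = pairwise_verdict(length_judge, q, "Paris, on the Seine.", "Lyon.")
```

The position-biased judge is neutralised to a tie, while a judge whose preference does not depend on order keeps its verdict. Note that this does nothing about other biases: the length-preferring judge is still rewarding verbosity.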
-
The most dangerous thing about hallucinations in AI isn't that they're wrong. It's that they don't look wrong. You ask for a source, it gives you a figment. You ask for facts, it makes them up. It doesn’t just lie - it lies eloquently, with citations, formatting, and a tone that screams “trust me.” Just enough jargon to fool the average reader- and sometimes, the expert. In consumer settings, a hallucination is annoying. In a courtroom, hospital, or trading desk, it's catastrophic. That’s why hallucinations are the biggest blocker to AI adoption: they turn an otherwise brilliant assistant into that unreliable coworker whose numbers you always have to double-check. At best, they waste time. At worst, they create liability. Researchers have thrown the kitchen sink at hallucinations: ▪️ Retrieval-Augmented Generation (RAG) - Give the model a search engine sidekick. Instead of free-styling from memory, it fetches real documents, so it answers with receipts. ▪️Self-Critique Loops - Tools like SelfCheckGPT or Chain of Verification reread outputs like a paranoid editor. ▪️Fine-Tuning with Human Feedback - Pavlov method: humans reward outputs that look good. ▪️Conservative Decoding - Language models have a 'creativity dial'. High temperature makes them improvise like jazz musicians; low temperature makes them stick to the teleprompter. These techniques work, but trade-offs loom: accuracy costs latency and compute; grounding kills creativity. Which is why many teams now run two modes - “idea jam” (high temp, hallucinations tolerated) and “serious business” (low temp + retrieval + guardrails). Last week, OpenAI released a new paper titled “Why language models hallucinate”. Their core point: hallucinations aren’t just an artifact of messy training data or exotic transformer math - they’re the rational outcome of a badly designed reward system. 
Current benchmarks reward certainty and correctness but don’t penalize confident errors or give credit for saying “I don’t know.” This can implicitly push models to guess. RLHF today trains models to be helpful, harmless, polite. Human raters tend to upvote answers that are fluent and well-structured even if they're factually shaky. This optimizes for charm, not epistemic hygiene. OpenAI argues for a new system: reward calibrated uncertainty and punish confident wrongs. In other words, give points for “I don’t know” and dock points for swaggering mistakes. So while both approaches use reinforcement, the values baked in are different. - RLHF gave us ambitious interns - always have an answer, always sound polished. - OpenAI is pushing for seasoned experts - confident when right, silent when not. It’s corporate culture 101. Promote people for speaking up regardless of accuracy, and you’ll soon have a room full of confident nonsense.
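The incentive argument can be made concrete with a toy scoring rule. In this sketch a correct answer earns 1 point, "I don't know" earns 0, and a wrong answer loses `wrong_penalty` points. The specific point values are assumptions chosen for illustration, not the paper's exact scheme. A model that is right with probability p then gains from guessing only when p > wrong_penalty / (1 + wrong_penalty).

```python
from typing import Optional

def grade(outcome: Optional[bool], wrong_penalty: float = 0.0) -> float:
    """Score one answer: True = correct, False = wrong, None = abstained."""
    if outcome is None:
        return 0.0
    return 1.0 if outcome else -wrong_penalty

def expected_score_of_guessing(p_correct: float, wrong_penalty: float) -> float:
    """Expected score if the model answers with probability p_correct of being right."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when guessing beats the 0 points earned by abstaining."""
    return expected_score_of_guessing(p_correct, wrong_penalty) > 0.0

# Accuracy-only grading (no penalty): even a 20% hunch is worth guessing.
guess_under_accuracy = should_answer(0.2, wrong_penalty=0.0)
# With a penalty of 3, the break-even confidence is 3 / (1 + 3) = 0.75.
guess_when_unsure = should_answer(0.2, wrong_penalty=3.0)
guess_when_confident = should_answer(0.9, wrong_penalty=3.0)
```

With no penalty, guessing is never worse than abstaining, which is the pressure toward confident errors described above; raising the penalty raises the confidence a model needs before answering is the better policy.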
-
The new consulting edge isn't AI. It's knowing when your AI is wrong. Every consultant has been there: You ask AI to analyze documents and generate insights. During review, you spot a questionable stat that doesn't exist in the source! AI hallucinations are a problem. The solution? Implementing "prompt evals". → Prompt evals: directions that force AI to verify its own work before responding. A formula for effective evals: 1. Assign a verification role → "Act as a critical fact-checker whose reputation depends on accuracy" 2. Specify what to verify → "Check all revenue projections against the quarterly reports in the appendix" 3. Define success criteria → "Include specific page references for every statistic" 4. Establish clear terminology → "Rate confidence as High/Medium/Low next to each insight" Here is how your prompt will change: OLD: "Analyze these reports and identify opportunities." NEW: "You are a senior analyst known for accuracy. List growth opportunities from the reports. For each insight, match financials to appendix B, match market claims to bibliography sources, add page ref + High/Med/Low confidence, otherwise write REQUIRES VERIFICATION.” Mastering this takes practice, but the results are worth it. What AI leaders know that most don't: "If there is one thing we can teach people, it's that writing evals is probably the most important thing." Mike Krieger, Anthropic CPO By the time most learn basic prompting, leaders will have turned verification into their competitive advantage. Steps to level-up your eval skills: → Log hallucinations in a "failure library" → Create industry-specific eval templates → Test evals with known error examples → Compare verification with competitors Next time you're presented with AI-generated analysis, the most valuable question isn't about the findings themselves, but: 'What evals did you run to verify this?' 
This simple inquiry will elevate your team's approach to AI & signal that in your organization, accuracy isn't optional.
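The four-part formula from the post (role, what to verify, success criteria, terminology) lends itself to a reusable template. The sketch below simply assembles those parts into a prompt string; the function name and parameters are illustrative, and the wording it produces is a paraphrase of the post's "NEW" prompt rather than a tested recipe.

```python
from typing import Sequence

def build_eval_prompt(role: str, task: str, checks: Sequence[str],
                      success_criteria: str,
                      confidence_scale: Sequence[str] = ("High", "Medium", "Low"),
                      fallback: str = "REQUIRES VERIFICATION") -> str:
    """Assemble a verification-oriented prompt from the four-part formula."""
    lines = [f"You are {role}.", task, "For each insight:"]
    lines += [f"- {check}" for check in checks]          # what to verify
    lines.append(f"- {success_criteria}")                # success criteria
    lines.append(f"- rate confidence as {'/'.join(confidence_scale)}")
    lines.append(f"If an insight cannot be verified, write {fallback}.")
    return "\n".join(lines)

prompt = build_eval_prompt(
    role="a senior analyst known for accuracy",
    task="List growth opportunities from the reports.",
    checks=["match financials to appendix B",
            "match market claims to bibliography sources"],
    success_criteria="include a page reference for every statistic",
)
print(prompt)
```

Keeping the template in code makes it easy to pair with the "failure library" idea: each logged hallucination can become a test input that the prompt is rerun against.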
-
Over the last year, I’ve seen many people fall into the same trap: They launch an AI-powered agent (chatbot, assistant, support tool, etc.)… But only track surface-level KPIs — like response time or number of users. That’s not enough. To create AI systems that actually deliver value, we need 𝗵𝗼𝗹𝗶𝘀𝘁𝗶𝗰, 𝗵𝘂𝗺𝗮𝗻-𝗰𝗲𝗻𝘁𝗿𝗶𝗰 𝗺𝗲𝘁𝗿𝗶𝗰𝘀 that reflect: • User trust • Task success • Business impact • Experience quality This infographic highlights 15 𝘦𝘴𝘴𝘦𝘯𝘵𝘪𝘢𝘭 dimensions to consider: ↳ 𝗥𝗲𝘀𝗽𝗼𝗻𝘀𝗲 𝗔𝗰𝗰𝘂𝗿𝗮𝗰𝘆 — Are your AI answers actually useful and correct? ↳ 𝗧𝗮𝘀𝗸 𝗖𝗼𝗺𝗽𝗹𝗲𝘁𝗶𝗼𝗻 𝗥𝗮𝘁𝗲 — Can the agent complete full workflows, not just answer trivia? ↳ 𝗟𝗮𝘁𝗲𝗻𝗰𝘆 — Response speed still matters, especially in production. ↳ 𝗨𝘀𝗲𝗿 𝗘𝗻𝗴𝗮𝗴𝗲𝗺𝗲𝗻𝘁 — How often are users returning or interacting meaningfully? ↳ 𝗦𝘂𝗰𝗰𝗲𝘀𝘀 𝗥𝗮𝘁𝗲 — Did the user achieve their goal? This is your north star. ↳ 𝗘𝗿𝗿𝗼𝗿 𝗥𝗮𝘁𝗲 — Irrelevant or wrong responses? That’s friction. ↳ 𝗦𝗲𝘀𝘀𝗶𝗼𝗻 𝗗𝘂𝗿𝗮𝘁𝗶𝗼𝗻 — Longer isn’t always better — it depends on the goal. ↳ 𝗨𝘀𝗲𝗿 𝗥𝗲𝘁𝗲𝗻𝘁𝗶𝗼𝗻 — Are users coming back 𝘢𝘧𝘵𝘦𝘳 the first experience? ↳ 𝗖𝗼𝘀𝘁 𝗽𝗲𝗿 𝗜𝗻𝘁𝗲𝗿𝗮𝗰𝘁𝗶𝗼𝗻 — Especially critical at scale. Budget-wise agents win. ↳ 𝗖𝗼𝗻𝘃𝗲𝗿𝘀𝗮𝘁𝗶𝗼𝗻 𝗗𝗲𝗽𝘁𝗵 — Can the agent handle follow-ups and multi-turn dialogue? ↳ 𝗨𝘀𝗲𝗿 𝗦𝗮𝘁𝗶𝘀𝗳𝗮𝗰𝘁𝗶𝗼𝗻 𝗦𝗰𝗼𝗿𝗲 — Feedback from actual users is gold. ↳ 𝗖𝗼𝗻𝘁𝗲𝘅𝘁𝘂𝗮𝗹 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 — Can your AI 𝘳𝘦𝘮𝘦𝘮𝘣𝘦𝘳 𝘢𝘯𝘥 𝘳𝘦𝘧𝘦𝘳 to earlier inputs? ↳ 𝗦𝗰𝗮𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆 — Can it handle volume 𝘸𝘪𝘵𝘩𝘰𝘶𝘵 degrading performance? ↳ 𝗞𝗻𝗼𝘄𝗹𝗲𝗱𝗴𝗲 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆 — This is key for RAG-based agents. ↳ 𝗔𝗱𝗮𝗽𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗦𝗰𝗼𝗿𝗲 — Is your AI learning and improving over time? If you're building or managing AI agents — bookmark this. Whether it's a support bot, GenAI assistant, or a multi-agent system — these are the metrics that will shape real-world success. 𝗗𝗶𝗱 𝗜 𝗺𝗶𝘀𝘀 𝗮𝗻𝘆 𝗰𝗿𝗶𝘁𝗶𝗰𝗮𝗹 𝗼𝗻𝗲𝘀 𝘆𝗼𝘂 𝘂𝘀𝗲 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗽𝗿𝗼𝗷𝗲𝗰𝘁𝘀? Let’s make this list even stronger — drop your thoughts 👇
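Several of these dimensions can be computed directly from interaction logs. The sketch below covers four of them (task completion rate, error rate, latency, cost per interaction); the log schema is a made-up example, and real systems would need their own definitions of "completed" and "error".

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Interaction:
    task_completed: bool  # did the agent finish the user's workflow?
    had_error: bool       # was any response irrelevant or wrong?
    latency_s: float      # response time in seconds
    cost_usd: float       # model and tool spend for this interaction

def summarize(log: List[Interaction]) -> Dict[str, float]:
    """Aggregate a few agent metrics over a list of logged interactions."""
    n = len(log)
    if n == 0:
        raise ValueError("cannot summarize an empty log")
    return {
        "task_completion_rate": sum(i.task_completed for i in log) / n,
        "error_rate": sum(i.had_error for i in log) / n,
        "mean_latency_s": sum(i.latency_s for i in log) / n,
        "cost_per_interaction": sum(i.cost_usd for i in log) / n,
    }

# Hypothetical log entries, for illustration only.
log = [
    Interaction(True, False, 1.0, 0.01),
    Interaction(True, False, 2.0, 0.02),
    Interaction(False, True, 3.0, 0.03),
    Interaction(True, False, 2.0, 0.02),
]
summary = summarize(log)
```

Metrics such as user trust, retention, and satisfaction need data beyond per-interaction logs (surveys, return visits), so they are left out of this sketch.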