LLM Evaluation Tools

Explore top LinkedIn content from expert professionals.

  • View profile for Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,493 followers

    Are Your LLM Rerankers Actually Good at Handling Novel Queries?

    New research from the Universität Innsbruck challenges a fundamental assumption in information retrieval: that state-of-the-art reranking models generalize well to unseen content.

    The Hidden Problem: Most benchmarks like TREC DL19/DL20 and BEIR contain queries that overlap with LLM training data. This contamination makes it nearly impossible to assess true generalization capability. The research introduces FutureQueryEval, a dataset with 148 queries collected after April 2025, ensuring zero overlap with existing model training cutoffs.

    Technical Deep Dive: The study evaluates 22 methods across three core paradigms:

    Pointwise Reranking scores query-document pairs independently with O(n) complexity. Models like MonoT5 use T5's encoder-decoder architecture with prompts like "Query: q Document: d Relevant:" to predict relevance probabilities. The challenge? Inconsistent score calibration across different prompts and heavy reliance on scoring APIs that many generation-only LLMs lack.

    Pairwise Reranking compares document pairs using prompts to determine relative relevance, aggregating results through methods like Heapsort (O(n log n)) or sliding windows (O(n)). PRP-FLAN-UL2 leads here, but the approach struggles with transitivity issues and scales poorly due to quadratic complexity in naive implementations.

    Listwise Reranking processes multiple documents simultaneously, with models like RankGPT generating identifier permutations (e.g., " > ") to capture inter-document relationships. While achieving O(n) complexity with sliding windows, these methods face challenges with long contexts and positional biases.

    The Surprising Results: On familiar benchmarks, RankGPT-GPT-4 dominates with 75.59 nDCG@10 on DL19. But on FutureQueryEval? Performance drops 5-15% across all categories. Listwise methods show the smallest degradation (8%), suggesting inter-document modeling provides better robustness. Meanwhile, fine-tuned models like MonoT5-3B (60.75 nDCG@10) and TWOLAR-XL (60.03) maintain strong performance, while lightweight options like FlashRank-MiniLM balance efficiency with 55.43 nDCG@10.

    Under the Hood: The key differentiator is how models handle context. Pointwise methods treat each document independently, missing relationship signals. Pairwise methods capture relative preferences but struggle with consistency. Listwise approaches like Zephyr-7B (62.65 nDCG@10 on novel queries) excel by modeling full document lists through attention mechanisms that weigh inter-document relevance simultaneously.

    The research exposes a critical limitation: claims of "generalization" based on standard benchmarks may be overstated. As retrieval systems increasingly power RAG applications and enterprise search, understanding how rerankers perform on truly unseen content becomes essential for building reliable AI systems.
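
The scores quoted in this post are nDCG@10 values, apparently reported on a 0-100 scale. As a rough illustration of what that metric measures, here is a minimal sketch of nDCG@k using the common exponential-gain formulation. The function names are hypothetical, and the sketch assumes the input list holds the graded relevance labels of all judged documents in the order the reranker returned them; the study's own evaluation code may differ.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so the discount is log2(position + 1)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering.

    Assumption: `relevances` covers every judged document for the query,
    so sorting it in descending order yields the ideal ranking.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Made-up relevance labels, listed in the order a reranker returned the documents.
perfect_order = [3, 2, 1, 0]
worst_order = [0, 1, 2, 3]
print(ndcg_at_k(perfect_order), ndcg_at_k(worst_order))
```

A perfect ordering scores 1.0 (100 on the scale used above), and pushing relevant documents down the list lowers the score.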

  • View profile for Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    42,074 followers

    A new study shows that even the best financial LLMs hallucinate 41% of the time when faced with unexpected inputs.

    FailSafeQA, a new benchmark from Writer, tests LLM robustness in finance by simulating real-world mishaps, including misspelled queries, incomplete questions, irrelevant documents, and OCR-induced errors.

    Evaluating 24 top models revealed that:
    * OpenAI’s o3-mini, the most robust, hallucinated in 41% of perturbed cases
    * Palmyra-Fin-128k-Instruct, the model best at refusing irrelevant queries, still struggled 17% of the time

    FailSafeQA uniquely measures:
    (1) Robustness - performance across query perturbations (e.g., misspelled, incomplete)
    (2) Context Grounding - the ability to avoid hallucinations when context is missing or irrelevant
    (3) Compliance - balancing robustness and grounding to minimize false responses

    Developers building financial applications should implement explicit error handling that gracefully addresses context issues, rather than solely relying on model robustness. Developing systems to proactively detect and respond to problematic queries can significantly reduce costly hallucinations and enhance trust in LLM-powered financial apps.

    Benchmark details: https://lnkd.in/gq-mijcD
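
One way to read the advice about explicit error handling is a pre-flight check that runs before the model is called at all. The sketch below is only an assumption about what such a check could look like: the function name, the issue labels, and both thresholds are hypothetical, and plain token overlap is a crude stand-in for a real relevance check. It is not part of FailSafeQA.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; deliberately simple."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def preflight_issues(
    query: str,
    context: str,
    min_query_tokens: int = 3,   # hypothetical threshold
    min_overlap: float = 0.2,    # hypothetical threshold
) -> List[str]:
    """Return a list of detected input problems; an empty list means 'proceed'."""
    issues = []
    query_tokens = _tokens(query)

    if not context.strip():
        issues.append("missing_context")
    if len(query_tokens) < min_query_tokens:
        issues.append("incomplete_query")
    if context.strip() and query_tokens:
        # Share of query tokens that also appear in the context.
        overlap = len(query_tokens & _tokens(context)) / len(query_tokens)
        if overlap < min_overlap:
            issues.append("irrelevant_context")
    return issues


# Made-up inputs for illustration only.
print(preflight_issues("What was net revenue in fiscal 2023?", ""))
print(preflight_issues(
    "What was net revenue in fiscal 2023?",
    "The cafeteria menu changes every Monday.",
))
```

An application could map each label to a fixed behavior (ask the user to rephrase, request a document, or decline) instead of letting the model improvise an answer.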

  • View profile for Sneha Vijaykumar

    Data Scientist @ Takeda | Ex-Shell | Gen AI | LLM | RAG | AI Agents | Azure | NLP | AWS

    25,283 followers

    If you’ve ever shipped a GenAI model to production, you already know the real interview isn’t about transformers, it’s about everything that breaks the moment real users touch your system.

    1) How would you evaluate an LLM powering a Q&A system?
    Approach: Don’t talk about accuracy alone. Break it down into:
    ✅ Functional metrics: exact match, F1, BLEU, ROUGE depending on task.
    ✅ Safety metrics: hallucination rate, refusal rate, PII leakage.
    ✅ User-facing metrics: latency, token cost, answer completeness.
    ✅ Human evaluation: rubric-based scoring from SMEs when answers aren’t deterministic.
    ✅ A/B tests: compare model variants on real user flows.

    2) How do you handle hallucinations in production?
    Approach: Show you understand layered mitigation:
    ✅ Retrieval first (RAG) to ground the model.
    ✅ Constrain the prompt: citations, “answer only from provided context,” JSON schemas.
    ✅ Post-generation validation like fact-checking rules or context-overlap checks.
    ✅ Fall-back behaviors when confidence is low: ask for clarification, return source snippets, route to human.

    3) You’re asked to improve retrieval quality in a RAG pipeline. What do you check first?
    Approach: Walk through a debugging flow:
    ✅ Check document chunking (size, overlap, boundaries).
    ✅ Evaluate embedding model suitability for domain.
    ✅ Inspect vector store configuration (HNSW params, top_k).
    ✅ Run retrieval diagnostics: is the top_k relevant to the question?
    ✅ Add metadata filters or rerankers (cross-encoder, ColBERT-style scoring).

    4) How do you monitor a GenAI system after deployment?
    Approach: Make it clear that monitoring isn’t optional:
    ✅ Latency and cost per request.
    ✅ Token distribution shifts (prompt bloat).
    ✅ Hallucination drift from user conversations.
    ✅ Guardrail violations and safety triggers.
    ✅ Retrieval hit rate and query types.
    ✅ Feedback loops from thumbs up/down or human review.

    5) How do you decide between fine-tuning and using RAG?
    Approach: Use a decision tree mentality:
    ✅ If the issue is knowledge freshness, go with RAG.
    ✅ If the issue is formatting/style, go with fine-tuning.
    ✅ If the model needs domain reasoning, consider fine-tuning or LoRA.
    ✅ If the data is large and structured, use RAG + reranking before touching training.

    Most interviews test what you know. GenAI interviews test what you’ve survived.

    Follow Sneha Vijaykumar for more... 😊 #genai #datascience #rag #production #interview #questions #careergrowth #prep
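
For the functional metrics named under question 1, exact match and token-level F1 are easy to compute by hand. The sketch below follows the normalization commonly used for extractive QA (lowercase, strip punctuation and English articles); the function names are hypothetical, and a real project may prefer an established evaluation library over this hand-rolled version.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up prediction/reference pairs for illustration.
print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(token_f1("Paris, France", "Paris"), 3))
```

Exact match is strict, while token F1 gives partial credit when the prediction contains the reference plus extra words.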

  • View profile for Daniel Lee

    Ship AI @ JoinAI | Founder @ DataInterview | Ex-Google

    152,259 followers

    Evaluating ML is easy. Use metrics like AUC or MSE. But what about LLMs? ↓

    LLM evaluation is not easy. Unless the task is a simple classification like flagging an email as ham or spam, it's difficult since...
    ☒ Manual review is costly
    ☒ Task input/output is open-ended
    ☒ Benchmarks like MMLU are too generic for custom cases

    So, how do you evaluate at scale? Here are 3 strategies to employ ↓

    1. Semantic Similarity
    Two texts with similar meanings will have embedding vectors that are close together. Use cosine similarity to compare ideal output samples with LLM-generated responses. A higher score indicates a better response.

    2. LLM-as-a-Judge
    Getting a human to evaluate LLM output is costly. So, create an LLM agent that mimics a human reviewer. Create a prompt with a grading rubric with examples. Get the reviewer agent to evaluate the main agent on a scale.

    3. Explicit Feedback
    Add a UI to the chat interface to gather thumbs up/down and re-generation feedback. This helps measure the quality of the output from the users themselves.

    With this feedback loop in place, optimize your LLM system with prompt engineering, fine-tuning, RAG, and other techniques.

    Let's bounce ideas around. How do you evaluate LLMs? Drop a comment ↓
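
The semantic-similarity strategy comes down to one formula: the cosine of the angle between two embedding vectors. Below is a minimal, dependency-free sketch. The three-dimensional vectors are made-up placeholders; in practice they would come from whatever embedding model the team uses, which this sketch does not assume.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Placeholder embeddings (hypothetical values, not real model output).
ideal_answer = [0.9, 0.1, 0.3]
generated_answer = [0.8, 0.2, 0.4]
print(round(cosine_similarity(ideal_answer, generated_answer), 3))
```

A score near 1 means the generated response points in almost the same direction as the ideal sample in embedding space; what counts as "close enough" is something each team has to calibrate on its own data.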

  • View profile for Luke Yun

    Founder @ Decisive Machines | AI Researcher @ Harvard Medical School

    33,152 followers

    Stanford researchers put today’s largest language models head-to-head with expert Cochrane systematic reviews, and the experts are still winning.

    MedEvidence is a benchmark to directly test whether LLMs can reach clinician-grade conclusions using the same studies.

    1. Constructed a 284-question benchmark from 100 Cochrane reviews (10 specialties) and linked every question to its 329 source studies.
    2. Benchmarked 24 LLMs ranging from 7B to 671B parameters; the leader (DeepSeek V3) matched expert answers only 62% of the time while GPT-4.1 reached 60%, leaving a 37% error margin.
    3. Discovered that scaling beyond 70B parameters, “reasoning” modes, and medical fine-tuning often failed to boost and sometimes hurt accuracy.
    4. Exposed systematic overconfidence: performance dropped with longer contexts and models rarely showed skepticism toward low-quality or conflicting evidence.

    Point 3 suggests that building strictly on scaling laws to find the needle in the haystack probably isn't the most effective use of energy. RAG, then agentic RAG: both have been shown to be somewhat effective, but we have yet to see something that efficiently and effectively allows for highly accurate generation.

    Also, because there is a lot of junk out there, is there any significant work on how to maximize LLM performance in discerning between low-quality or conflicting evidence? The most important step is discerning what to add and avoid when training your model or building a database for RAG. There are methods out there, but not enough that are both efficient and effective. And with the speed at which medical knowledge keeps multiplying (especially with not-so-great AI-written work), I would love to see more people focused on building great, fast discernment.

    Here's the awesome work: https://lnkd.in/gBmCzGda Congrats to Christopher Polzak, Alejandro Lozano, Min Woo Sun, James Burgess, Serena Yeung-Levy and co!

    I post my takes on the latest developments in health AI. Connect with me to stay updated! Also, check out my health AI blog here: https://lnkd.in/g3nrQFxW

  • View profile for Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,842 followers

    In our daily discussions about generative AI, the fear of AI 'hallucinating' (fabricating information) often surfaces. This conversation, however, opens the door to an exciting question: Could AI surpass human accuracy in identifying truths?

    Enter a groundbreaking study by #Google #DeepMind and #Stanford researchers, which introduces a novel framework called SAFE. Tested across approximately 16,000 facts, SAFE demonstrated superhuman performance, aligning with human evaluators 72% of the time and besting them in 76% of contested cases, all while being 20 times more cost-effective than traditional methods.

    The essence of this methodology lies in two pivotal steps. Initially, the LongFact prompt set, crafted using GPT-4, targets the comprehensive assessment of long-form content's factuality over 38 varied topics. Then, the SAFE framework takes this base further by meticulously breaking down responses into individual facts and validating each through targeted Google Search queries.

    The process unfolds across four critical stages:
    1. Prompt Generation with LongFact: Crafting varied, fact-seeking prompts to elicit detailed LLM responses.
    2. Decomposition into Individual Facts: Segmenting these responses into distinct facts for precise evaluation.
    3. Fact Verification via Google Search: Using LLMs to formulate and dispatch queries, checking each fact's accuracy against search results.
    4. Iterative Reasoning and Evaluation: Applying a multi-step reasoning process to assess the support level for each fact.

    This innovative approach doesn't just mark a leap in evaluating LLM-generated content's factuality; it also paves the way for more trustworthy AI applications in countless fields.

    For a deep dive into this fascinating study, including access to the LongFact prompts and the SAFE framework, visit: https://lnkd.in/eVr4rz-u Find the full paper here: https://lnkd.in/eSjZ5Tn9

    #GenAI #LLM #Hallucination #FactChecking #DeepMind #Stanford #Google #SAFE #LongFact
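
The decompose-then-verify flow in stages 2 to 4 can be sketched as a small pipeline with pluggable steps. Everything below is a simplified assumption rather than the SAFE implementation: the three callables are hypothetical stand-ins for LLM and search calls, the demo uses naive sentence splitting and an in-memory list instead of Google Search, and each fact is checked in a single pass rather than through iterative multi-step reasoning.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FactVerdict:
    fact: str
    supported: bool


def evaluate_response(
    response: str,
    split_into_facts: Callable[[str], List[str]],     # stand-in for an LLM decomposition step
    search: Callable[[str], List[str]],               # stand-in for a web search call
    is_supported: Callable[[str, List[str]], bool],   # stand-in for an LLM rating step
) -> List[FactVerdict]:
    """Decompose a response into facts and check each one against retrieved evidence."""
    verdicts = []
    for fact in split_into_facts(response):
        evidence = search(fact)
        verdicts.append(FactVerdict(fact, is_supported(fact, evidence)))
    return verdicts


def supported_fraction(verdicts: List[FactVerdict]) -> float:
    """Share of facts judged supported; 0.0 when there are no facts."""
    return sum(v.supported for v in verdicts) / len(verdicts) if verdicts else 0.0


# Toy stand-ins so the sketch runs offline (all hypothetical).
KNOWLEDGE = ["Paris is the capital of France"]

def naive_split(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]

def toy_search(fact: str) -> List[str]:
    return KNOWLEDGE

def toy_is_supported(fact: str, evidence: List[str]) -> bool:
    return any(fact.lower() in doc.lower() for doc in evidence)


verdicts = evaluate_response(
    "Paris is the capital of France. The Moon is made of cheese.",
    naive_split, toy_search, toy_is_supported,
)
print(supported_fraction(verdicts))
```

Keeping the three steps as injected callables makes it possible to swap the toy stand-ins for real model and search calls without changing the control flow.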

  • View profile for George Hurn-Maloney

    Co-Founder @ Fastino

    8,515 followers

    We published a case study on LLM inadequacy in healthcare last week. This week, a Nature Medicine article reinforced our findings.

    Luc Rocher and colleagues from the Oxford Internet Institute, University of Oxford, published an article in Nature Medicine testing GPT-4o, Llama 3, and Command R+ with 1,298 people across 10 medical scenarios. The results reveal what the authors call a “translation gap.”

    When the researchers fed the models clean, structured data in the form of Standardized Medical Scenarios (SMS), the models identified medical conditions with an average of 94.9% accuracy. However, when the same models were used to identify medical conditions in a chatbot scenario (with less structured data and more "noise"), they were only 34.9% accurate. Participants who used a chatbot identified conditions in fewer than 34.5% of cases, and the right course of action in fewer than 44.2%. This demonstrates that LLMs are excellent at encoding medical knowledge but quite poor at generating it.

    The researchers found that the LLMs were highly sensitive to user bias and tended to agree with the user’s assessment of the situation significantly more often than they should. This is unsurprising, given recent findings about LLM sycophancy. They also found that in chatbot scenarios, the LLMs were sensitive to even very slight variations in how users phrased questions, demonstrating overall brittleness and unreliability in medical language generation.

    The Nature study shows exactly why this matters: LLMs are excellent encoders of medical knowledge but poor generators in practice. This paper underscores one of the most critical success patterns we're seeing in AI right now: model architectures must be matched to their downstream tasks. Fastino Labs's GLiNER2 excels at encoding and extracting information, not generating erroneous advice.

    Links to the Nature Medicine paper and our blog post below.
    🔗 Nature Medicine paper: https://lnkd.in/gesYWrVw
    🔗 Blog: https://lnkd.in/gcNmnA8T

  • View profile for Bhargav Patel, MD, MBA

    Physician-Leader at the Intersection of AI, Medicine & Psychiatry | Medical + AI Researcher | Adult & Child Psychiatrist | Neuroscientist | Founder | Upcoming Books: Trauma Transformed & The Future of AI in Healthcare

    11,305 followers

    LLMs scored 95% on identifying medical conditions when tested alone. When real people used them for medical advice, accuracy dropped to 35%.

    A new randomized study in Nature Medicine tested whether large language models actually help the public make better medical decisions. 1,298 participants were given medical scenarios and asked to identify conditions and recommend next steps.

    GPT-4o, Llama 3, and Command R+ all performed well when directly prompted. They identified relevant conditions in 94.9% of cases and recommended correct disposition in 56.3% on average. But when participants used these same models for assistance, condition identification dropped below 34.5% and disposition accuracy fell to 44.2% (no better than the control group using search engines).

    The gap wasn't medical knowledge. It was interaction. Researchers analyzed conversation transcripts and found users provided incomplete information to models. Models sometimes misinterpreted context or gave inconsistent advice. Even when models suggested correct conditions, users didn't consistently follow recommendations.

    Standard medical benchmarks didn't predict this. Models achieved passing scores (>60%) on MedQA questions matched to scenarios but still failed in interactive testing. Performance on structured exams was largely uncorrelated with performance with real users.

    Simulated patient interactions didn't predict it either. When researchers replaced humans with LLM-simulated users, simulated users performed better (57.3% vs 44.2%) and showed less variation. Simulations were only weakly predictive of human behavior.

    Here’s what this means: Benchmark performance is necessary but insufficient. A model scoring 80% on medical licensing exams can produce 20% accuracy when paired with real users. The constraint isn't algorithmic capability. It's human-AI interaction design. Users don't know what information to provide. Models don't ask the right clarifying questions. Correct suggestions get lost in conversation.

    For clinicians: expect patients to arrive with AI-informed conclusions that may not be accurate. Patients using LLMs were no better at assessing clinical acuity than those using traditional methods.

    For developers: user testing with real humans must precede deployment. Simulations and benchmarks don't capture interaction failures.

    AI excels at medical exams. But medicine isn't a multiple-choice test. It's a conversation under uncertainty.

    Source: Nature Medicine, "Reliability of LLMs as medical assistants for the general public"

  • View profile for Mikyo King

    Head of Open Source. Building the future of AI Observability at Arize AI

    3,259 followers

    If you're evaluating LLMs with off-the-shelf metrics, you might be measuring the wrong things. Generic scores like "helpfulness" or "hallucination rate" can create a false sense of confidence. They don't tell you what's actually breaking in your system, or why your users are frustrated.

    💡 There's a better approach: open coding. It's a technique from qualitative research. Rather than starting with predefined categories, you review your traces and write open-ended notes about what you observe. You let the failure modes emerge from your data.

    A note might be: "The agent missed a clear opportunity to re-engage after the user mentioned price concerns." Or: "Asked for confirmation twice in a row before booking." These observations are specific, actionable, and grounded in how your product actually behaves.

    From there, you group similar notes into themes, count the frequencies, and suddenly you have a prioritized list of what to fix, derived from evidence, not assumptions.

    This isn't new. It's been central to ML error analysis for decades. But it's especially valuable now, when so many teams are shipping LLM features without a clear framework for understanding quality. Gergely Orosz and Hamel Husain published an excellent deep-dive on this workflow. If you're building AI products, it's worth your time.

    Arize AI Phoenix now supports this with a Span Notes API: add notes to traces programmatically and build annotation workflows into your evaluation process. https://lnkd.in/g2Sdc8vz https://lnkd.in/gRx_2tjV
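
The "group notes into themes, count the frequencies" step needs nothing more than a tally once a reviewer has assigned a theme to each note. The sketch below assumes that manual coding has already happened; the trace IDs, theme labels, and the third note are hypothetical, and it deliberately does not use the Phoenix Span Notes API, whose exact interface is not shown in the post.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Each entry: a reviewer's open-ended note plus the theme assigned during grouping.
# Trace IDs and theme labels are made up for illustration.
coded_notes: List[Dict[str, str]] = [
    {
        "trace_id": "t-001",
        "note": "The agent missed a clear opportunity to re-engage after the user mentioned price concerns.",
        "theme": "missed_reengagement",
    },
    {
        "trace_id": "t-002",
        "note": "Asked for confirmation twice in a row before booking.",
        "theme": "redundant_confirmation",
    },
    {
        "trace_id": "t-003",
        "note": "Re-confirmed the date after the user had already confirmed it.",
        "theme": "redundant_confirmation",
    },
]


def prioritize(notes: List[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Return (theme, count) pairs, most frequent theme first."""
    return Counter(n["theme"] for n in notes).most_common()


for theme, count in prioritize(coded_notes):
    print(f"{theme}: {count}")
```

The ordered output is the prioritized fix list the post describes; keeping the trace ID on each note makes it easy to jump back to concrete examples of each failure mode.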
