Here is why leaderboards can fool you (and what to do instead) 👇

Benchmarks are macro averages, and your application is a micro reality. A model that’s top-3 on MMLU or GSM-Plus might still bomb when asked to summarize legal contracts, extract SKUs from receipts, or answer domain-specific FAQs. That’s because:
👉 Benchmarks skew toward academic tasks and short-form inputs. Most production systems run multi-turn, tool-calling, or retrieval workflows the benchmark never sees.
👉 Scores are single-shot snapshots. They don’t cover latency, cost, or robustness to adversarial prompts.
👉 The “average of many tasks” hides failure modes. A 2-point gain in translation might mask a 20-point drop in structured JSON extraction.

In short, public leaderboards tell you which model is good in general, not which model is good for you.

𝗕𝘂𝗶𝗹𝗱 𝗲𝘃𝗮𝗹𝘀 𝘁𝗵𝗮𝘁 𝗺𝗶𝗿𝗿𝗼𝗿 𝘆𝗼𝘂𝗿 𝘀𝘁𝗮𝗰𝗸
1️⃣ Trace the user journey. Map the critical steps (retrieve, route, generate, format).
2️⃣ Define success per step. Example metrics:
→ Retrieval → document relevance (binary).
→ Generation → faithfulness (factual / hallucinated).
→ Function calls → tool-choice accuracy (correct / incorrect).
3️⃣ Craft a golden dataset. 20-100 edge-case examples that stress real parameters (long docs, unicode, tricky entities).
4️⃣ Pick a cheap, categorical judge. “Correct/incorrect” beats 1-5 scores for clarity and stability.
5️⃣ Automate in CI/CD and prod. Gate PRs on offline evals; stream online evals for drift detection.
6️⃣ Iterate relentlessly. False negatives become new test rows; evaluator templates get tightened; costs drop as you fine-tune a smaller judge.

When you evaluate the system, not just the model, you’ll know exactly which upgrade, prompt tweak, or retrieval change pushes the real-world metric that matters: user success.

How are you tailoring evals for your own LLM pipeline?
Always up to swap notes on use-case-driven benchmarking.

Image Courtesy: Arize AI
----------
Share this with your network ♻️ Follow me (Aishwarya Srinivasan) for more AI insights and resources!
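The golden-dataset-plus-categorical-judge workflow from the steps above can be sketched as a tiny harness. Everything here is illustrative, not a real API: `run_pipeline` is a stub standing in for your retrieve/route/generate pipeline, and the 0.9 CI gate is an assumed example threshold.

```python
# Minimal sketch of a golden-dataset eval harness with a categorical judge.
# All names (GOLDEN_SET, run_pipeline, the 0.9 gate) are illustrative.

GOLDEN_SET = [
    {"input": "Extract the SKU from: 'Item A-1234, qty 2'", "expected": "A-1234"},
    {"input": "Extract the SKU from: 'Item B-9: 77 units'", "expected": "B-9"},
]

def run_pipeline(text: str) -> str:
    """Stand-in for your real retrieve/route/generate pipeline."""
    # Naive extraction: first token containing a hyphen, stripped of punctuation.
    for token in text.split():
        if "-" in token:
            return token.strip(",:'")
    return ""

def judge(output: str, expected: str) -> str:
    """Categorical judge: 'correct' / 'incorrect' beats a 1-5 scale for stability."""
    return "correct" if output == expected else "incorrect"

def run_evals(dataset):
    results = [judge(run_pipeline(row["input"]), row["expected"]) for row in dataset]
    accuracy = results.count("correct") / len(results)
    return results, accuracy

results, accuracy = run_evals(GOLDEN_SET)
GATE = 0.9  # example CI gate: fail the PR if offline accuracy drops below this
passed_gate = accuracy >= GATE
```

In CI, `passed_gate` would decide whether the PR merges; in production, the same judge can run on sampled live traffic for drift detection.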
-
Founders who actually use their own product and become part of their target audience get to really understand the pains.

Being a founder who uses your own product puts you in your customers' shoes. You see firsthand what works, what doesn’t, and where the pain points are. This insider view is priceless because you really understand the needs and frustrations of your audience.

When you live your users’ experience, you build real empathy. You feel their struggles and can create solutions that truly help. This goes beyond data and surveys; it’s about living the same experience.

Using your product often helps you spot small but important fixes that might get missed otherwise. These little tweaks can really boost user satisfaction and product quality. Plus, being an active user lets you connect with your community better. You join conversations and get direct feedback, keeping you in touch with your users' changing needs.

So scratch your own itch and solve problems that you’ve personally experienced, because this can be a huge competitive advantage.
-
What we should want from interacting with AI is greater knowledge and capabilities, not just outputs. We should maximize "better knowledge transfer: the ability of models to communicate reasoning in ways humans can understand, apply, and learn from".

A wonderful study from Princeton Language and Intelligence and Stanford University researchers measures human-AI "knowledge transfer": humans and AI ideate together on problem-solving, then the humans implement solutions independently, to identify the impact on human understanding. They conclude that knowledge transfer is inconsistent and requires dedicated optimization.

Some of the specific insights in the paper:

🔀 Model performance is not the same as knowledge transfer impact. Claude-3.7-Sonnet improved human success on coding tasks by +25 percentage points, despite a solo solve rate of just 45%. Meanwhile, Gemini-2.5-Pro, which solved 81.3% of tasks alone, actually reduced human solve rates when humans were paired with it. High capability does not guarantee communicability.

🧑🏫 Teaching style trumps correctness in math. Users favored models that framed reasoning accessibly over those that offered technically precise but dense or symbolic outputs. For instance, models like o1 scored high in accuracy (83.3%) but were often rated poorly because users couldn’t follow the explanation style. Preferences diverged sharply from performance in math tasks.

🔍 Users often defer, even when the model is wrong. In 5% of cases, participants explicitly said they trusted the model without question. This overreliance led to skipped planning and mistaken implementations, even when the model’s output was incorrect. This emphasizes the need for models to invite engagement, not passive acceptance.

📐 Communication must match user expertise. Models that broke down reasoning and checked for understanding were highly rated when paired with less skilled users. The same approach frustrated more advanced users, who preferred direct, concise input.
For example, Gemini-2.5-Pro scored a 27.2% preference in cases where it clearly outskilled the user, but just 4.4% when the user was more capable.

🧭 Strategy helps more than steps. Participants highlighted moments when models nudged them toward the right approach, like recalling a useful algorithmic pattern, as especially valuable. Strategic cues were more effective than exhaustive walk-throughs, which often buried the core idea under detail.

💡 Format and style shape impact. Overly verbose or poorly formatted responses were a recurring issue, with 15% of feedback noting formatting problems and 4% citing unnecessary wordiness. Even correct insights failed to transfer if presented in an unstructured or overwhelming way.

I'll be sharing lots more insights into effective Humans + AI collaboration!
-
How Reliable Are Your Offline Recommender System Tests? New Research Reveals Critical Biases

Offline evaluation remains the dominant approach for benchmarking recommender systems, but researchers from Universidade Federal de Minas Gerais and University of Gothenburg have exposed fundamental reliability issues in how we sample data for these evaluations. The core problem: users only interact with items they're shown (exposure bias), and evaluations typically use only a sampled subset of items rather than full catalogs (sampling bias). These compounding biases can severely distort which models appear to perform best.

The Framework
The research introduces a systematic evaluation across four dimensions:
- Resolution: can the sampler distinguish between competing models?
- Fidelity: does sampling preserve full-evaluation rankings?
- Robustness: do results remain stable under different exposure conditions?
- Predictive power: do biased samples recover ground-truth preferences?

Key Technical Findings
Using the KuaiRec dataset with complete user-item preferences, the team simulated multiple exposure policies (uniform, popularity-biased, positivity-biased) at varying sparsity levels (0-95%), then tested nine sampling strategies including uniform random, popularity-weighted, positivity-weighted, and propensity-corrected approaches like WTD and Skew.

The results challenge conventional wisdom. Larger sample sizes don't guarantee better evaluation: what matters is which items get sampled. Under high sparsity (90-95%), many samplers produce excessive tie rates between models, losing discriminative power. Bias-aware strategies like WTD, WTDH, and Skew consistently outperformed naive baselines, maintaining stronger alignment with ground truth even under severe data constraints. Perhaps most striking: even the "Exposed" sampler (using all logged items) showed degradation under biased logging, while carefully designed smaller samples often proved more reliable.
Practical Implications For practitioners: your choice of negative sampling strategy fundamentally impacts which models you'll select. The research suggests prioritizing methods that account for exposure patterns, particularly in sparse data regimes. The paper's code and complete experimental framework are publicly available, enabling teams to audit their own evaluation pipelines.
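To make the negative-sampling point concrete, here is a toy sketch contrasting two of the strategy families named above, uniform and popularity-weighted sampling. The item IDs, interaction counts, and add-one smoothing are my own illustrative assumptions, not the paper's implementation.

```python
import random
from collections import Counter

# Toy sketch: uniform vs. popularity-weighted negative sampling.
# All data below is made up for illustration.
interaction_log = ["i1", "i1", "i1", "i2", "i2", "i3"]  # logged exposures
catalog = ["i1", "i2", "i3", "i4", "i5"]                # full item catalog
user_positives = {"i1"}                                 # items this user liked

popularity = Counter(interaction_log)
# Negatives are drawn from items the user has NOT interacted with positively.
candidates = [item for item in catalog if item not in user_positives]

def sample_uniform(k, rng):
    """Every candidate item is equally likely to be drawn as a negative."""
    return rng.choices(candidates, k=k)

def sample_popularity(k, rng, smoothing=1.0):
    """Candidates are drawn proportionally to logged exposure counts.
    Add-one smoothing (an assumption here) keeps never-exposed items drawable."""
    weights = [popularity[item] + smoothing for item in candidates]
    return rng.choices(candidates, weights=weights, k=k)

rng = random.Random(0)  # fixed seed for reproducible evaluation runs
negatives = sample_popularity(100, rng)
```

The paper's point is that this choice is not cosmetic: under biased logging, popularity-weighted and propensity-corrected samplers can rank models differently than uniform sampling does.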
-
In the last 90 days I spoke to 12 CXOs. They all said one thing: GenAI doesn't deliver business value.

The reason? It’s not model choice. It’s not bad prompts. It’s that they skip the most important part: LLM evaluation. This is why evals matter.

In one Datali project, testing took us from 60% to 92% accuracy. Not by luck and blind trial, but by building a rigorous, automated testing pipeline. Here’s the boring but harsh truth: you don’t write a perfect system prompt and then test it. You write tests first and discover prompts that pass them.

This is what you get:
1// You gain crystal-clear visibility: a precise picture of what works and what doesn’t. You see how your system behaves across real-world inputs. You know where failures happen and why. You can plan risk-mitigation strategies early.
2// You iterate faster. Once you're testing thoroughly, you can run more experiments, track their results, and revisit what worked best, even months later. You catch problems early. You refine prompts, add data, or fine-tune with confidence. You move from PoC → MVP → production, adjusting to user feedback without guesswork.
3// You build better products in less time. “Better” here means: higher accuracy → less hallucination, better task handling; more stability → no surprises in production, fewer user complaints.
4// You reach the desired business impact: ROI, KPIs, and cost savings. This is the combined result of the previous actions. If your system is accurate, stable, and aligned to the user’s goals, that’s everything you need.

Shorter development cycles = faster time to market.
Fewer bugs = lower support costs.
Focused iterations = less wasted dev time.

It’s priceless. But you can get it only with the right approach.
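The "tests first, prompts second" idea can be sketched roughly like this. Everything here is a made-up illustration: `call_model` is a stub (a real version would call your LLM provider), and the test cases and candidate prompts are hypothetical.

```python
# Sketch of test-first prompt development: the test suite is fixed,
# and candidate prompts are scored against it.

TEST_CASES = [
    {"input": "2+2", "must_contain": "4"},
    {"input": "capital of France", "must_contain": "Paris"},
]

def call_model(prompt: str, user_input: str) -> str:
    """Stub standing in for a real LLM call (prompt + user input -> text)."""
    canned = {
        "2+2": "The answer is 4.",
        "capital of France": "Paris is the capital.",
    }
    return canned.get(user_input, "")

def score_prompt(prompt: str) -> float:
    """Fraction of fixed test cases a candidate prompt passes."""
    passed = sum(
        case["must_contain"] in call_model(prompt, case["input"])
        for case in TEST_CASES
    )
    return passed / len(TEST_CASES)

candidate_prompts = ["You are concise.", "Answer step by step."]
best_prompt = max(candidate_prompts, key=score_prompt)
```

The suite stays stable while prompts churn, which is what makes results comparable across experiments, even months later.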
-
If you’re a UX researcher curious about what Structural Equation Modeling (SEM) can actually do for your work, you’re in the right place. Let’s say you’re working on a grocery planning app. Users enter ingredients they have, and the app recommends recipes. Now you want to understand how to make that experience better. You might have some intuitive ideas: maybe if the app is easy to use, the personalization feels stronger. If personalization improves, satisfaction goes up. And when users are satisfied, they’re more likely to stick around. But how do you test that whole chain of relationships at once? That’s exactly what SEM is built for. So what is SEM? It’s a statistical framework that helps you test how different aspects of a user’s experience are linked - simultaneously. Unlike traditional methods that analyze one relationship at a time, SEM lets you look at the full picture, including both visible data (like task success or ratings) and hidden concepts (like trust or satisfaction). These hidden concepts are called latent variables. You don’t measure them directly, you estimate them through things like survey questions. For example, satisfaction might be reflected in responses like “I enjoy using this app” or “This app meets my needs.” SEM is especially helpful because UX is never just one thing. Users’ feelings and behaviors are shaped by a web of interconnected elements like ease of use, trust, enjoyment, and perceived usefulness. If you want to know what really drives continued use, you need to model the whole system, not just isolated parts. This kind of modeling lets you go beyond surface-level stats. You can separate the things you observe (like a 1-5 star rating) from the psychological constructs you care about (like satisfaction). You can also identify which features influence others indirectly, such as how ease of use might boost satisfaction by first improving personalization. 
You can even account for measurement error and compare different user groups, like first-time users versus power users. Let’s bring it back to our grocery app. You might collect data on how easy users find the app to navigate, how personalized the recommendations feel, how satisfied they are overall, and whether they intend to keep using it. SEM lets you test how each of those pieces fits together. The results might show that ease of use drives personalization, which increases satisfaction, which in turn predicts continued use. It’s a roadmap for product decisions. If you’re new to SEM, don’t worry. Start by learning the basics of regression and factor analysis. From there, tools like AMOS (great for visual modeling) or R’s lavaan package (great if you like code) can take you further. Two great books for getting started are Barbara Byrne’s Structural Equation Modeling with AMOS and Rex Kline’s Principles and Practice of SEM.
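For intuition, here is a deliberately simplified, observed-variables-only sketch of the path idea for the grocery app (ease of use → personalization → satisfaction) on simulated data. A real SEM with latent variables would be fit in lavaan or a dedicated SEM package; the sample size, path coefficients, and noise levels below are all made-up assumptions.

```python
import numpy as np

# Simulate the causal chain: ease -> personalization -> satisfaction.
rng = np.random.default_rng(42)
n = 500
ease = rng.normal(size=n)
personalization = 0.6 * ease + rng.normal(scale=0.5, size=n)
satisfaction = 0.7 * personalization + rng.normal(scale=0.5, size=n)

def slope(x, y):
    """OLS slope of y on x (with an intercept), via least squares."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

a = slope(ease, personalization)          # path: ease -> personalization
b = slope(personalization, satisfaction)  # path: personalization -> satisfaction
indirect_effect = a * b  # ease's effect on satisfaction *via* personalization
```

The product `a * b` is the indirect effect, the "ease of use boosts satisfaction by first improving personalization" quantity described above; full SEM additionally models latent constructs and measurement error, which this sketch omits.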
-
This Nature Medicine paper is not an indictment of users. It’s an indictment of how we evaluate and deploy LLMs.

The study shows something subtle but important: when large language models are used as public-facing medical assistants, performance collapses, not because people are “bad users,” but because the systems are not designed to function reliably in real human interactions. In controlled testing, the models themselves perform well. But once embedded in an interactive setting, their outputs become:
1. inconsistent across semantically similar inputs
2. poorly calibrated for decision-making
3. difficult for non-experts to interpret or act on safely

That gap is not a user failure. It’s a design and evaluation failure. Standard benchmarks (medical exams) and even simulated users systematically overestimate real-world safety. They measure stored knowledge, not whether a system can reliably guide action under uncertainty. And medical care is always about managing uncertainty.

Humans do what humans always do: provide partial information, reason under ambiguity, and rely on cues like consistency and clarity. If an AI system degrades under those conditions, the responsibility lies with the system, not the person using it.

For high-stakes domains like healthcare, “human-in-the-loop” is not a safety guarantee. Interaction itself is the risk surface. Until models are designed, tested, and regulated around real user behavior, benchmark performance will remain a misleading proxy for safety.

https://lnkd.in/epT2YaEM #AI #Medicine #patients #humans
-
In UX research, we often move fast, we run studies under time pressure, collect messy data, and want answers quickly. In that environment, it is very tempting to say: let’s just run a regression, and see what comes out! The problem is that regression is not a single thing. Different regression models answer fundamentally different questions, and using the wrong one does not just reduce precision, it quietly changes the meaning of your results. What really matters is not the technique but the nature of the outcome you are trying to understand. User data behaves very differently depending on what you are measuring. A satisfaction score is not the same as a success or failure decision. A count of errors is not the same as time on task. Repeated measurements from the same user are not the same as independent observations from different people. These distinctions may sound technical, but they are actually about respecting how human behavior shows up in data. Over time, I have become much less impressed by sophisticated sounding analyses and much more focused on alignment. A simple model that matches the data generating process is far more valuable than a complex one that does not. Rigor in UX research does not come from complexity, it comes from choosing methods that fit the behavioral question you are asking and being honest about the uncertainty in the answer. When we get this right, something interesting happens. Explaining results becomes easier, not harder. Stakeholders understand what the model is saying because the model matches their intuition about user behavior. Design decisions feel more grounded because they are based on the right type of evidence. Most importantly, we stop treating statistical results as proof and start treating them as informed decisions under uncertainty. That is why I see model choice as one of the most important moments in any study. It is the point where we decide whether we are truly modeling users or just running numbers. 
If we slow down here and take this step seriously, everything that follows becomes clearer, more defensible, and more useful for making real product decisions. To learn more: https://lnkd.in/ga59ZJ62
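As a concrete aide-memoire for the "match the model to the outcome" argument, the mapping might look like this. The outcome categories and family names follow standard GLM conventions, but the lookup itself is just an illustrative sketch, not a substitute for judgment.

```python
# Illustrative lookup: type of UX outcome -> regression family that
# respects how that outcome is generated.
MODEL_FOR_OUTCOME = {
    "continuous rating": "linear regression (OLS)",
    "success/failure": "logistic regression",
    "error count": "Poisson or negative binomial regression",
    "time on task": "log-transformed OLS or Gamma regression",
    "repeated measures per user": "mixed-effects (multilevel) model",
}

def recommend_model(outcome_type: str) -> str:
    """Return a model family matched to the outcome's data-generating process."""
    try:
        return MODEL_FOR_OUTCOME[outcome_type]
    except KeyError:
        raise ValueError(f"Unknown outcome type: {outcome_type!r}")
```

The point of the table is the post's point: a success/failure decision, an error count, and a time-on-task measurement each demand a different likelihood, and repeated measurements from the same user demand a model that knows observations are not independent.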
-
Here’s the easiest way to make your products 10x more robust: start treating your AI evals like user stories.

Why? Because your evaluation strategy is your product strategy. Every evaluation metric maps to a user experience decision. Every failure mode triggers a designed response. Every edge case activates a specific product behavior. Great AI products aren’t just accurate; they’re resilient and graceful in failure.

I recently interviewed a candidate who shared this powerful approach. He said, “I spend more time designing for when AI fails than when it succeeds.” Why? Because 95% accuracy means your AI confidently gives wrong answers 1 in 20 times. So he builds:
• Fallback flows
• Confidence indicators
• Easy ways for users to correct mistakes

In other words, he doesn’t try to hide AI’s limitations; he designs around them, transparently. He uses AI evaluations as his actual Product Requirements Document. Instead of vague goals like “the system should be accurate,” he creates evaluation frameworks that become product specs. For example:

Evaluation as Requirements -
• When confidence score < 0.7, show “I’m not sure” indicator
• When user corrects AI 3x in a session, offer human handoff
• For financial advice, require 2-source verification before display

Failure Modes as Features -
• Low confidence → Collaborative mode (AI suggests, human decides)
• High confidence + wrong → Learning opportunity (capture correction)
• Edge case detected → Graceful degradation (simpler but reliable response)
• Bias flag triggered → Alternative perspectives offered

Success Metrics Redefined -
It’s not just accuracy anymore:
• User trust retention after AI mistakes
• Time-to-correction when AI is wrong
• Percentage of users who keep using the product after errors
• Rate of escalation to human support

Plan for failure, and your users will forgive the occasional mistake. Treat your AI evaluations like user stories, and watch your product’s robustness soar.
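The "Evaluation as Requirements" bullets could be turned into routing logic roughly like this. The two thresholds come from the example spec (0.7 confidence, 3 corrections); the mode names and return shape are my own assumptions.

```python
# Sketch: eval thresholds as product behavior. Thresholds follow the
# example spec above; mode names are illustrative.
CONFIDENCE_FLOOR = 0.7
MAX_CORRECTIONS = 3

def route_response(confidence: float, corrections_this_session: int) -> dict:
    """Pick a product behavior from eval signals instead of hiding them."""
    if corrections_this_session >= MAX_CORRECTIONS:
        # User corrected the AI repeatedly: offer a human handoff.
        return {"mode": "human_handoff", "show_uncertainty": False}
    if confidence < CONFIDENCE_FLOOR:
        # Collaborative mode: AI suggests, human decides, with an
        # explicit "I'm not sure" indicator.
        return {"mode": "collaborative", "show_uncertainty": True}
    return {"mode": "autonomous", "show_uncertainty": False}
```

The eval metric (confidence, correction rate) is the input; the designed failure response is the output. That is the sense in which the evaluation framework doubles as the product spec.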
♻️ Share this to help product teams build better AI products. Follow me for more practical insights on AI product leadership.
-
I love watching our customers use our product 𝗟𝗜𝗩𝗘 in their office. It’s one of the most revelatory experiences I have as a founder.

At Y Combinator, we learned that the two best uses of a founder’s time are:
1) Building product
2) Talking to users

We take this advice literally, and love visiting our customers in person to see them use Clueso (YC W23). It allows us to identify their problems on the spot and immediately ship features or fixes to address them. Watching users in action gives insights that no Zoom call can match. Here’s why:

1️⃣ 𝗜𝗳 𝘆𝗼𝘂𝗿 𝗽𝗿𝗼𝗱𝘂𝗰𝘁 𝗵𝗮𝘀 𝗽𝗼𝘄𝗲𝗿𝗳𝘂𝗹 𝗳𝗲𝗮𝘁𝘂𝗿𝗲𝘀 𝘁𝗵𝗮𝘁 𝘆𝗼𝘂𝗿 𝘂𝘀𝗲𝗿𝘀 𝗮𝗿𝗲 𝘂𝗻𝗮𝘄𝗮𝗿𝗲 𝗼𝗳, 𝘁𝗵𝗲𝘆 𝘄𝗼𝗻’𝘁 𝘁𝗲𝗹𝗹 𝘆𝗼𝘂 𝘁𝗵𝗲𝘆 𝗱𝗶𝗱𝗻’𝘁 𝗳𝗶𝗻𝗱 𝘁𝗵𝗲𝗺. Seeing users live will clearly highlight which features are catching their attention, and which ones are going unnoticed.

2️⃣ 𝗦𝗼𝗺𝗲 𝗳𝗲𝗲𝗹 𝗲𝗺𝗯𝗮𝗿𝗿𝗮𝘀𝘀𝗲𝗱 𝘁𝗼 𝗮𝗱𝗺𝗶𝘁 𝘁𝗵𝗲𝘆 𝘀𝘁𝗿𝘂𝗴𝗴𝗹𝗲𝗱, so they will hold back on telling you how bad their experience actually was. 𝗢𝘁𝗵𝗲𝗿𝘀 𝗳𝗲𝗲𝗹 𝗹𝗶𝗸𝗲 𝗿𝗲𝗽𝗼𝗿𝘁𝗶𝗻𝗴 𝗺𝗼𝗿𝗲 𝗽𝗿𝗼𝗯𝗹𝗲𝗺𝘀 𝗺𝗮𝗸𝗲𝘀 𝘁𝗵𝗲𝗺 𝗹𝗼𝗼𝗸 𝘀𝗺𝗮𝗿𝘁𝗲𝗿, so they’ll over-emphasize minor problems that are actually not worth fixing.

3️⃣ 𝗢𝘃𝗲𝗿 𝗮 𝗰𝗮𝗹𝗹, 𝘆𝗼𝘂’𝗹𝗹 𝗻𝗼𝘁 𝗴𝗲𝘁 𝘁𝗼 𝘀𝗲𝗲 𝗮 𝘂𝘀𝗲𝗿’𝘀 𝗮𝗰𝘁𝘂𝗮𝗹 𝗿𝗲𝗮𝗰𝘁𝗶𝗼𝗻𝘀. Some customers will water down how bad the experience was so you don’t feel bad. Others will exaggerate so they appear smart. Seeing real-time reactions, frustrations, and “aha” moments tells a clearer story.

All in all, it’s an exercise I’d highly recommend to everyone. It will give you deep insights into:
1) Whether the problem you’re solving is actually worthwhile
2) How well your product delivers on its promise
3) How customers actually perceive and experience your product

#ux #productmanagement #startup