Top LinkedIn Content on User Testing Methods for Designers

Founder & CEO @ Infra360 | DevOps, FinOps & CloudOps Partner for FinTech, SaaS & Enterprises

19,087 followers 1y

“99.999% Uptime” is a Lie. You just don’t measure what matters. We worked with a unicorn last quarter that proudly claimed “5 nines” uptime. Their checkout service failed silently 37 times in 30 days. Here’s what I’ve seen actually destroy “perfect uptime” brag sheets:
→ 𝐍𝐨 𝐔𝐬𝐞𝐫 𝐉𝐨𝐮𝐫𝐧𝐞𝐲 𝐕𝐚𝐥𝐢𝐝𝐚𝐭𝐢𝐨𝐧 You test your endpoints, not your flows. The API responds, but the user can’t complete a transaction. Still counts as “up,” right?
→ 𝐇𝐞𝐚𝐥𝐭𝐡 𝐂𝐡𝐞𝐜𝐤𝐬 𝐓𝐡𝐚𝐭 𝐋𝐢𝐞 “/health” returns 200. Meanwhile, your Redis is choking and async workers are deadlocked. But hey, 5 nines!
→ 𝐍𝐨 𝐃𝐞𝐩𝐞𝐧𝐝𝐞𝐧𝐜𝐲 𝐀𝐰𝐚𝐫𝐞𝐧𝐞𝐬𝐬 Your core service is up, but its upstream auth service is failing silently. Now 11 other services are returning 500s downstream. Nobody tracks the blast radius.
→ 𝐏𝐨𝐬𝐭𝐦𝐨𝐫𝐭𝐞𝐦𝐬 𝐖𝐢𝐭𝐡𝐨𝐮𝐭 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐌𝐞𝐭𝐫𝐢𝐜𝐬 Incidents are closed when systems recover (not when revenue recovers). That’s why marketing keeps yelling, “Why is conversion down?” and DevOps goes, “Looks fine to me.”
→ 𝐍𝐨 𝐑𝐞𝐚𝐥 𝐒𝐋𝐎𝐬 SLOs aren't dashboards. They're promises to your users. If your SLOs don’t start with customer experience, they’re just vanity metrics.
𝐇𝐞𝐫𝐞’𝐬 𝐰𝐡𝐚𝐭 𝐰𝐞 𝐜𝐡𝐚𝐧𝐠𝐞𝐝 𝐟𝐨𝐫 𝐭𝐡𝐚𝐭 𝐮𝐧𝐢𝐜𝐨𝐫𝐧:
✓ Synthetic testing of real user flows, not just endpoints
✓ Dependency mapping on all critical paths
✓ Customer-facing SLOs tied to revenue-impacting flows
✓ RCA reviews that include business teams
If your dashboards scream “all green” but your users feel red, it’s time for a different lens. Ask yourself… Are you tracking uptime, or outcomes? Would love to hear what metrics you trust over “99.999%.” ♻️ 𝐑𝐄𝐏𝐎𝐒𝐓 𝐒𝐨 𝐎𝐭𝐡𝐞𝐫𝐬 𝐂𝐚𝐧 𝐋𝐞𝐚𝐫𝐧.
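
To put numbers on the gap between "5 nines" and what users experienced, here is a minimal Python sketch. It computes how much downtime an availability target permits over a window, and scores a journey-level SLO from synthetic checkout runs. The function names, the 99.9% journey target, and the five-minute probe cadence are illustrative assumptions, not details from the post; only the 37 failures in 30 days come from it.

```python
def allowed_downtime_seconds(availability: float, window_days: int = 30) -> float:
    """Seconds of downtime an availability target permits over the window."""
    window_seconds = window_days * 24 * 60 * 60
    return (1.0 - availability) * window_seconds


def journey_success_rate(results: list[bool]) -> float:
    """Fraction of synthetic user-journey runs that completed end to end."""
    if not results:
        raise ValueError("no journey runs recorded")
    return sum(results) / len(results)


def meets_slo(results: list[bool], target: float) -> bool:
    """True when the observed journey success rate reaches the SLO target."""
    return journey_success_rate(results) >= target


if __name__ == "__main__":
    # "5 nines" over 30 days leaves roughly 25.9 seconds of downtime.
    print(f"{allowed_downtime_seconds(0.99999):.2f} s allowed")

    # Assumed cadence: one synthetic checkout every 5 minutes for 30 days
    # (8,640 runs), of which 37 failed as in the post.
    runs = [True] * (8640 - 37) + [False] * 37
    print(f"journey success: {journey_success_rate(runs):.4%}")
    print("meets 99.9% journey SLO:", meets_slo(runs, 0.999))
```

Under these assumptions the endpoint can report near-perfect uptime while the checkout journey lands around 99.57%, well short of even a modest journey-level target.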

Anshita Bhasin

Technical Product & Program Leader | Bridging Business, AI & Engineering | Enterprise Platform Transformation | Speaker & Tech Creator

34,789 followers 1y

Testing with AI: Post 4 ------------------------- Continuing my Testing with AI series, today I’m sharing another amazing feature recently introduced by the KushoAI team, which is UI Testing. Their innovative Chrome extension simplifies end-to-end testing workflows by allowing you to select UI elements directly from your browser. With the power of large language models (LLMs), it enables you to: (i) Generate test ideas tailored to your application's functionality. (ii) Write automation scripts—currently supporting Playwright (.ts files), with plans to expand to Selenium and Cypress soon. (iii) Download the complete test project to your local machine for execution. My Experience with Kusho AI’s UI Testing Here’s what impressed me the most: (1) Smart Element Detection Simply click on UI elements in your application, and Kusho AI instantly identifies and processes them for testing. (2) Comprehensive Test Case Generation It automatically generates both functional and edge test cases, eliminating the need for manual test creation. (3) Automation Code Ready to Use For each test case, Kusho AI provides ready-made Playwright scripts. Support for Cypress, Selenium, and other tools is coming soon! (4) Real-Time Test Enhancements (A standout feature!) You can generate additional test cases on demand, based on your selected page or component. (5) Great for Beginners & Experts If you're new to automation, Kusho AI provides a solid starting point. For experienced testers, it saves time by automating repetitive tasks and ensuring robust test coverage. Instead of worrying about AI taking over, let’s leverage tools like Kusho AI to automate tedious tasks and focus on higher-value testing strategies. Try it out and see how much time and effort you can save! If you find it useful, repost to help others in the testing community. Link - https://kusho.ai/ P.S. Kusho AI is free for individual users, with an Enterprise model available for larger teams. 
I’ve attached some screenshots to give you a glimpse of its capabilities!

8 Comments

Prafful Agarwal

Software Engineer at Google

33,117 followers 1y

How Big Tech Tests in Production Without Breaking Everything

Most outages happen because changes weren’t tested under real-world conditions before deployment. Big tech companies don’t gamble with production. Instead, they use Testing in Production (TiP), a strategy that ensures new features and infrastructure work before they go live for all users. Let’s break down how it works.

1/ Shadow Testing (Dark Launching)
This is the safest way to test in production without affecting real users.
# How it works:
- Incoming live traffic is mirrored to a shadow environment that runs the new version of the system.
- The shadow system processes requests but doesn’t return responses to actual users.
- Engineers compare outputs from old vs. new systems to detect regressions before deployment.
# Why is this powerful?
- It validates performance, correctness, and scalability with real-world traffic patterns.
- No risk of breaking the user experience while testing.
- Helps uncover unexpected edge cases before rollout.

2/ Synthetic Load Testing – Simulating Real-World Usage
Sometimes, using real user traffic isn’t feasible due to privacy regulations or data sensitivity. Instead, engineers generate synthetic requests that mimic real-world usage patterns.
# How it works:
- Scripted requests are sent to production-like environments to simulate actual user interactions.
- Engineers analyze response times, bottlenecks, and potential crashes under heavy load.
- Helps answer:
  - How does the system perform under high concurrency?
  - Can it handle sudden traffic spikes?
  - Are there any memory leaks or slowdowns over time?
🔹 Example: Netflix generates synthetic traffic to test how its recommendation engine scales during peak usage.

3/ Feature Flags & Gradual Rollouts – Controlled Risk Management
The worst thing you can do? Deploy a feature to all users at once and hope it works. Big tech companies avoid this by using feature flags and staged rollouts.
# How it works:
- New features are rolled out to a small percentage of users first (1% → 10% → 50% → 100%).
- Engineers monitor error rates, performance, and feedback.
- If something goes wrong, they can immediately roll back without affecting everyone.
# Why is this powerful?
- Minimizes risk: only a fraction of users are affected if a bug is found.
- Engineers get real-world validation in a controlled way.
- Allows A/B testing to compare the impact of new vs. old behavior.
🔹 Example:
- Facebook uses feature flags to release new UI updates to a limited user group first.
- If engagement drops or errors spike, they disable the feature instantly.

Would you rather catch a bug before or after it takes down your system?
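
The staged rollout described above (1% → 10% → 50% → 100%) is often implemented by hashing each user into a stable bucket and comparing it with the current rollout percentage. The Python sketch below shows one way this might look. The names `bucket_for`, `is_enabled`, and `kill_switch` are hypothetical, and nothing here is claimed to be how any particular company does it.

```python
import hashlib

# Rollout stages from the post: 1% -> 10% -> 50% -> 100%.
ROLLOUT_STAGES = (1, 10, 50, 100)


def bucket_for(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for the given feature."""
    # Including the feature name keeps buckets independent across features.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int,
               kill_switch: bool = False) -> bool:
    """Decide whether this user sees the feature at the current stage."""
    if kill_switch:
        # Instant rollback: the flag turns the feature off for everyone.
        return False
    return bucket_for(user_id, feature) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(2000)]
    for percent in ROLLOUT_STAGES:
        exposed = sum(is_enabled(u, "new-checkout", percent) for u in users)
        print(f"{percent:>3}% stage -> {exposed} of {len(users)} users exposed")
```

Because a user's bucket never changes, anyone exposed at the 1% stage stays exposed at 10% and beyond, which keeps the experience consistent as the rollout widens. Flipping `kill_switch` disables the feature for everyone without a redeploy.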

7 Comments

Elvis S.

Founder at DAIR.AI | Angel Investor | Advisor | Prev: Meta AI, Galactica LLM, Elastic, Ph.D. | Serving 7M+ learners around the world

86,482 followers 1y

AgentA/B is a fully automated A/B testing framework that replaces live human traffic with large-scale LLM-based agents. These agents simulate realistic, intention-driven user behaviors on actual web environments, enabling faster, cheaper, and risk-free UX evaluations, even on real websites like Amazon. Key Insights: • Modular agent simulation pipeline – Four components—agent generation, condition prep, interaction loop, and post-analysis—allow plug-and-play simulations on live webpages using diverse LLM personas. • Real-world fidelity – The system parses live DOM into JSON, enabling structured interaction loops (search, filter, click, purchase) executed via LLM reasoning + Selenium. • Behavioral realism – Simulated agents show more goal-directed but comparable interaction patterns vs. 1M real Amazon users (e.g., shorter sessions but similar purchase rates). • Design sensitivity – A/B test comparing full vs. reduced filter panels revealed that agents in the treatment condition clicked more, used filters more often, and purchased more. • Inclusive prototyping – Agents can represent hard-to-reach populations (e.g., low-tech users), making early-stage UX testing more inclusive and risk-free. • Notable results: - Simulated 1,000 LLM agents with unique personas in a live Amazon shopping scenario. - Agents in the treatment condition spent more ($60.99 vs. $55.14) and purchased more products (414 vs. 404), confirming the utility of interface changes. - Behavioral alignment with humans was strong enough to validate simulation-based testing. - Only the purchase count difference reached statistical significance, suggesting further sample scaling is needed. AgentA/B shows how LLM agents can augment — not replace — traditional A/B testing by offering a new pre-deployment simulation layer. This can accelerate iteration, reduce development waste, and support UX inclusivity without needing immediate live traffic.

6 Comments

Jaime Teevan

Chief Scientist & Technical Fellow at Microsoft - for speaking requests please contact teevan-externalopps@microsoft.com

22,117 followers 7mo

Aligning AI with human preferences typically requires collecting a lot of explicit feedback, which can be costly and not reflective of real-world usage. But there are many signals already embedded in our everyday interactions with AI. It turns out that the casual “thanks” or “wait a sec” moments in a chat can be just as valuable when training a model as formal ratings – if we know how to use them. 📖 WildFeedback: Aligning LLMs With In‑situ User Interactions and Feedback (https://lnkd.in/gxGyb-ig), by Taiwei Shi, Zhuoer Wang, Longqi Yang, Ying-Chun Lin, Zexue He, Mengting Wan, Pei Zhou, Sujay Kumar Jauhar, Sihao Chen, Freddie Zhang, Jieyu Zhao, Xiaofeng Xu, Xia Song, and Jennifer Neville. NeurIPS 2024 Workshop. What’s novel in this paper is not just that it incorporates human feedback, but how it does so. The authors turn weak, messy signals from real conversations (implicit cues like “thanks,” “wait,” or “revise this”) into clean preference pairs at scale, and then show those signals can actually nudge the model in the right direction. This reframes alignment from a one‑off RLHF sprint into an ongoing, in‑situ dialog with users. The paper is exceptionally well grounded in real data (mining 20,281 preference pairs from 148,715 multi‑turn chats), and complements the usual benchmark tests with a checklist‑guided evaluation. A good template if you’re thinking about continuous AI alignment in everyday use. #BeyondTheAbstract #NeurIPS2024 #AIAlignment #OAR #AppliedResearch
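
As a rough illustration of the idea in this post, the Python sketch below turns implicit cues in a chat into a preference pair: a reply the user pushed back on becomes "rejected", and a later reply the user thanked becomes "chosen". This is a deliberately simplified keyword heuristic written for illustration, not the WildFeedback pipeline, which the post does not detail. The cue lists, `classify_feedback`, and `mine_preference_pairs` are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical cue lists; a real system would need a far more robust classifier.
SATISFIED_CUES = ("thanks", "thank you", "perfect")
DISSATISFIED_CUES = ("wait", "revise", "that's wrong")


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def classify_feedback(user_message: str) -> str:
    """Label a user turn as 'satisfied', 'dissatisfied', or 'neutral'."""
    text = user_message.lower()
    if any(cue in text for cue in DISSATISFIED_CUES):
        return "dissatisfied"
    if any(cue in text for cue in SATISFIED_CUES):
        return "satisfied"
    return "neutral"


def mine_preference_pairs(turns: list[tuple[str, str]]) -> list[PreferencePair]:
    """Scan (role, text) turns and emit a pair when pushback is later resolved."""
    pairs: list[PreferencePair] = []
    prompt = None      # the user request currently being worked on
    rejected = None    # an assistant reply the user pushed back on
    last_reply = None  # the most recent assistant reply
    for role, text in turns:
        if role == "assistant":
            last_reply = text
            continue
        label = classify_feedback(text)
        if label == "neutral":
            # A new request starts a fresh episode.
            prompt, rejected, last_reply = text, None, None
        elif prompt is None or last_reply is None:
            continue  # feedback with nothing to attach it to
        elif label == "dissatisfied":
            rejected = last_reply
        elif rejected is not None:
            # Satisfied after earlier pushback: the latest reply is preferred.
            pairs.append(PreferencePair(prompt, last_reply, rejected))
            prompt, rejected, last_reply = None, None, None
    return pairs


if __name__ == "__main__":
    chat = [
        ("user", "Write a tagline for my bakery"),
        ("assistant", "Bread."),
        ("user", "Wait, revise this to be warmer"),
        ("assistant", "Fresh warmth in every loaf."),
        ("user", "Thanks, perfect"),
    ]
    print(mine_preference_pairs(chat))
```

Substring matching like this will misfire on ordinary words (for example "waiter"), which is part of why turning messy signals into clean pairs at scale is the hard part.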

WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback arxiv.org

4 Comments

Marko Sarstedt

16,080 followers 6mo

In our new study, published in transfer – Zeitschrift für Kommunikation und Markenmanagement, Alexander Rüdiger Daum, Stephan Pauli (rpc - The Retail Performance Company), and I (Ludwig-Maximilians-Universität München; LMU Munich School of Management) explore how large language models (#LLMs) can transform how companies measure customer experience (#CX). 🤖🤖🤖 Traditional surveys like NPS are costly, slow, and limited in scope. By contrast, analyzing user-generated content (e.g., Google reviews) with LLMs enables real-time insights, scalable benchmarking, and early detection of emerging themes. Using GPT-4o, we show that AI-based CX ratings align closely with expert evaluations, offering a fast, low-cost complement to surveys. Additional showcase applications include multi-location benchmarking, retail concept evaluation, and market-wide satisfaction mapping. Our key takeaway: LLMs don’t replace human judgment; they enhance it. When combined with expert validation and continuous feedback loops, LLMs can make CX analytics smarter, faster, and more actionable. 📄 The full article is accessible via the Ebsco and Genios databases - or PM me! 😉 #ScienceMeetsPractice #Marketing #ConsumerBehavior

3 Comments

Sudarshan Lamkhede

AI/ML Leader @ Meta | ex-Netflix | Search and Recommender Systems, Personalization, Ads

19,615 followers 1mo

I have been thinking about building self-improving steerable recommender systems with LLM agents. Of course, brilliant minds have already started to think in that direction. Among those efforts, SimUSER comes closest to what I am imagining. Key pieces are discussed in the SimUSER paper https://lnkd.in/gkW_SP-m by Nicolas Bougie and Narimasa Watanabe. They propose an agent framework that constructs user personas from historical data and then uses those agents to simulate interactions with a recommender system and conduct offline A/B tests, yielding better directional alignment with real user A/B tests than other frameworks. I think this can be extended further: For systems that are starting fresh, simulate users based on your (i.e., the "builders'") understanding of your addressable user cohorts. Simulate. Pick a few winning variations from offline exploration done by the agents. Deploy them in the real world. See how users react. Record. Let the simulation refine itself. Repeat. We can make the agents optimize how well they fit real-world observations. You burn your token budget, but you could significantly shorten the time to improve. Software development times have shortened, and so would the A/B cycles. If you use powerful models, they can also interact with UIs designed for humans (though I am not sure whether they can "simulate" real humans in that respect yet). Humans remain in the middle via the real-world A/B tests and some lightweight validation from builders before allocating real users to the A/B tests. Instead of "age, personality, and occupation," build a textual description of what each of your users likes. It can be surfaced back to human users as their preferences. These can be further edited by the human users to "steer" the recommendations in the direction they want. An afterthought: Do we really need to design search and recommender systems (user experience/interface included) for humans in the future? 
Increasingly, LLM agents are acting on behalf of their human owners, including interacting with these systems (e.g., agents shopping). If we need to target LLM agents as the primary population of consumers of search results and recommendations, what would have to be different? #aiagents #recommendersystems #search #llm

6 Comments

Ruslan Desyatnikov

53,560 followers 1y

When a fast-growing streaming client asked us to test their video recommendation engine, their ask sounded simple: "Just make sure it works." 👍 The APIs responded. 👍 The load times were fast. 👍 The automation scripts passed. And yet, users hated the recommendations. They felt off. Irrelevant. Even frustrating. That's when we stepped in with Human Intelligence Software Testing (HIST), which at the time was still in its proof-of-concept stage, and the difference was night and day. 👎 Automation couldn't tell if a thriller was being served to a child's profile. 👎 It couldn't ask, "Why is a Spanish user getting English content?" 👎 It definitely didn't catch the loop of crime dramas after one documentary binge. This wasn't just about test cases. It was about testers who think like users, not robots. After we applied HIST, satisfaction scores jumped by 27%, session times increased, and bounce rates dropped. We didn't just test functionality, we validated experience. If you're curious how we did it and what real-world scenarios we tested, read the article below.

How We Used Human Intelligence to Test a Video Recommendation Engine Ruslan Desyatnikov on LinkedIn

2 Comments

Tatyana Arbouzova

14,948 followers 5mo

If you’re a tester and still taking pride primarily in finding functional bugs, you’re focusing on the wrong problem. Modern software systems are complex. Most serious issues today don’t come from a button not working—they come from integration gaps, broken end-to-end flows, and real customer journeys failing across services. In reality, quality engineers are usually: --Embedded within a single dev team --Owning one service, feature, or a narrow slice of a product --Reporting either to a test manager or an engineering leader responsible for that service What’s rare? Quality engineers owning the full end-to-end customer experience across products and services: ++Analyzing real customer behavior ++Designing scenarios that reflect how users actually use the system ++Ensuring the most painful customer workflows are protected and prioritized by PMs and dev teams If your team already does this—congratulations. You’re among the rare ones. If not, this is where we need tester eyes the most. And functional bugs? Those should be caught by developers. Code should work in isolation by default. The era of “functional testers” is behind us. Today—and tomorrow—belongs to integration and end-to-end quality engineering. #QualityEngineering #SoftwareTesting #EndToEndTesting #IntegrationTesting #TestAutomation #QE #EngineeringLeadership #ProductQuality #CustomerExperience #DevQuality #ShiftLeft #ModernTesting #AIinTesting #AgenticQuality #TechLeadership

1 Comment

Karun Thankachan

Senior Data Scientist @ Walmart (ex-FAANG) | Building & Explaining Applied ML, Agentic AI & RecSys Systems

98,026 followers 3w

The rise of LLMs like ChatGPT is changing how users interact with systems. Instead of browsing or clicking through items, users now express intent directly in natural language, often with explicit constraints and goals. This exposes a gap in traditional recommendation systems. Methods like Matrix Factorization assume preferences can be learned from historical interactions and encoded into latent representations, which works for “more of the same” recommendations. But LLM-shaped behavior is different: users ask complex queries like “a durable laptop for graphic design under $1500,” turning the problem into reasoning over constraints rather than ranking past behavior. As recommenders evolve into LLM-style interactive assistants, evaluation needs to catch up. The authors found that standard recommendation datasets (like MovieLens or Amazon Beauty) lack the high-quality textual queries needed to test LLMs. To bridge this gap, they created RecBench+, which includes ~30,000 queries categorized into two main types - 1/ Condition-based Queries: These test the model's ability to follow specific constraints, which can be explicit (e.g., "movies featuring Gwyneth Paltrow"), implicit (e.g., "movies with the same cinematographer as Stay Hungry," which requires first identifying the cinematographer as David Worth), or even misinformed (e.g., asking for movies directed by Spielberg but referencing Avatar). 2/ User Profile-based Queries: These test personalization based on interests (inferred from interaction history) or demographics (age, gender, occupation). So what were the takeaways? 𝐑𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠 𝐌𝐨𝐝𝐞𝐥𝐬 𝐚𝐫𝐞 𝐒𝐮𝐩𝐞𝐫𝐢𝐨𝐫 𝐟𝐨𝐫 "𝐈𝐦𝐩𝐥𝐢𝐜𝐢𝐭" 𝐓𝐚𝐬𝐤𝐬: While standard LLMs (like GPT-4o) are strong at explicit conditions, models with advanced reasoning capabilities (like DeepSeek-R1) perform significantly better on implicit and misinformed queries. 
For instance, DeepSeek-R1 can "think through" whether a mentioned cinematographer is actually correct before making a recommendation. 𝐓𝐡𝐞 𝐇𝐢𝐬𝐭𝐨𝐫𝐲/𝐂𝐨𝐧𝐬𝐭𝐫𝐚𝐢𝐧𝐭 𝐓𝐫𝐚𝐝𝐞-𝐨𝐟𝐟: Incorporating a user's interaction history improves Precision because it helps filter large pools of potential candidates. However, it can actually lower performance because the model might get "distracted" by the user's historical preferences and ignore the specific constraints of the current query. 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐯𝐢𝐚 𝐓𝐰𝐨-𝐒𝐭𝐚𝐠𝐞 𝐅𝐢𝐧𝐞-𝐓𝐮𝐧𝐢𝐧𝐠: The best performance gains were seen using a two-stage approach: Supervised Fine-Tuning (SFT) to "warm up" the model, followed by Reinforcement Fine-Tuning (RFT). RFT alone was less effective, suggesting the model needs the SFT phase to learn the basic task structure before it can effectively explore and refine its reasoning. Check out the full paper here: https://lnkd.in/e_VbHU3N
