Don’t Focus Too Much On Writing More Tests Too Soon

📌 Prioritize Quality over Quantity: Make sure the tests you have (even if that is just a single test) are useful, well written, and trustworthy. Make them part of your build pipeline. Make sure you know who needs to act when a test fails, and who should write the next test.
📌 Test Coverage Analysis: Regularly assess the coverage of your tests to ensure they adequately exercise all parts of the codebase. Code coverage tools can help identify areas where additional testing is needed.
📌 Code Reviews for Tests: Just like production code, tests should undergo thorough code review to ensure their quality and effectiveness. This catches issues or oversights in the testing logic before they are integrated into the codebase.
📌 Parameterized and Data-Driven Tests: Use parameterized and data-driven testing techniques to increase the versatility and comprehensiveness of your tests. This lets you cover a wider range of scenarios with minimal additional effort.
📌 Test Stability Monitoring: Monitor the stability of your tests over time to detect flakiness or reliability issues. Continuous monitoring helps identify and address recurring problems, preserving trust in your test suite.
📌 Test Environment Isolation: Run tests in isolated environments to minimize interference from external factors. This keeps results consistent and reliable regardless of changes in the development or deployment environment.
📌 Test Result Reporting: Implement robust reporting for test results, including detailed logs and notifications, so failures can be identified and resolved quickly.
📌 Regression Testing: Integrate regression testing into your workflow to detect unintended side effects of code changes. Automated regression tests help ensure that existing functionality remains intact as the codebase evolves, enhancing overall trust in the system.
📌 Periodic Review and Refinement: Regularly review and refine your testing strategy based on feedback and lessons learned from previous testing cycles. This iterative approach continually improves the effectiveness and trustworthiness of your testing process.
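The parameterized-testing point above can be sketched with pytest. This is a minimal illustration, not any particular team's suite: the `slugify` helper is a hypothetical function under test, and each extra scenario costs one tuple rather than a new test function.

```python
import re

import pytest


def slugify(title: str) -> str:
    """Hypothetical helper under test: lowercase, then replace runs of
    non-alphanumerics with '-' and trim stray dashes."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


# One test function, many data-driven cases: pytest runs the assertion
# once per (title, expected) pair and reports each case separately.
@pytest.mark.parametrize(
    "title, expected",
    [
        ("Hello World", "hello-world"),
        ("  Spaces  everywhere ", "spaces-everywhere"),
        ("Already-slugged", "already-slugged"),
        ("Symbols!@#Removed", "symbols-removed"),
    ],
)
def test_slugify(title: str, expected: str) -> None:
    assert slugify(title) == expected
```

Run with `pytest <file>`; a failing case shows exactly which input broke, which keeps a small suite trustworthy as it grows.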
Testing and Evaluation in Product Development
Summary
Testing and evaluation in product development means examining new ideas, prototypes, or finished products to see if they work as intended and meet customer needs before investing in full-scale production. This process reduces risk, uncovers hidden issues, and helps companies build solutions that are actually useful.
- Assess real-world risks: Always test for value, usability, viability, and feasibility with real users to avoid building features that miss the mark.
- Integrate continuous feedback: Build regular review and testing into every stage so you can catch and fix problems early—before they become costly.
- Use diverse testing methods: Mix human evaluation, automated checks, and prototype testing to gain a complete picture of how your product performs.
How Big Tech Tests in Production Without Breaking Everything

Most outages happen because changes weren’t tested under real-world conditions before deployment. Big tech companies don’t gamble with production. Instead, they use Testing in Production (TiP)—a strategy that ensures new features and infrastructure work before they go live for all users. Let’s break down how it works.

1/ Shadow Testing (Dark Launching)
This is the safest way to test in production without affecting real users.
# How it works:
- Incoming live traffic is mirrored to a shadow environment running the new version of the system.
- The shadow system processes requests but doesn’t return responses to actual users.
- Engineers compare outputs from the old and new systems to detect regressions before deployment.
# Why is this powerful?
- It validates performance, correctness, and scalability against real-world traffic patterns.
- There is no risk of breaking the user experience while testing.
- It uncovers unexpected edge cases before rollout.

2/ Synthetic Load Testing – Simulating Real-World Usage
Sometimes using real user traffic isn’t feasible due to privacy regulations or data sensitivity. Instead, engineers generate synthetic requests that mimic real-world usage patterns.
# How it works:
- Scripted requests are sent to production-like environments to simulate actual user interactions.
- Engineers analyze response times, bottlenecks, and potential crashes under heavy load.
- It helps answer: How does the system perform under high concurrency? Can it handle sudden traffic spikes? Are there memory leaks or slowdowns over time?
🔹 Example: Netflix generates synthetic traffic to test how its recommendation engine scales during peak usage.

3/ Feature Flags & Gradual Rollouts – Controlled Risk Management
The worst thing you can do? Deploy a feature to all users at once and hope it works. Big tech companies avoid this by using feature flags and staged rollouts.
# How it works:
- New features are rolled out to a small percentage of users first (1% → 10% → 50% → 100%).
- Engineers monitor error rates, performance, and feedback.
- If something goes wrong, they can immediately roll back without affecting everyone.
# Why is this powerful?
- It minimizes risk—only a fraction of users are affected if a bug is found.
- Engineers get real-world validation in a controlled way.
- It allows A/B testing to compare the impact of new vs. old behavior.
🔹 Example: Facebook uses feature flags to release new UI updates to a limited user group first. If engagement drops or errors spike, they disable the feature instantly.

Would you rather catch a bug before or after it takes down your system?
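The staged rollout described above hinges on one mechanical detail: a user's bucket must be stable, so widening the percentage adds users without flapping anyone in and out. A minimal sketch (function and flag names are illustrative, not any vendor's API) using a hash of the flag and user id:

```python
import hashlib


def in_rollout(user_id: str, flag_name: str, percent: float) -> bool:
    """Stable percentage gate: hash the (flag, user) pair into [0, 100)
    and enable the feature for users whose bucket falls below `percent`."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100.0  # deterministic value in [0, 100)
    return bucket < percent


# Widening the rollout never kicks out a user who already has the feature:
user = "user-42"
assert not in_rollout(user, "new-ui", 0)       # 0% rollout: nobody gets it
stages = [in_rollout(user, "new-ui", p) for p in (1, 10, 50, 100)]
assert stages == sorted(stages)                # once enabled, stays enabled
assert in_rollout(user, "new-ui", 100)         # 100% rollout: everybody
```

Salting the hash with the flag name means different flags bucket users independently, so the same 1% cohort isn't the guinea pig for every experiment.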
-
One of the most common mistakes teams make when evaluating early product features is asking users whether they like an idea and treating the answer as evidence. Decades of behavioral research and very practical product research work show that this is a weak signal. People are generally bad at predicting what they will use, adopt, or pay for in the future, especially when there is no cost, effort, or tradeoff attached to their answer. That is why early feature evaluation should focus on behavior rather than belief.

When a feature is only a concept, a smoke test can already tell you a lot. Exposing users to the idea through a landing page, announcement, or waitlist and observing whether they click or sign up answers a very specific question: is this worth building at all? Not whether it sounds good in theory.

When an idea becomes clickable, fake door tests bring the decision closer to real behavior. Placing a realistic entry point inside the product and observing who actually tries to use it shows intent in context. The power of this method comes from the fact that users believe the feature is real at the moment of interaction. Transparency afterward is essential, but the action itself is the signal.

For complex or technically risky features, especially AI, automation, or recommendation systems, Wizard of Oz prototyping allows teams to observe natural behavior before automation exists. Users interact with what looks like a fully functional system, while a human performs the work behind the scenes. This reveals expectations, decision making, and breakdowns that are invisible in abstract discussions.

Concierge MVPs go one step further by making the human involvement explicit. Here, the value is delivered manually, often in a high-touch way, to see whether users actually engage, return, and benefit. If people do not use or value the service when friction is low and quality is high, automation will not fix the underlying problem.
Across all of these approaches, the principle is the same. Early feature evaluation should not ask people what they like. It should watch what they do when a real opportunity to engage is placed in front of them.
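The behavioral signal behind a fake door test is ultimately one ratio: of the users who saw the entry point, how many tried to walk through it? A toy instrumentation sketch (all names and the simulated 10% click rate are hypothetical, purely to show the shape of the measurement):

```python
from collections import Counter

events = Counter()


def show_entry_point(user_id: str) -> None:
    """Record that a user was exposed to the fake-door entry point."""
    events["exposed"] += 1


def click_entry_point(user_id: str) -> None:
    """Record a click. In a real test, this is where you would show a
    transparent 'coming soon' message rather than a broken feature."""
    events["clicked"] += 1


# Simulated traffic: 1000 exposures, with every 10th user clicking.
for uid in range(1000):
    show_entry_point(f"u{uid}")
    if uid % 10 == 0:
        click_entry_point(f"u{uid}")

ctr = events["clicked"] / events["exposed"]
print(f"fake-door click-through: {ctr:.1%}")  # prints 10.0% for this simulation
```

Whether 10% is a go signal depends on the feature's economics; the point is that the decision rests on an observed action, not a survey answer.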
-
One of the hottest topics in AI is evals (evaluations). Effective human + AI assessment of outputs is essential for building scalable, self-improving products. Here is the case being laid out for evals in product development.

🔥 Evals are the hidden lever of AI product success. Evaluations—not prompts, not model choice—are what separate mediocre AI products from exceptional ones. Industry leaders like Kevin Weil (OpenAI), Mike Krieger (Anthropic), and Garry Tan (YC) all call evals the defining skill for product managers.

🧭 Evals define what “good” means in AI. Unlike traditional software tests with binary pass/fail outcomes, AI evals must measure subjective qualities like accuracy, tone, coherence, and usefulness. Good evals act like a “driving test,” setting criteria across awareness, decision-making, and safety.

⚙️ Three core approaches dominate evals. PMs rely on three methods: human evals (direct but costly), code-based evals (fast but limited to deterministic checks), and LLM-as-judge evals (scalable but probabilistic). The strongest systems blend them—human judgments set the gold standard, while LLM judges extend coverage and scalability.

📐 Every strong eval has four parts. Effective evals set the role, provide the context, define the goal, and standardize labels/scoring. Without this structure, evals drift into vague “vibe checks.”

🔄 The eval flywheel drives iteration speed. The intention is to drive a positive feedback loop where evals enable debugging, fine-tuning, and synthetic data generation. This cycle compounds over time, becoming a moat for successful AI startups.

📊 Bottom-up metrics reveal real failure modes. While common criteria include hallucination, safety, tone, and relevance, the most effective teams identify metrics directly from data. Human audits paired with automated checks help surface the real-world patterns generic metrics often miss.

👥 Human oversight keeps AI honest. LLM-as-judge systems make evals scalable, but without periodic human calibration, they drift. The most reliable products maintain a human-in-the-loop review process—auditing eval results, correcting blind spots, and ensuring that automated judgments remain aligned with real user expectations.

📈 PMs must treat evals like product metrics. Just as PMs track funnels, churn, and retention, AI PMs must monitor eval dashboards for accuracy, safety, trust, contextual awareness, and helpfulness. Declining repeat usage, rising hallucination rates, or style mismatches should be treated as product health warnings.

Some say this case is overstated, pointing to the unreliability of evals or their relatively low current use in AI dev pipelines. However, this is largely a question of working out how to do them well, especially effectively integrating human judgment into the process.
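The four-part structure above (role, context, goal, standardized labels) can be made concrete as a judge-prompt template. This is an illustrative sketch, not any product's actual eval; the prompt wording, label set, and function names are all hypothetical, and the judge call itself is left out since it depends on your model client:

```python
# Four-part eval prompt: ROLE, CONTEXT, GOAL, LABELS.
JUDGE_PROMPT = """\
ROLE: You are a strict evaluator of customer-support replies.
CONTEXT:
  User message: {user_message}
  Assistant reply: {assistant_reply}
GOAL: Judge whether the reply is accurate, on-topic, and polite.
LABELS: Answer with exactly one of: PASS, FAIL, UNSURE.
"""

ALLOWED_LABELS = {"PASS", "FAIL", "UNSURE"}


def build_judge_prompt(user_message: str, assistant_reply: str) -> str:
    """Fill the four-part template for one (input, output) pair."""
    return JUDGE_PROMPT.format(
        user_message=user_message, assistant_reply=assistant_reply
    )


def parse_label(raw_judge_output: str) -> str:
    """Standardize the judge's free-text output; anything off-script
    becomes UNSURE so it gets routed to human review instead of guessed."""
    label = raw_judge_output.strip().upper()
    return label if label in ALLOWED_LABELS else "UNSURE"
```

The standardized label set is what makes the human-calibration loop possible: periodically score a gold set labeled by humans and compare agreement, so judge drift shows up as a number rather than a vibe.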
-
The most underestimated part of building LLM applications? Evaluation.

Evaluation can take up to 80% of your development time (because it’s HARD). Most people obsess over prompts. They tweak models. Tune embeddings. But when it’s time to test whether the whole system actually works? That’s where it breaks. Especially in agentic RAG systems, where you’re orchestrating retrieval, reasoning, memory, tools, and APIs into one seamless flow. Implementation might take a week. Evaluation takes longer. (And it’s what makes or breaks the product.)

Let’s clear up a common confusion: 𝗟𝗟𝗠 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 ≠ 𝗥𝗔𝗚 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻. LLM eval tests reasoning in isolation - useful, but incomplete. In production, your model isn’t reasoning in a vacuum. It’s pulling context from a vector DB, reacting to user input, and shaped by memory + tools. That’s why RAG evaluation takes a system-level view. It asks: did this app respond correctly, given the user input and the retrieved context?

Here’s how to break it down:

𝗦𝘁𝗲𝗽 𝟭: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹.
→ Are the retrieved docs relevant? Ranked correctly?
→ Use LLM judges to compute context precision and recall.
→ If ranking matters, compute NDCG and MRR.
→ Visualize embeddings (e.g. UMAP).

𝗦𝘁𝗲𝗽 𝟮: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻.
→ Did the LLM ground its answer in the right info?
→ Use heuristics, LLM-as-a-judge, and contextual scoring.

In practice, treat your app as a black box and log:
- User query
- Retrieved context
- Model output
- (Optional) Expected output

This lets you debug the whole system, not just the model.

𝘏𝘰𝘸 𝘮𝘢𝘯𝘺 𝘴𝘢𝘮𝘱𝘭𝘦𝘴 𝘢𝘳𝘦 𝘦𝘯𝘰𝘶𝘨𝘩? 5–10? Too few. 30–50? A good start. 400+? Now you’re capturing real patterns and edge cases. Still, start with however many samples you have available, and keep expanding your evaluation split. It’s better to have an imperfect evaluation layer than nothing. Also track latency, cost, throughput, and business metrics (like conversion or retention).

Some battle-tested tools:
→ RAGAS (retrieval-grounding alignment)
→ ARES (factual grounding)
→ Opik by Comet (end-to-end open-source eval + monitoring)
→ LangSmith, Langfuse, Phoenix (observability + tracing)

TL;DR: Agentic systems are complex. Success = making evaluation part of your design from Day 0. We unpack this in full in Lesson 5 of the PhiloAgents course. 🔗 Check it out here: https://lnkd.in/dA465E_J
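The ranking metrics named in Step 1 are simple to compute once you have labeled relevant documents per query. A toy sketch of precision@k and MRR (the doc ids and labels are made up; in practice they come from the black-box logs described above):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are labeled relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k


def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank: average of 1/rank of the first relevant doc
    per query, contributing 0 when no relevant doc is retrieved."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(queries)


# Two toy queries: first relevant doc at rank 2 (RR 0.5) and rank 1 (RR 1.0).
sample = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d2"}),
]
assert precision_at_k(["d3", "d1", "d7"], {"d1"}, 3) == 1 / 3
assert mrr(sample) == 0.75
```

These heuristics complement LLM-judge scores: they are cheap, deterministic, and catch retrieval regressions even when the generator happens to paper over them.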
-
I am tired of watching companies waste millions building software products nobody uses. The pattern is always the same: Have an idea. Build it. Launch it. Crickets.

Here is the uncomfortable reality... Decades of data show that roughly 70-90% of features built in traditional models fail to deliver business results. Not because people aren't trying. Because they're building first and learning second.

Every failed feature has two costs:
𝗗𝗶𝗿𝗲𝗰𝘁 𝗰𝗼𝘀𝘁 (expensive engineering time)
𝗢𝗽𝗽𝗼𝗿𝘁𝘂𝗻𝗶𝘁𝘆 𝗰𝗼𝘀𝘁 (what you could have built instead)

Smart companies get evidence before building. They assess four risks:
𝗩𝗮𝗹𝘂𝗲 → Will customers actually buy and use it?
𝗨𝘀𝗮𝗯𝗶𝗹𝗶𝘁𝘆 → Can users figure out how it works?
𝗩𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆 → Can we sustain it financially?
𝗙𝗲𝗮𝘀𝗶𝗯𝗶𝗹𝗶𝘁𝘆 → Do we have the technical skills to build it?

These companies reduce their known risks through the Product Discovery practice of experimentation. Before spending months building, spend days testing the ideas.
➡️ Create a prototype
➡️ Test the four risk areas with real users
➡️ Measure the results
➡️ Be transparent about how to move forward

Learn what works in weeks, not quarters. Product discovery is the difference between building 10 features where 7-9 fail, or testing 10 ideas rapidly and building only the 1-3 that show promise. Same engineering capacity. Dramatically different outcomes.

𝗬𝗼𝘂𝗿 𝗷𝗼𝗯 𝗶𝘀𝗻'𝘁 𝘁𝗼 𝗯𝘂𝗶𝗹𝗱 𝗳𝗲𝗮𝘁𝘂𝗿𝗲𝘀 𝗾𝘂𝗶𝗰𝗸𝗹𝘆. 𝗬𝗼𝘂𝗿 𝗷𝗼𝗯 𝗶𝘀 𝘁𝗼 𝘀𝗼𝗹𝘃𝗲 𝗰𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝗽𝗿𝗼𝗯𝗹𝗲𝗺𝘀 𝘄𝗵𝗶𝗹𝗲 𝗺𝗶𝗻𝗶𝗺𝗶𝘇𝗶𝗻𝗴 𝘄𝗮𝘀𝘁𝗲.

What's preventing your teams from testing before building?
-
Product development entails inherent risks: hasty decisions can lead to losses, while overly cautious changes may result in missed opportunities. To manage these risks, proposed changes undergo randomized experiments, guiding informed product decisions. This article, written by Data Scientists from Spotify, outlines the team’s decision-making process and discusses how results from multiple metrics in A/B tests can inform cohesive product decisions. A few key insights include:
- Defining key metrics: It is crucial to establish success, guardrail, deterioration, and quality metrics tailored to the product. Each type serves a distinct purpose—whether to enhance, ensure non-deterioration, or validate experiment quality—playing a pivotal role in decision-making.
- Setting explicit rules: Clear guidelines mapping test outcomes to product decisions are essential to mitigate metric conflicts. Given that metrics may move in different directions, establishing rules beforehand prevents subjective interpretations during scientific hypothesis testing.
- Handling technical considerations: Experiments involving multiple metrics raise concerns about false positives, requiring multiple-testing corrections. The team advises applying such corrections for success metrics but emphasizes that this isn't necessary for guardrail metrics. This approach ensures the treatment remains significantly non-inferior to the control across all guardrail metrics.
Additionally, the team proposes comprehensive guidelines for decision-making, incorporating advanced statistical concepts. This resource is invaluable for anyone conducting experiments, particularly those dealing with multiple metrics.
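The asymmetric treatment of success and guardrail metrics can be sketched as a toy decision rule. This is an illustration of the idea, not Spotify's actual procedure: a simple Bonferroni correction stands in for the multiple-testing correction on success metrics, thresholds are made up, and the guardrail non-inferiority checks are assumed to be computed upstream.

```python
def ship_decision(
    success_pvals: list[float],
    guardrail_noninferior: list[bool],
    alpha: float = 0.05,
) -> bool:
    """Ship only if at least one success metric is significant after a
    multiple-testing correction AND every guardrail metric passed its
    (uncorrected) non-inferiority check."""
    corrected_alpha = alpha / len(success_pvals)  # Bonferroni correction
    any_success = any(p < corrected_alpha for p in success_pvals)
    all_guardrails_ok = all(guardrail_noninferior)
    return any_success and all_guardrails_ok


# Two success metrics; 0.01 < 0.05/2 survives the correction, guardrails hold:
assert ship_decision([0.01, 0.40], [True, True]) is True
# Same win, but a guardrail deteriorated -> don't ship:
assert ship_decision([0.01, 0.40], [True, False]) is False
# No success metric survives the corrected threshold -> don't ship:
assert ship_decision([0.04, 0.40], [True, True]) is False
```

Encoding the rule before the experiment runs is the point the article makes: it removes the temptation to reinterpret conflicting metric movements after seeing the data.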
#datascience #experimentation #analytics #decisionmaking #metrics – – – Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts: -- Spotify: https://lnkd.in/gKgaMvbh -- Apple Podcast: https://lnkd.in/gj6aPBBY -- Youtube: https://lnkd.in/gcwPeBmR https://lnkd.in/gewaB9qC
-
Why Every Product Manager Needs A/B Testing 🚀

Imagine cooking up a recipe for the perfect product feature. Would you trust your instincts blindly, or would you test different ingredients to get the best taste? That’s where A/B testing comes in. It’s the secret sauce that helps Product Managers make data-driven decisions with confidence. Here’s everything you need to know to master A/B testing:

❓ What is A/B Testing? ❓
A/B testing is the process of comparing two or more versions of a product to determine which one performs better. The versions might differ in small ways - a new button design, a revamped landing page, or an updated pricing structure - but the impact on user behaviour can be monumental. This method helps you validate assumptions, optimize user experiences, and ensure every product decision adds value.

⚙️ How to Conduct a Successful A/B Test? ⚙️
🔹 Set Clear Goals: Ask yourself what you are trying to improve. It could be anything from conversion rates to user satisfaction. Your goal is your North Star.
🔹 Choose the Right Metrics: Metrics like click-through rate (CTR), time spent on a page, or purchase frequency will guide you in evaluating success.
🔹 Hypothesize: Frame your test with a simple prediction. Example: “I believe changing the CTA button color from blue to green will increase clicks by 15%.”
🔹 Design Your Experiment: Define your control group (current version) and treatment group (variant to test), ensuring a large enough sample size for reliable results. Run the test for a sufficient duration to capture meaningful patterns and user behaviour.
🔹 Analyze & Implement: Use tools like Google Optimize or Optimizely to analyze results and determine statistical significance. Roll out the winning variant confidently, or refine your hypothesis for future iterations if results are inconclusive.

♻️ Four Types of A/B Tests Every PM Should Know ♻️
1️⃣ Feature Testing: Validate hypotheses for new features pre-launch.
2️⃣ Live Testing: Fine-tune existing features already in the wild.
3️⃣ Trapdoor Testing: Redirect traffic between variants dynamically.
4️⃣ Multi-Armed Bandit (MAB): Let machine learning allocate traffic to better-performing variants in real time.

❌ Common Pitfalls to Avoid ❌
1️⃣ Testing trivial changes that won’t move the needle.
2️⃣ Ignoring sample size requirements—small audiences lead to inaccurate conclusions.
3️⃣ Treating A/B testing as a one-off exercise. Optimization is an ongoing journey.

What’s been your most surprising A/B testing discovery? Let’s discuss in the comments! 👇

Ready to embark on an exhilarating journey into the heart of product management? I’ve recently launched a cohort focused on teaching end-to-end product management as well as providing career placement opportunities! 🧠 Fill in the form in the comments to register your interest in the cohort and I’ll reach out to you with further details. ✍️

#ProductManagement #ABTesting #PMTools #ContinuousOptimization
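The Multi-Armed Bandit type above can be illustrated with a toy Thompson-sampling simulation. Everything here is made up for the sketch: the two variants, their "true" conversion rates, and the traffic volume. Each variant keeps a Beta posterior over its conversion rate, and traffic drifts toward whichever variant's sampled rate is highest.

```python
import random

random.seed(7)  # fixed seed so the simulation is reproducible

variants = {"A": {"wins": 0, "losses": 0}, "B": {"wins": 0, "losses": 0}}
true_rates = {"A": 0.05, "B": 0.10}  # simulated ground-truth conversion rates


def pick_variant() -> str:
    """Thompson sampling: draw one sample from each variant's
    Beta(wins + 1, losses + 1) posterior and serve the highest draw."""
    samples = {
        name: random.betavariate(stats["wins"] + 1, stats["losses"] + 1)
        for name, stats in variants.items()
    }
    return max(samples, key=samples.get)


for _ in range(5000):
    chosen = pick_variant()
    converted = random.random() < true_rates[chosen]
    variants[chosen]["wins" if converted else "losses"] += 1

# Exploration tapers off as evidence accumulates, so the better variant (B)
# should end up receiving most of the traffic.
traffic = {name: s["wins"] + s["losses"] for name, s in variants.items()}
print(traffic)
```

Unlike a fixed 50/50 split, the bandit limits how long users are exposed to the losing variant, which is exactly the real-time allocation the post describes.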
-
You cannot guess what to do with your website or app. No feature, no matter how big or small it is, how much of a 'quick win' or 'no brainer' you think it is, how much everyone 'likes' it, or even how much research and data you based it on, has any guarantee of actually working. Many will have the direct opposite effect to the one you expect.

Product development is therefore an enormous risk. It's a risk because you are very likely to spend money that is pointless, but also because you might actually damage your business at the same time. You cannot afford these risks!

Even when product teams try to incorporate testing into their plans, it is almost never effective because the right processes are not embedded. Testing a feature after it is built does not help if it does not work and you just spent a ton of time building it. In order to de-risk your investment in this area, you need a process whereby every single idea is initially subjected to the same question:

> What is the smallest possible thing we can test/analyse that validates the assumption(s) within this idea? <

There is ALWAYS a way to do this, no matter how big and complicated the idea might seem. If your initial test works, how do you evolve it into something slightly bigger? If THAT works, what is the MVP version, and so on? Only by taking this approach can you avoid the risk of wasting money and damaging your business.

"But it will slow us down!" - Why do you want to be fast to market with the wrong thing and waste a load of money? Also, it is actually far faster, because all your ideas can be tested very simply and quickly rather than sitting in a 'roadmap' for 2 years.

#cro #experimentation #ecommerce #digitalmarketing #ux #userexperience #productdevelopment
-
💎 Research Methods Comparison Matrix

This matrix, created by John Hu, PhD and his team, outlines various research methods commonly used in product design to gather insights about products and users. It comes in very handy when you want to quickly compare multiple methods based on key characteristics, including their qualitative or quantitative nature, where they are used in the product lifecycle, their strengths and weaknesses, common mistakes, and the typical deliverables they produce.

🍏 Qualitative vs. Quantitative
Qualitative methods are good for exploration and gathering subjective insights, while quantitative methods work for measurable statistical signals.

🍏 When to use in the product lifecycle
Shows which stages of the product lifecycle each method is most suitable for:
✔ Understand: Early research to understand user needs
✔ Prototype: Testing concepts
✔ Build: Developing & refining the product
✔ Launch: Post-launch evaluation

🍏 Strengths vs. Weaknesses
Highlights the main advantages of each method. For instance, usability testing is great for gathering instant feedback and direct observations, while surveys are great for large, representative datasets with quick turnaround. The matrix also lists the challenges or limitations of each method. For example, in-depth interviews are time- and effort-intensive, and eye tracking is not effective for broad-scope analysis.

🍏 Typical deliverables
Describes the expected outputs of each method, such as use cases, quotes, and videos for usability testing, and personas and journey maps for ethnographic methods.

#UX #research #uxresearch #design #productdesign #uxdesign