🧵 Why the obsession with p < 0.05 is hurting science.

1/ This meme says it all. p = 0.0501? Pain. p = 0.0499? Pure euphoria. Two numbers, nearly identical, yet we treat them like night and day. Why?

2/ The 0.05 p-value threshold is arbitrary. It came from R.A. Fisher in the 1920s, and we've been worshipping it like a sacred line ever since. But it's not magic. It's convention.

3/ What does p = 0.05 actually mean? It means: if the null hypothesis is true, there's a 5% chance of seeing a result at least this extreme by random chance. That's it. Not "this is true." Not "this will replicate."

4/ p = 0.0499 and p = 0.0501 are nearly identical. But one gets the "significant" label and the other gets dismissed. That's broken thinking.

5/ Quoting Mike Love: "A smaller p-value is not more interesting." "We should focus on effect sizes." He's right.

6/ What's an effect size? It tells you how big the difference is, not just whether it's statistically detectable. A gene with a log2 fold change of 3 matters, even if p = 0.06.

7/ P-values shrink with more data. Got 10,000 samples? You'll find "significance" for even the tiniest differences. Statistically significant ≠ biologically meaningful.

8/ Also, be careful when testing thousands of genes. Even with a p < 0.05 threshold, false positives will sneak in. Use multiple testing correction (FDR, Bonferroni). Always.

9/ Let's reframe. Instead of "Did I beat the p < 0.05 line?", ask: Is the effect meaningful? Is it reproducible? Does it make biological sense?

10/ Want a better practice? Look at the distribution of p-values. Report adjusted p-values (FDR). Highlight effect sizes. Don't cherry-pick.

11/ And don't forget confidence intervals. They show the range of plausible effect sizes, not just a binary yes/no. More context, more truth.

12/ Key takeaways:
- 0.05 is a line in the sand, not a cliff
- p-values ≠ effect size
- Focus on biological meaning
- Always correct for multiple testing
- Use p-values as part of the story, not the whole story

13/ If you're making big decisions based on p = 0.0499 vs. p = 0.0501, you're not doing science. You're doing stats theater. Look deeper. Think harder. Go beyond the stars.

14/ And please, share this with a friend still chasing tiny p-values. Let's stop celebrating noise and start celebrating insight. I hope you've found this post helpful. Follow me for more. Subscribe to my FREE newsletter chatomics to learn bioinformatics https://lnkd.in/erw83Svn
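The multiple-testing advice in 8/ can be made concrete. Below is a minimal from-scratch sketch of the Benjamini-Hochberg (FDR) adjustment; in real analyses you would use an established implementation (e.g., statsmodels' `multipletests` or R's `p.adjust`), and the p-values here are made up for illustration.

```python
# Benjamini-Hochberg FDR adjustment, implemented from scratch for illustration only.

def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    n = len(pvals)
    # Indices sorted by ascending raw p-value
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    min_so_far = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        q = pvals[i] * n / rank
        min_so_far = min(min_so_far, q)
        adjusted[i] = min(min_so_far, 1.0)
    return adjusted

# Ten hypothetical raw p-values, five of which are "significant" at 0.05
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
adj = bh_adjust(raw)
n_raw_hits = sum(1 for p in raw if p < 0.05)  # 5 raw "discoveries"
n_fdr_hits = sum(1 for p in adj if p < 0.05)  # only 2 survive the correction
```

Note how the correction does exactly what the thread warns about: of five raw "significant" genes, only two survive once the expected false-discovery rate is controlled.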
Mastering Statistical Analysis
Explore top LinkedIn content from expert professionals.
-
-
STATISTICAL VS. SUBSTANTIVE SIGNIFICANCE

Take a moment to consider the following scenario. One study with n = 100 reports a focal effect with an associated p-value of 0.02. Another study with n = 1000 reports a focal effect with an associated p-value of 0.02. Which study presents stronger evidence that the effect is really there?

This scenario is adapted from Bakan (Psych. Bull., 1966). Many scholars chose the second study. They are wrong. I quote Bakan (p. 429): "The rejection of the null hypothesis when the number of cases is small speaks for a more dramatic effect in the population [larger effect size]; and if the p-value is the same, the probability of committing a Type I error remains the same."

Many papers implicitly or explicitly equate statistical significance with substantive significance. Yet a p-value does not inform you whether the effect has any real-world meaning. Ralph Tyler (Educ. Res. Bulletin, 1931) already wrote that a statistically significant difference is not necessarily an important difference, and a difference that is not statistically significant may be an important difference. Unfortunately, we are still making the same mistake 90 years later.

A statistically significant result may be substantively nonsignificant (trivial). But also, a statistically nonsignificant result may be substantively significant. I see so many studies in our field reporting regression coefficients with *** and I have no idea how large the effect is.

This tendency to equate statistical with substantive significance persists to the extent that the prestigious American Statistical Association (not exactly an organization afraid of advanced statistics) came out with a formal statement on p-values, the "ASA Statement on Statistical Significance and P-Values," cautioning researchers: "Statistical significance is not equivalent to scientific, human, or economic significance. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect. Any effect, no matter how tiny, can produce a small p-value if the sample size or measurement precision is high enough, and large effects may produce unimpressive p-values if the sample size is small or measurements are imprecise."

I am not arguing against statistical significance. Rather, articles should report statistical AND substantive significance. In my view, our primary task is uncovering factors that make a meaningful difference. That calls for effect sizes. As a nice "bonus," substantive significance is less amenable to p-hacking than statistical significance.

If you enjoyed this, share it with others and follow me, Jan-Benedict Steenkamp, for more writing. Journal of Marketing
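Bakan's point can be quantified. Using a normal approximation to the two-sample t statistic (equal group sizes, so t ≈ d·√n/2 where d is Cohen's d and n is the total sample), one can back out the effect size each study's p = 0.02 implies. This is a rough stdlib sketch under those simplifying assumptions, not an exact t-based calculation:

```python
# Same p-value, different sample sizes -> very different implied effect sizes.
# Normal approximation: for two equal groups totaling n subjects, t ≈ d * sqrt(n) / 2.
from math import sqrt
from statistics import NormalDist

def implied_cohens_d(p_two_sided, n_total):
    """Cohen's d needed to just reach the given two-sided p with n_total subjects."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)  # critical value for two-sided p
    return 2 * z / sqrt(n_total)

d_small_study = implied_cohens_d(0.02, 100)   # ~0.47, a moderate effect
d_large_study = implied_cohens_d(0.02, 1000)  # ~0.15, a small effect
```

Identical p-values, but the n = 100 study implies an effect roughly three times larger, which is exactly why it is the stronger evidence for a substantively meaningful effect.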
-
A/B testing is a staple in the industry, often highlighted as the gold standard for experimentation. But how often do we talk about causal analysis, the broader and equally important field that underpins it? While it may be less commonly referenced, causal analysis is fundamental to answering deeper questions about cause-and-effect relationships in data. This introductory blog by a Microsoft data scientist provides a clear and approachable overview of causal analysis, breaking down its major components and their applications.

Broadly, causal analysis can be categorized into two key areas:

-- Causal Discovery: This focuses on identifying the underlying causal structure from data. It answers questions like, "What factors influence an outcome, and how are they connected?" Algorithms like the Peter-Clark (PC) algorithm and Greedy Equivalence Search help uncover these relationships, often represented as causal graphs.

-- Causal Inference: This focuses on quantifying the effect of one variable on another. It answers questions like, "How much does X cause Y?" Techniques range from experimental approaches like A/B testing to observational methods like propensity score matching, instrumental variables, and difference-in-differences.

Our commonly known A/B testing is a subset of causal inference and relies on controlled experiments to estimate effects. However, non-experimental approaches offer powerful alternatives, especially when experiments aren't feasible. If you're curious about expanding your understanding of causality and its practical applications, this blog is a great starting point to explore how causal analysis can elevate data-driven decision-making.
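One of the observational techniques named above, difference-in-differences, fits in a few lines in its simplest two-period, two-group form: compare the change in the treated group to the change in the control group. A minimal sketch with made-up numbers:

```python
# Difference-in-differences, simplest 2x2 form. All numbers are hypothetical.

def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """DiD estimate: change in the treated group minus change in the control group."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Mean outcome (e.g., weekly sign-ups) before and after a feature launch
effect = diff_in_diff(treat_pre=100.0, treat_post=130.0,
                      ctrl_pre=95.0, ctrl_post=105.0)
# effect = (130 - 100) - (105 - 95) = 20
```

The control group's change (+10) absorbs the shared trend, so the remaining +20 is attributed to the treatment, valid only under the parallel-trends assumption.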
#datascience #analytics #causal #discovery #inference #abtest – – – Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts: -- Spotify: https://lnkd.in/gKgaMvbh -- Apple Podcast: https://lnkd.in/gj6aPBBY -- Youtube: https://lnkd.in/gcwPeBmR https://lnkd.in/gfxTjapV
-
Here's an underrated approach to estimate the causal impact of a product change that affects some users but not others. This is common in product settings, so let's take a simple example: measuring the impact on app downloads when a change is made to the iOS version of an app but not the Android version. The three-step approach is:

𝐒𝐭𝐞𝐩 1: In the pre-treatment period, use a Bayesian structural time series function (𝘧) to learn how the treated group's time series relates to the control group's time series. In our example, this means learning the function that recreates iOS app downloads as a function of Android app downloads.

𝐒𝐭𝐞𝐩 2: In the post-treatment period, use the function (𝘧) to calculate the post-treatment counterfactual for iOS app downloads based on the actual Android app download values.

𝐒𝐭𝐞𝐩 3: Calculate the causal impact by taking the difference between the actual post-treatment iOS app downloads and the counterfactual values from Step 2.

The visual shows how this works! (Links to the R and Python packages in the comments)

Cite: Brodersen KH, Gallusser F, Koehler J, Remy N, Scott SL. Inferring causal impact using Bayesian structural time-series models. 𝘈𝘯𝘯𝘢𝘭𝘴 𝘰𝘧 𝘈𝘱𝘱𝘭𝘪𝘦𝘥 𝘚𝘵𝘢𝘵𝘪𝘴𝘵𝘪𝘤𝘴, 2015, Vol. 9, No. 1, 247-274
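To make the three steps concrete, here is the same counterfactual logic with ordinary least squares standing in for the Bayesian structural time-series model (the actual CausalImpact packages fit a BSTS, which also handles trend, seasonality, and uncertainty; all numbers below are made up):

```python
# Steps 1-3 of the approach, with simple linear regression as a stand-in for BSTS.
from statistics import mean

def fit_line(x, y):
    """Step 1: learn f (here a straight line) from the pre-treatment period."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Pre-treatment: iOS downloads track Android downloads (hypothetical daily counts)
android_pre = [100, 110, 105, 120, 115]
ios_pre     = [210, 228, 221, 248, 238]
slope, intercept = fit_line(android_pre, ios_pre)

# Step 2: counterfactual iOS downloads from the actual post-period Android values
android_post = [118, 125, 130]
ios_actual   = [290, 305, 318]  # observed after the iOS-only change
counterfactual = [slope * a + intercept for a in android_post]

# Step 3: causal impact = actual minus counterfactual, summed over the post period
impact = sum(o - c for o, c in zip(ios_actual, counterfactual))
```

Replacing the regression with a BSTS (R's CausalImpact, or its Python ports) is what turns this sketch into the published method, because BSTS also yields credible intervals around the counterfactual.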
-
How can an A/B test be “statistically significant” but not be totally trustworthy? I’ve been wrestling with this question for over a decade. Through extensive research, I've come to understand that a big part of the answer lies in confidence intervals. Here’s the simplest way to explain it.

Imagine you run an A/B test with very small numbers:
🚥 Version A: 82 visitors, 3 conversions
🚦 Version B: 75 visitors, 12 conversions

The math shows the result is statistically significant.
⚡ The p-value is 0.0088, well below the common p < 0.05 threshold
⚡ Observed power is reported as 95.09%, well above the standard 80% rate

The results look convincing. Statistical significance, long treated as the gold standard, has been achieved! But here's the problem. Statistical significance can be "gamed" with low-traffic tests because it only answers one narrow question:

🔦 If there were truly no difference between versions, how likely is it that this result would happen by chance?

That’s it. That's all statistical significance answers. It doesn't tell you whether the result is stable or repeatable. And, as you can imagine, with tiny samples, like 3 vs. 12 conversions, you get exaggerated effects. Every single conversion has an outsized influence. One or two people behaving differently can completely flip the outcome.

➡️ This is where confidence intervals come in. A confidence interval is the range of outcomes that could reasonably be true given the data. In small tests, that range is really wide. So the actual conversion effect might be smaller or larger than what you observed in the test. You can't know with precision. So you don't have a dependable estimate of how big the improvement really is, or whether the result would hold if you ran the test again.

It's important to realize that a confidence interval is not the same as a confidence level. Remember: a confidence interval is the range of values that could reasonably be true given the data. A label of “95% confidence” describes how that range was constructed, not how certain or correct the result is. Which means a 95% confidence interval can still be very wide, creating substantial uncertainty around the estimate.

When there's such uncertainty, the numbers may appear exaggerated. That's where Twyman’s law comes in. It states that anything that looks interesting or unusual is usually wrong. In small samples, results are extreme because the noise does most of the work. So while a statistically significant difference can be measured in a small-sample study, you can't reliably measure how big that difference actually is. That's why 3 vs. 12 conversions often fails to replicate once more data is collected.

📣 Call to action for 2026: Run tests that are not only statistically significant, but also have a large enough sample size to produce narrow confidence intervals, so you can not only detect effects but also estimate them precisely enough to make accurate, trustworthy test decisions.
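Working the post's numbers by hand shows both halves of the argument: the test comes out "significant," yet the 95% confidence interval on the lift is enormous. This stdlib sketch uses textbook normal-approximation formulas, which are themselves optimistic at counts this small:

```python
# Two-proportion z-test and 95% CI for the post's 3/82 vs. 12/75 example.
from math import sqrt
from statistics import NormalDist

n_a, conv_a = 82, 3
n_b, conv_b = 75, 12
p_a, p_b = conv_a / n_a, conv_b / n_b

# z-test with the pooled standard error under the null of no difference
pooled = (conv_a + conv_b) / (n_a + n_b)
se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se_null
p_value = 2 * (1 - NormalDist().cdf(z))  # below 0.01: "significant"

# 95% CI for the difference in conversion rates (unpooled standard error)
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
lo = (p_b - p_a) - 1.96 * se
hi = (p_b - p_a) + 1.96 * se
# The interval spans roughly +3 to +22 percentage points of lift
```

A p-value under 0.01 alongside an interval that can't distinguish a modest win from a massive one: that is the "significant but not trustworthy" gap in one calculation.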
-
🚨 Propensity Score Matching (PSM) is still being misused in 2026, and it shows up clearly in peer review. PSM is not a magic button. It is a study design strategy, not a post-hoc fix for confounding.

I put together a complete interpretation and reporting guide that I now use with:
• clinical researchers
• epidemiology trainees
• journal reviewers

What this guide emphasizes (and what many papers still overlook):
🔹 Causal assumptions come first. If exchangeability fails, matching will not rescue the analysis.
🔹 Balance > p-values. Standardized mean differences, overlap, and diagnostics matter more than statistical significance.
🔹 Matching quality is observable. Poor overlap, extreme weights, and residual imbalance should be reported, not hidden.
🔹 PSM ≠ causal inference by default. Sensitivity analyses and transparent limitations are non-negotiable.

This is designed as a practical reference, something to review before submission, not after rejection. If you work with real-world data, registries, or retrospective cohorts, this should be immediately useful.

👉 If this adds value, consider sharing or reposting so it reaches others working with observational data.
👉 Thoughtful comments and critiques are welcome.

#CausalInference #Epidemiology #ClinicalResearch #Biostatistics #RealWorldData #HealthDataScience #MedicalResearch #ResearchMethods #Statistics #EvidenceBasedMedicine
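The "balance > p-values" point can be made concrete. The standard diagnostic is the standardized mean difference (SMD) per covariate, with |SMD| > 0.1 a common rule-of-thumb flag for residual imbalance. A minimal sketch with made-up patient ages:

```python
# Post-matching balance check via standardized mean differences, not p-values.
from math import sqrt
from statistics import mean, stdev

def smd(treated, control):
    """Standardized mean difference using the averaged (pooled) standard deviation."""
    pooled_sd = sqrt((stdev(treated) ** 2 + stdev(control) ** 2) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

# Age in matched treated vs. control patients (hypothetical values)
age_treated = [54, 61, 58, 66, 59, 63]
age_control = [53, 60, 59, 64, 58, 62]

balance = smd(age_treated, age_control)
imbalanced = abs(balance) > 0.1  # flag residual imbalance, and report it either way
```

Here the SMD is about 0.21, so even though a t-test on samples this small would show nothing "significant," the covariate is imbalanced and should be reported as such, which is exactly why balance diagnostics beat significance tests for this job.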
-
unpopular opinion: If your goal is to reason about data, do not use explainable AI #xAI. Simply don't. xAI is fascinating, but it is tailored towards understanding the structure of a trained 𝘮𝘰𝘥𝘦𝘭. By contrast, disentangling the underlying relationships between features and the outcome 𝘪𝘯 𝘵𝘩𝘦 𝘳𝘦𝘢𝘭 𝘸𝘰𝘳𝘭𝘥 is the domain of causal inference from the #statistics literature. While perhaps less sexy, there are nowadays many ways to incorporate #machinelearning into causal estimation, some of which free you from setting up an explicit causal model. Here are my 📢 𝐭𝐨𝐩 𝐭𝐡𝐫𝐞𝐞 𝐫𝐞𝐬𝐨𝐮𝐫𝐜𝐞𝐬 📢 for taking #causality to the next level: - causaLens: AI-powered decision-making with ready-to-use software - Martin Spindler and Martin Huber (I count these as one as they share the same first name): academic literature and #python implementations on causal AI and double machine learning - the book "The Effect: An Introduction to Research Design and Causality" by Nick Huntington-Klein, a very nice open-access introduction to the background of causal effect estimation.
-
When A/B tests aren’t possible, how can we still measure causal impact? Synthetic control models are one of the most powerful answers, described as “arguably the most important innovation in policy evaluation in the last 15 years” by Nobel laureate Guido Imbens. But at the scale of large online platforms like Spotify, a key challenge emerges: how do we choose valid donor controls when the pool of potential donors is massive and domain knowledge alone isn’t enough? In our recent paper in the Journal of Causal Inference, the Causal Machine Learning lab at Spotify R&D introduced a data-driven approach to donor selection that relaxes the need for perfect domain knowledge and provides formal bias bounds to help quantify uncertainty. Check out the latest blog post on the Spotify research webpage for an overview of our work: 👉 https://lnkd.in/em9at25K Joint work with Michael O'Riordan #causal #research #causalinference #machinelearning
-
𝗚𝗲𝘁𝘁𝗶𝗻𝗴 (𝗺𝗼𝗿𝗲) 𝗰𝗮𝘂𝘀𝗮𝗹 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀 𝗳𝗿𝗼𝗺 𝗲𝗺𝗽𝗹𝗼𝘆𝗲𝗲 𝘀𝘂𝗿𝘃𝗲𝘆 𝗱𝗮𝘁𝗮 (𝘄𝗶𝘁𝗵𝗼𝘂𝘁 𝗮𝗻 𝗥𝗖𝗧)

🤔 While I was recently in the middle of processing our annual employee survey data, I started thinking about how to make the analysis results even more actionable and impactful.

👨✈️ The usual “driver analysis” (regress the outcome on multiple drivers plus controls, or use relative weights analysis) is useful but purely correlational: it doesn’t support the directional claims that inform interventions.

🔄 A stronger step in this direction is 𝘤𝘳𝘰𝘴𝘴-𝘭𝘢𝘨𝘨𝘦𝘥 𝘱𝘢𝘯𝘦𝘭 𝘮𝘰𝘥𝘦𝘭𝘪𝘯𝘨 (CLPM), a method that uses repeated survey data to estimate directional (not definitive causal) effects between drivers and outcomes like engagement or intention to stay.

Why CLPM gets you closer than simple regression:
▪️ Respects temporal precedence (T1 → T2).
▪️ Controls for autoregressive stability (baseline levels).
▪️ Allows within-wave residual correlations, so cross-lags aren’t inflated.
▪️ Can model reciprocal (bidirectional) relations common for constructs like engagement or satisfaction.

When to use it in practice:
▪️ You have at least two waves of employee survey data (annual or pulse).
▪️ Consistent construct measurement across waves.
▪️ A reasonably large sample (hundreds+) to support the model.
▪️ With 3+ waves, you can move to random-intercept models (RI-CLPM) to separate stable between-person differences from actual within-person change.

⚠️ Despite providing a stronger foundation for prioritizing interventions, CLPM still comes with some important caveats:
▪️ It strengthens causal inference, but it cannot magically turn observational data into an RCT, so unmeasured confounding can remain.
▪️ Attrition isn’t random: leavers are often less engaged, so simply analyzing the "survivors" who remain can create a heavily biased dataset and lead to misleading, overly optimistic conclusions. It's crucial to use proper statistical methods (like FIML when MAR is plausible, or sensitivity checks if MNAR is likely) to handle this.
❓ Has anyone here experimented with CLPM or similar longitudinal methods (e.g., fixed effects panel models) to better understand engagement drivers? What insights - or pitfalls - did you run into? P.S. If interested, you can check out an example of 2-Wave CLPM in action with dummy data on my blog here: #PeopleAnalytics #EmployeeExperience #SurveyData #CausalInference #Stats
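For intuition, here is one half of a CLPM stripped to its core: the T2 outcome regressed on its own T1 level (the autoregressive path) plus the T1 driver (the cross-lagged path). A real CLPM estimates both directions jointly with correlated residuals in an SEM package (e.g., lavaan or semopy); this stdlib sketch only recovers a known cross-lag from simulated two-wave data, and all variable names are hypothetical:

```python
# Simplified cross-lagged regression on simulated two-wave survey data.
import random

random.seed(42)
n = 2000
eng_t1 = [random.gauss(0, 1) for _ in range(n)]  # engagement at wave 1
drv_t1 = [random.gauss(0, 1) for _ in range(n)]  # driver (e.g., manager support) at wave 1
# Simulated truth: engagement_T2 = 0.5*engagement_T1 + 0.3*driver_T1 + noise
eng_t2 = [0.5 * e + 0.3 * d + random.gauss(0, 0.5) for e, d in zip(eng_t1, drv_t1)]

def ols(X, y):
    """Least squares via the normal equations (Gaussian elimination, no pivoting)."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for i in range(k):                      # forward elimination
        for j in range(i + 1, k):
            f = A[j][i] / A[i][i]
            A[j] = [a - f * c for a, c in zip(A[j], A[i])]
            b[j] -= f * b[i]
    beta = [0.0] * k                        # back substitution
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

X = [[1.0, e, d] for e, d in zip(eng_t1, drv_t1)]
intercept, autoregressive, cross_lag = ols(X, eng_t2)
# cross_lag should land near the simulated 0.3, net of baseline engagement
```

The key CLPM idea survives even in this reduction: the driver's cross-lagged coefficient is estimated while controlling for the outcome's own baseline, which is what licenses the (cautious) directional reading.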
-
“What does a p-value actually mean?”

Most people learn that if p < 0.05, a result is “significant.” But what does that really mean? Let’s forget the math for a second.

Imagine you’re testing a claim: Someone tells you they built a coin that always lands on heads. You’re skeptical. You flip the coin ten times and get eight heads. Now you wonder, “If this were a normal, fair coin, how likely would it be to get eight or more heads just by luck?”

That number, that probability, is your p-value.

If that chance is small (here, about 5.5 percent), then your result would be surprising if the coin were truly fair. So you start to suspect the coin might actually be biased.

That’s it. A p-value tells you how surprising your result would be if nothing unusual were going on. It does not tell you that your hypothesis is true. It does not mean there’s a 94.5 percent chance you’re right. It only says, “If the world were normal, this result would be weird.”

So a small p-value means your data doesn’t fit well with “nothing’s happening.” A large p-value means your data looks ordinary, so you can’t rule out that it’s just chance.

In short:
👉 The smaller the p-value, the stronger the evidence against “nothing’s happening.”
👉 But it never proves you’re right; it just shows your result is unlikely under the “no effect” world.

That’s all a p-value is. A way of asking: “Would this result be rare if nothing was actually different?”

---
𝘐𝘮𝘢𝘨𝘦 𝘚𝘳𝘤: 𝘮𝘺𝘴𝘵𝘶𝘥𝘪𝘦𝘴 [.] 𝘤𝘰𝘮
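The coin example can be computed exactly rather than guessed at: the p-value is the binomial tail probability of eight or more heads in ten fair flips.

```python
# Exact p-value for the coin example: P(8 or more heads in 10 fair flips).
from math import comb

def p_at_least(heads, flips):
    """Sum the binomial tail: C(flips, k) for k = heads..flips, over 2**flips outcomes."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips

p = p_at_least(8, 10)  # (45 + 10 + 1) / 1024, about 0.0547
```

Note the punchline: eight heads out of ten gives p ≈ 0.055, just above the 0.05 line, so the very example people use to explain "significance" would itself be declared "not significant" under the threshold ritual.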