Understanding Practical Significance in A/B Testing
Statistical significance tells you if a result is real.
Practical significance tells you if it matters.
Learn how to tell the difference.
Every experimenter has been there: the test reaches statistical significance, the p-value is well below 0.05, and the team celebrates a "winner." But when the change ships to 100% of traffic, the expected revenue bump never materialises. What went wrong?
The answer often lies in a concept that is widely misunderstood and frequently ignored: practical significance. A result can be statistically significant, meaning the observed difference is unlikely due to chance, while being too small to matter in any meaningful business context.
This distinction is not academic. It is the difference between shipping changes that move the needle and wasting engineering cycles on noise dressed up as signal. Understanding practical significance is essential for anyone making decisions based on experiment data.
What Is Practical Significance?
Statistical significance and practical significance answer fundamentally different questions. A result can be one without the other, and the ideal outcome is a result that is both: a real effect that is large enough to matter.
Statistical Significance
"Is this result real, or could it have happened by chance?"
- Measured by the p-value
- Conventional threshold: p < 0.05
- Influenced by sample size
- Says nothing about the size of the effect
Practical Significance
"Does this result matter in the real world?"
- Measured by effect size
- Threshold defined by business context
- Independent of sample size
- Tells you the magnitude of the effect
Both are required for sound decisions
Statistical significance tells you whether an effect exists. Practical significance tells you whether it is worth acting on. Shipping a change that is only statistically significant is a common and costly mistake.
The Large Sample Trap
This is arguably the most important pitfall in experimentation at scale. With large enough sample sizes, virtually any non-zero difference will achieve statistical significance. This creates a dangerous illusion: teams celebrate p-values under 0.05 while ignoring that the actual improvement is negligible.
The Aspirin Study
The Physicians' Health Study enrolled 22,000+ subjects and found aspirin reduced heart attacks with p < 0.00001. But the risk difference was just 0.77%. This led to widespread recommendations despite minimal actual benefit for many patients.
The 0.1% Conversion "Win"
An A/B test with five million users per arm returns p < 0.001, but the actual difference is 20.1% vs. 20.0% conversion rate. The team celebrates, but this 0.1 percentage point difference may not justify any engineering effort to ship (the sketch after the takeaway below works through the numbers).
The Diet Method Comparison
A study comparing two diet methods with 26,000 participants found mean weight loss of 10.6 kg vs. 10.5 kg. Statistically significant at p = 0.01, but Cohen's d = 0.015. A 0.1 kg difference between methods is meaningless.
The takeaway
If your platform serves millions of users, you can detect differences so small that shipping them would cost more than the benefit they provide. Always pair p-values with effect size assessment.
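To make the trap concrete, here is a minimal Python sketch (assuming scipy is available) that reruns the conversion example above. The p-value comes out far below 0.001, while the effect sizes printed next to it remain a 0.1 percentage point difference and a 0.5% relative lift.

```python
import math
from scipy.stats import norm

# Numbers from the conversion example above: five million users per arm,
# 20.0% vs. 20.1% conversion rate.
n_control = n_variant = 5_000_000
p_control, p_variant = 0.200, 0.201

# Two-proportion z-test with a pooled standard error.
p_pooled = (p_control * n_control + p_variant * n_variant) / (n_control + n_variant)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_variant))
z = (p_variant - p_control) / se
p_value = 2 * norm.sf(abs(z))

absolute_diff = p_variant - p_control      # 0.1 percentage points
relative_lift = absolute_diff / p_control  # 0.5% relative lift

print(f"p-value:       {p_value:.6f}")     # well below 0.001
print(f"absolute diff: {absolute_diff:.3%}")
print(f"relative lift: {relative_lift:.2%}")
```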
Effect Size: The Bridge Between Statistical and Practical Significance
Effect size quantifies the actual magnitude of the difference between groups. Unlike p-values, effect sizes are independent of sample size. As Sullivan and Feinn (2012) stated: "While a p-value can inform the reader whether an effect exists, the p-value will not reveal the size of the effect."
Relative Lift
The most common measure in A/B testing. Control at 5.0%, variant at 5.5% = 10% relative lift.
Absolute Difference
The raw difference between groups. In the example above, that is 0.5 percentage points.
Cohen's d
Standardised effect size. Benchmarks: 0.2 small, 0.5 medium, 0.8 large.
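For illustration, here is a minimal sketch of how these three measures could be computed. The function names are illustrative, not a standard library API.

```python
import numpy as np

def relative_lift(rate_control: float, rate_variant: float) -> float:
    """Relative lift of the variant over the control, e.g. 0.10 for +10%."""
    return (rate_variant - rate_control) / rate_control

def absolute_difference(rate_control: float, rate_variant: float) -> float:
    """Raw difference between the two rates, in the same units as the rates."""
    return rate_variant - rate_control

def cohens_d(sample_a: np.ndarray, sample_b: np.ndarray) -> float:
    """Standardised mean difference using the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * sample_a.var(ddof=1)
                  + (n_b - 1) * sample_b.var(ddof=1)) / (n_a + n_b - 2)
    return (sample_b.mean() - sample_a.mean()) / np.sqrt(pooled_var)

# The 5.0% vs. 5.5% example from the text:
print(relative_lift(0.050, 0.055))        # ~0.10, a 10% relative lift
print(absolute_difference(0.050, 0.055))  # 0.005, i.e. 0.5 percentage points
```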
P-Value vs. Effect Size at a Glance
| | P-Value | Effect Size |
|---|---|---|
| Tells you if an effect exists | Yes | No |
| Tells you the magnitude of the effect | No | Yes |
| Independent of sample size | No | Yes |
| Helps assess practical significance | No | Yes |
Reporting effect sizes alongside p-values is not new advice. The American Psychological Association's Task Force on Statistical Inference recommended reporting effect sizes back in 1999. The A/B testing world is still catching up.
Minimum Effect of Interest: Defining What Matters Before the Test
The Minimum Effect of Interest (MEI) is the smallest true effect size that would be worth detecting and acting on. It is a business input, not a statistical one. Defining it requires asking: "What is the smallest improvement that justifies the cost of running this test and implementing the change?"
Minimum Effect of Interest
The smallest effect that justifies the cost of implementation. Defined by business context, revenue impact, and opportunity cost.
Minimum Detectable Effect
The effect size at which a test achieves target power (typically 80%). An output of test design, not an input.
Ideally, the MDE should be aligned with the MEI. If your MDE is smaller than your MEI, the test is over-powered: you spend extra traffic and risk flagging effects too small to act on. If your MDE is larger than your MEI, the test is under-powered: you risk missing effects that would be valuable.
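One practical way to align the two is to plug the MEI into a sample size calculation and check whether the required traffic is realistic. The sketch below uses the standard normal-approximation formula for comparing two proportions and assumes scipy is available.

```python
import math
from scipy.stats import norm

def sample_size_per_arm(baseline: float, mde_relative: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per arm for a two-sided test on two proportions,
    using the standard normal-approximation formula."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

# A test powered for a 2% relative lift needs far more traffic than one
# powered only for a 10% lift on the same 5% baseline.
print(sample_size_per_arm(0.05, 0.02))  # hundreds of thousands per arm
print(sample_size_per_arm(0.05, 0.10))  # tens of thousands per arm
```

If the number it returns is more traffic than you can realistically collect, the honest options are to raise the MEI, choose a more sensitive metric, or not run the test at all.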
How to Define the MEI
Start with business impact
What is the smallest improvement that would make the test worth running?
Consider implementation costs
A 1.5% lift might not be worth it if the ROI break-even is 18 months.
Factor in opportunity cost
If shipping this delays a higher-impact initiative, the bar is higher.
Account for scale
A 1% lift at high volume might be worth millions; at low volume, the same lift may not justify months of testing for modest returns.
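As a deliberately simplified illustration of the cost-benefit thinking above, the sketch below derives an MEI from a break-even calculation. Every input value and the payback rule are hypothetical assumptions, not recommendations.

```python
def minimum_effect_of_interest(baseline_rate: float, annual_visitors: int,
                               value_per_conversion: float,
                               implementation_cost: float,
                               payback_months: int = 12) -> float:
    """Smallest relative lift whose extra revenue pays back the build cost
    within `payback_months`. Deliberately simplified: no discounting,
    no ongoing maintenance, no opportunity cost."""
    monthly_conversions = baseline_rate * annual_visitors / 12
    # Extra revenue per month generated by a 1% relative lift.
    monthly_value_per_1pct_lift = monthly_conversions * 0.01 * value_per_conversion
    required_lift_in_pct = implementation_cost / (monthly_value_per_1pct_lift * payback_months)
    return required_lift_in_pct / 100  # e.g. 0.006 means a 0.6% relative lift

# Hypothetical inputs: 5% baseline conversion, 2.4M annual visitors,
# $40 per conversion, $30,000 to build, 12-month payback target.
print(f"{minimum_effect_of_interest(0.05, 2_400_000, 40.0, 30_000):.2%}")
```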
Avoid the "Anything Above Zero" Trap
When first introduced to MEI, many practitioners say "anything above zero is exciting." This leads to tests that run indefinitely and detect differences with no business value. The MEI must be grounded in a realistic cost-benefit analysis.
Using Confidence Intervals to Assess Practical Significance
Confidence intervals are more informative than p-values for assessing practical significance because they show the plausible range of values for the true effect, not just whether the effect is non-zero.
Reading Confidence Intervals Against Your MEI
This approach gives you far more nuanced decision-making than a binary "significant vs. not significant" verdict. A 95% confidence interval of [+0.1%, +0.3%] tells you the effect is real (it does not contain zero) but also tells you the effect is small (at most 0.3%). A p-value alone would only tell you it is "significant."
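As a rough sketch of this reading, the code below builds a Wald interval for the difference in conversion rates and compares it against a hypothetical 0.5 percentage point MEI; the counts are made up for illustration.

```python
import math
from scipy.stats import norm

def diff_confidence_interval(conv_control: int, n_control: int,
                             conv_variant: int, n_variant: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Wald confidence interval for the difference in conversion rates."""
    p_c, p_v = conv_control / n_control, conv_variant / n_variant
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_v * (1 - p_v) / n_variant)
    z = norm.ppf(1 - (1 - confidence) / 2)
    diff = p_v - p_c
    return diff - z * se, diff + z * se

mei = 0.005  # 0.5 percentage points, defined before the test
low, high = diff_confidence_interval(10_000, 200_000, 10_450, 200_000)

if low >= mei:
    print("Statistically and practically significant: ship it.")
elif high < mei:
    print("Even the optimistic end of the interval is below the MEI: do not ship.")
else:
    print("Inconclusive against the MEI: gather more data or make a judgement call.")
```

The Wald interval is the simplest choice here; a production analysis tool would typically use a more robust interval, but the reading against the MEI is the same.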
A Decision Framework: The Four Quadrants
Every A/B test result falls into one of four quadrants based on whether it is statistically significant and practically significant. Understanding which quadrant your result occupies determines the right course of action.
Statistically Significant, Practically Significant
Ship it
The effect is real and large enough to matter. This is the ideal outcome.
Not Statistically Significant, Practically Significant
Increase sample size
The observed effect is large enough to matter, but the test lacked the power to confirm it.
Statistically Significant, Not Practically Significant
Do not ship
The effect is real but too small to justify action. The costs likely outweigh the benefits.
Not Statistically Significant, Not Practically Significant
No action needed
No meaningful effect detected. Move on to the next experiment.
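A minimal sketch of how the four quadrants could be encoded as a decision rule; the threshold logic is simplified and the messages are illustrative.

```python
def quadrant_decision(p_value: float, observed_effect: float, mei: float,
                      alpha: float = 0.05) -> str:
    """Map a test result onto the four quadrants above.

    `observed_effect` and `mei` must be on the same scale (e.g. both
    relative lifts). Checking the full confidence interval against the
    MEI, as described earlier, is the more careful version of this."""
    statistically_significant = p_value < alpha
    practically_significant = abs(observed_effect) >= mei

    if statistically_significant and practically_significant:
        return "Ship it: the effect is real and large enough to matter."
    if practically_significant:
        return "Increase sample size: promising effect, but the test lacked power."
    if statistically_significant:
        return "Do not ship: real effect, but too small to justify the cost."
    return "No action needed: no meaningful effect detected."

# A real but tiny effect lands in the "do not ship" quadrant:
print(quadrant_decision(p_value=0.003, observed_effect=0.001, mei=0.01))
```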
Step-by-Step Assessment
Before the test
Define the MEI based on cost-benefit analysis. What is the smallest lift that justifies implementation costs?
Design the test
Use power analysis to ensure the test can detect the MEI with adequate power (80%+).
After the test
Look at the confidence interval, not just the p-value. Does it overlap with the MEI? Use the significance calculator to examine the full results.
Evaluate holistically
Consider implementation cost, opportunity cost, maintenance burden, user experience impact, and strategic alignment.
Common Pitfalls
Equating statistical significance with importance
A p-value under 0.05 means the effect is unlikely due to chance. It says nothing about whether the effect is worth acting on.
Ignoring effect size
Reporting only p-values without effect sizes or confidence intervals deprives decision-makers of the information they need.
Large samples creating false excitement
With millions of users, even a 0.01% difference can reach p < 0.001. Always ask: "Is this difference meaningful?"
Not defining MEI before the test
Without pre-defining what "matters," teams fall into post-hoc rationalisation of whatever result they get.
Peeking and premature stopping
Checking results early inflates false positive rates and can make trivially small effects appear significant.
Confusing MDE with MEI
The MDE is a property of test design. The MEI is a business input. They should be aligned, but they serve different purposes.
Multiple comparisons without correction
Testing several variants across many metrics creates dozens of comparisons. At a 5% false positive rate, spurious "winners" are expected; the simulation after this list shows the scale of the problem.
The "anything above zero" fallacy
Setting MEI at near-zero leads to tests that run indefinitely and detect differences with no business value.
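A small simulation makes the multiple comparisons pitfall concrete. It assumes no metric has any true effect, so every "winner" it finds is spurious by construction.

```python
import numpy as np

rng = np.random.default_rng(42)
n_experiments = 10_000   # simulated experiments with no true effect anywhere
n_metrics = 20           # metrics checked per experiment
alpha = 0.05

# Under the null hypothesis, p-values are uniform, so each comparison has
# a 5% chance of a false positive.
p_values = rng.uniform(size=(n_experiments, n_metrics))
at_least_one = (p_values < alpha).any(axis=1).mean()
with_bonferroni = (p_values < alpha / n_metrics).any(axis=1).mean()

print(f"Chance of at least one spurious 'winner': {at_least_one:.1%}")    # ~64%
print(f"Same check with a Bonferroni correction:  {with_bonferroni:.1%}")  # ~5%
```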
References
- Sullivan, G. M. & Feinn, R. (2012). Using Effect Size, or Why the P Value Is Not Enough
- Kirk, R. E. (1996). Practical Significance: A Concept Whose Time Has Come
- Georgiev, G. Statistical Power, MDE, and Designing Statistical Tests
- Kohavi, R., Tang, D. & Xu, Y. (2020). Trustworthy Online Controlled Experiments
- Penn State STAT 200: Practical Significance
Put Practical Significance Into Practice
Use our calculators to plan tests with the right MDE and analyse results with confidence intervals.