The Winner's Curse Overestimates Impact
Statistically significant lifts are not unbiased estimates of impact. If you forecast and credit impact from the raw estimates of winning tests, you will systematically over-forecast and over-credit changes.
Why winners are biased
The winner's curse is simple. Any time you apply a threshold to decide what counts as a win, the average observed effect among the wins overstates the true effect. In A/B testing the most common threshold is p < 0.05, but the same logic holds for Bayesian thresholds, minimum uplift rules, and any dashboard that only highlights green numbers.
This matters because most organisations treat the observed lift from the winning test as the expected lift after launch. That turns statistical noise into a budgeting and roadmap problem. Your forecasted revenue uplift is too high. Your per-team impact metrics drift upward. Your expected ROI is inflated.
If you want trustworthy decision making, treat an A/B test winner as a directional signal, not as a precise estimate, and apply a conservative planning adjustment that reflects how much selection bias you have introduced.
A conservative adjustment is not pessimism
It is an adjustment for selection bias. You are conditioning on a result being “good enough” to be noticed. A conservative adjustment turns an enthusiastic estimate into a more realistic planning number.
Power determines inflation
Low power means most true effects will not clear the significance threshold. The winners you do see are disproportionately the ones helped by sampling noise, so the bias is larger.
The mechanics: conditioning on significance
Imagine the true lift is +2%. You run an experiment and compute an estimate Δ̂ with standard error SE. If the estimator is approximately normal, then Δ̂ is distributed around the true effect Δ.
Now apply a selection rule: you only call it a win if the test is significant. In a two-sided z-test that means |Δ̂ / SE| > z_{α/2}. This condition selects values in the tails of the distribution, where the magnitude is larger.
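A minimal simulation makes the selection effect concrete. It assumes the +2% true lift from above, an illustrative 1% standard error, and a two-sided z-test at α = 0.05; only the estimates that clear the threshold are averaged as "winners".

```python
# Minimal simulation of the selection rule above: an unbiased estimator of a
# +2% true lift, reported only when it clears a two-sided z-test at alpha = 0.05.
# The 1% standard error is an assumed illustrative value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_lift, se, alpha = 0.02, 0.01, 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

estimates = rng.normal(true_lift, se, size=1_000_000)   # many hypothetical replications of the test
winners = estimates[np.abs(estimates / se) > z_crit]    # the runs you would call significant wins

print(f"mean estimate over all runs: {estimates.mean():+.4f}")   # ~ +0.0200, unbiased unconditionally
print(f"mean estimate among winners: {winners.mean():+.4f}")     # noticeably above +0.0200
```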
Gelman and Carlin popularised this as a Type M (magnitude) error problem. The question is not only whether you get the sign right, but how wrong your magnitude is when you do declare significance.
Lu, Qiu, and Deng provide a concise treatment of how Type M inflation grows as power falls. It is a useful reference when you are forced to run low-traffic experiments and still need to communicate uncertainty clearly.
A useful mental model
The p-value controls the rate of false positives among all tests you run. It does not control how accurate the point estimate is among winners. Conditioning on passing a threshold changes the expected value of the estimate, even when the estimator is unbiased unconditionally.
How much do winners exaggerate?
The exaggeration factor depends strongly on power. When power is high, most true effects that exist will clear the threshold, so the winners include many typical realisations. When power is low, wins are rare and heavily skewed toward lucky draws.
The numbers below are illustrative values that match the common intuition from Gelman and Carlin's analysis. Treat them as a planning baseline, not as a law of nature. Your exact adjustment depends on your test design, your stopping rules, and how many metrics and variants you look at.
Exaggeration among significant winners vs planned power
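Rather than relying on a fixed table, you can compute this ratio directly from the normal model above. The sketch below assumes a two-sided z-test at α = 0.05, expresses the true effect in standard-error units, and backs it out from the planned power while ignoring the negligible opposite-tail term.

```python
# Sketch: Type M (exaggeration) ratio as a function of planned power for a
# two-sided z-test, following the logic in Gelman & Carlin (2014). Assumes a
# normal estimator; the true effect is in standard-error units.
from scipy import stats

def exaggeration_ratio(power, alpha=0.05):
    """Average |estimate| among significant results, divided by the true effect."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    mu = z_a + stats.norm.ppf(power)       # true effect (in SE units) that gives this power
    a, b = z_a - mu, -z_a - mu             # standardized cutoffs for the two significant tails
    p_sig = stats.norm.sf(a) + stats.norm.cdf(b)
    upper = mu * stats.norm.sf(a) + stats.norm.pdf(a)      # E[estimate; upper tail]
    lower = stats.norm.pdf(b) - mu * stats.norm.cdf(b)     # E[-estimate; lower tail]
    return (upper + lower) / (p_sig * mu)

for p in (0.10, 0.30, 0.50, 0.80):
    print(f"planned power {p:.0%}: winners overstate the true effect ~{exaggeration_ratio(p):.2f}x")
```

Run as written, the ratio comes out near 1.12–1.13 at 80% power, which is where the rule of thumb below comes from, and climbs past 3x at 10% power.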
A practical adjustment for reporting A/B test winners
The easiest way to stop over-forecasting is to separate celebration from planning. Teams can still celebrate a win, but they also report a conservative planning lift used for forecasting, prioritisation, and impact accounting.
For well-powered tests, the adjustment is modest. For underpowered tests it can be huge. The key is to base the adjustment on the planned power for the estimand that matters, not on the overall traffic split.
Practical guidance
Rule of thumb
- Well powered (80%): plan with a minimum 13% downward adjustment to the point estimate.
- Many metrics or variants: default to 20% or more unless you have strong controls.
- Triggered experiments: compute power on the triggered population and interpret wins based on that.
Concrete example
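As a sketch of the convention: suppose a test planned at 80% power reports a significant +4.0% lift. You celebrate the +4.0%, but the number carried into forecasts applies the haircut from the rule of thumb above. The helper and the specific haircuts below are illustrative assumptions, not fixed constants.

```python
# Sketch of the reporting convention: celebrate the observed lift, forecast with
# a conservative planning lift. The haircut values follow the rule of thumb
# above; the 30% figure for underpowered tests is an assumed illustrative default.
def planning_lift(observed_lift, planned_power, many_comparisons=False):
    """Conservative lift to use for forecasting, given a significant winner."""
    if many_comparisons:
        haircut = 0.20          # many metrics or variants: default to 20% or more
    elif planned_power >= 0.80:
        haircut = 0.13          # well powered: at least 13% off the point estimate
    else:
        haircut = 0.30          # underpowered: assumed default, be much more conservative
    return observed_lift * (1 - haircut)

# A test planned at 80% power reports a significant +4.0% lift:
print(f"celebrate +4.0%, plan with {planning_lift(0.04, 0.80):+.2%}")   # ~ +3.48%
```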
Early stopping warning
If you peek and stop early when things look good, you can inflate effect size even when Type I error is controlled. The overestimation problem still applies, and in aggressive early stopping regimes it may be larger.
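A rough way to see this is to simulate naive peeking: check a fixed-horizon z-test at several interim looks and stop at the first significant one. The sample size, look schedule, true lift, and noise level below are assumed purely for illustration.

```python
# Sketch of naive peeking: a fixed-horizon z-test checked at several interim
# looks, stopping at the first significant one. All parameters are assumed.
import numpy as np

rng = np.random.default_rng(7)
true_lift, sd, n_max = 0.02, 0.2, 400      # per-user differences; SE at the full sample is 1%
looks = (100, 200, 300, 400)
reps = 20_000

data = rng.normal(true_lift, sd, size=(reps, n_max))
reported = []
for row in data:
    for n in looks:
        estimate = row[:n].mean()
        if abs(estimate) / (sd / np.sqrt(n)) > 1.96:   # significant at this look: stop and report
            reported.append(estimate)
            break

print(f"runs reported as winners: {len(reported) / reps:.0%}")
print(f"true lift +2.00%, mean reported winning lift: {np.mean(reported):+.2%}")
```

Early looks have larger standard errors, so a run can only stop early with an unusually large estimate, which pushes the average reported lift even further above the truth than the fixed-horizon winner's curse alone.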
What makes overestimation worse in production
The power-based exaggeration is the minimum inflation you should expect. Real experimentation programmes usually add extra inflation through multiple comparisons and flexible decision making. Even if your primary analysis is correct, the process around it can quietly create more selection bias.
More shots on goal
- A/B/n tests, many success metrics, and many segments increase the chance that something looks significant by luck.
- When you run A/B/C/D and only ship the best-looking variant, the reported uplift of the winner is biased upward even if you correct p-values, as the sketch after this list illustrates.
- If you browse dozens of cuts in a dashboard, you are implicitly doing multiple hypothesis testing, even if you never write it down.
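Here is a minimal sketch of the best-of-K selection effect from the second bullet: every variant has the same true +1% lift, yet the variant you ship reports more, before any significance filter is even applied. The standard error and number of variants are assumed values.

```python
# Sketch of best-of-K selection: three treatment variants with the same true
# +1% lift. Shipping the best-looking one inflates the reported lift even
# before any significance filtering. The standard error is an assumed value.
import numpy as np

rng = np.random.default_rng(11)
true_lift, se, n_variants, reps = 0.01, 0.005, 3, 200_000

estimates = rng.normal(true_lift, se, size=(reps, n_variants))   # one estimate per variant
shipped = estimates.max(axis=1)                                  # lift reported for the shipped variant

print("true lift of every variant:        +1.00%")
print(f"mean lift of the shipped variant:  {shipped.mean():+.2%}")   # ~ +1.4%
```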
Flexible decisions
- If you decide sample size on the fly, change the primary metric mid-test, or choose the best time window after looking, you create extra winner's curse.
- Early stopping can add inflation on top of power-based exaggeration, especially when you stop on a high point in a noisy metric.
- This is why a simple 20% default adjustment is often a more honest forecasting convention than using raw winners.
Triggered tests: measure the right power
If only 30% of users are eligible for the treatment, your effective sample size is 30% of what the dashboard suggests. Your interpretation should be based on the triggered population and on the metric definition for that population. This is also why teams should avoid celebrating wins from tiny triggered segments without replication.
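As a sketch, here is what computing power on the triggered population looks like with a normal approximation for a difference in proportions. The 30% trigger rate, 5% baseline conversion, 3% relative lift, and sample sizes are assumed, and the per-user effect is held fixed so the comparison isolates the sample-size dilution.

```python
# Sketch: power computed on the triggered population instead of everyone
# assigned. All rates, lifts, and sample sizes are assumed for illustration.
from scipy import stats

def power_two_proportions(n_per_arm, p_control, relative_lift, alpha=0.05):
    """Approximate two-sided power for a difference in proportions (normal approximation)."""
    p_treat = p_control * (1 + relative_lift)
    se = (p_control * (1 - p_control) / n_per_arm
          + p_treat * (1 - p_treat) / n_per_arm) ** 0.5
    z = abs(p_treat - p_control) / se
    z_a = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.sf(z_a - z) + stats.norm.cdf(-z_a - z)

assigned_per_arm, trigger_rate = 200_000, 0.30
triggered_per_arm = int(assigned_per_arm * trigger_rate)   # where the signal actually lives

print(f"power if you use the dashboard sample size: {power_two_proportions(assigned_per_arm, 0.05, 0.03):.0%}")
print(f"power on the triggered population:          {power_two_proportions(triggered_per_arm, 0.05, 0.03):.0%}")
```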
The p-value does not answer the question stakeholders ask
In organisations, a p-value often gets interpreted as “the probability the change is real”. That is not what it is. A p-value is the probability of seeing data at least this extreme if the null were true, P(data | null), not P(null | data). The second quantity depends on a prior belief about how often changes truly move the metric.
When the prior probability of large real effects is low, significant winners are more likely to be the product of a small true effect plus noise, or of analysis flexibility that effectively increases the number of chances you give yourself. That pushes you toward larger planning adjustments and more replication for surprising wins.
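The sketch below makes the prior explicit: it combines an assumed share of truly effective changes with 80% power and α = 0.05, and reports how often a significant result reflects a real effect. All three inputs are illustrative.

```python
# Sketch: share of significant winners that reflect a real effect, given a prior
# on how many tested changes truly move the metric. Prior values, 80% power,
# and alpha = 0.05 are assumed for illustration.
def prob_real_given_significant(prior_real, power=0.80, alpha=0.05):
    true_positives = prior_real * power           # real effects that reach significance
    false_positives = (1 - prior_real) * alpha    # null effects that reach it by chance
    return true_positives / (true_positives + false_positives)

for prior in (0.50, 0.20, 0.05):
    print(f"prior {prior:.0%} of changes real -> "
          f"P(real | significant) ~ {prob_real_given_significant(prior):.0%}")
```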
This is not unique to product experiments. Across research fields, selecting results that cross thresholds inflates reported effects. Ioannidis showed that discovered associations are often exaggerated, especially when studies are underpowered and many analyses are tried.
If you want a stakeholder-friendly summary, use two numbers. First, the statistical result: the p-value and confidence interval. Second, a planning estimate: a conservative adjusted lift used for forecasting and prioritisation. This is closely related to practical significance; if you have not read it yet, start with the guide to practical significance in A/B testing.
What to report by default
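A minimal sketch of such a default report, with the statistical readout and the planning number kept as separate fields. The field names, example values, and the 13% haircut applied here are illustrative assumptions.

```python
# Sketch of a default result summary that keeps the statistical readout and the
# planning number separate. Field names, example values, and the 13% haircut
# are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ExperimentReport:
    observed_lift: float     # raw point estimate from the winning test
    ci_low: float            # confidence interval reported alongside the p-value
    ci_high: float
    p_value: float
    planned_power: float     # power for the estimand that matters (e.g. triggered users)
    planning_lift: float     # conservative number used for forecasts and impact accounting

report = ExperimentReport(
    observed_lift=0.040, ci_low=0.005, ci_high=0.075,
    p_value=0.02, planned_power=0.80,
    planning_lift=0.040 * (1 - 0.13),   # rule-of-thumb haircut from above
)
print(f"celebrate {report.observed_lift:+.1%}, forecast with {report.planning_lift:+.1%}")
```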
How to implement better reporting without slowing down
You do not need perfect Bayesian modelling to reduce the winner's curse. You need conventions that keep incentives aligned. The point is to keep enthusiasm and forecasting separate.
A simple policy
- Plan power and sample size up front.
- Ship based on the statistical decision rule.
- Forecast and credit based on a conservative planning lift.
- Replicate when the result is surprising or high stakes.
Use your tools
- Use the MDE calculator to sanity check whether your team is repeatedly running underpowered tests.
- Use the duration calculator to avoid pressure to stop early because the test is taking too long.
- If you are routinely celebrating tiny wins, revisit your Minimum Effect of Interest and practical significance thresholds.
References
- Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science.
- Button, K. S., et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience.
- Kohavi, R., Deng, A., & Vermeer, L. (2022). A/B Testing Intuition Busters: Common Misunderstandings in Online Controlled Experiments. KDD 2022.
- Ioannidis, J. P. A. (2008). Why most discovered true associations are inflated. Epidemiology.
- Lu, J., Qiu, J., & Deng, A. (2019). A note on Type S and Type M errors in hypothesis testing. British Journal of Mathematical and Statistical Psychology.