Significance and Power: Understanding Errors
Type I/II errors, statistical power, and how to design tests that detect real effects.
Balancing Risk in Testing
Every A/B test faces two types of risk: declaring a false winner (Type I error) or missing a real winner (Type II error). Understanding these errors, and how to control them, is essential for running reliable tests.
The good news: you can control both through smart test design.
The Two Types of Errors
The Decision Matrix
| Decision \ True State of the World | H₀ is True (No Real Effect) | H₁ is True (Real Effect Exists) |
|---|---|---|
| Reject H₀ (Declare Significant) | Type I Error: False Positive, α = 0.05 ("Crying Wolf") | Correct Decision: True Positive, Power = 1 - β ("Detected Real Effect") |
| Fail to Reject H₀ (Not Significant) | Correct Decision: True Negative, 1 - α = 0.95 ("Correctly No Effect") | Type II Error: False Negative, β (typically 0.20) ("Missed Real Effect") |
Type I Error (α)
False Positive: Declaring a winner when there's no real effect.
Example: Your test shows variant B increased conversions by 5%, but it was just random luck. You ship it, and conversions don't actually improve.
Control: Set α = 0.05 (5% false positive rate). This is your significance threshold.
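To see what α = 0.05 means in practice, here's a minimal simulation sketch (numpy + statsmodels; the 10% baseline rate and sample size are illustrative assumptions). It runs many A/A tests where both variants are identical and counts how often the z-test comes out "significant" anyway, which should happen about 5% of the time.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
n_tests = 2_000        # number of simulated A/A experiments
n_users = 5_000        # users per variant (hypothetical)
baseline = 0.10        # true conversion rate in BOTH variants (assumed)
alpha = 0.05

false_positives = 0
for _ in range(n_tests):
    conv_a = rng.binomial(n_users, baseline)  # conversions in variant A
    conv_b = rng.binomial(n_users, baseline)  # conversions in variant B (no real effect)
    _, p_value = proportions_ztest([conv_a, conv_b], [n_users, n_users])
    if p_value < alpha:
        false_positives += 1

# With no real effect, roughly alpha (5%) of tests still cross the threshold by chance.
print(f"False positive rate: {false_positives / n_tests:.3f}")
```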
Type II Error (β)
False Negative: Missing a real effect, declaring "no winner" when one exists.
Example: Variant B truly improves conversions by 3%, but your test doesn't reach significance. You ship nothing and miss the win.
Control: Increase sample size to boost power (1 - β). Targeting 80% power means accepting β = 0.20, a 20% chance of missing a real effect.
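The mirror image, sketched under assumed numbers (a 10% baseline and a 3% relative lift, neither of which the example above specifies): variant B genuinely converts better, and we count how often an undersized test still fails to reach significance.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
n_tests = 2_000
n_users = 5_000          # users per variant (hypothetical)
baseline = 0.10          # variant A conversion rate (assumed)
lift = 0.03              # variant B truly converts 3% better, relative (assumed)
alpha = 0.05

missed = 0
for _ in range(n_tests):
    conv_a = rng.binomial(n_users, baseline)
    conv_b = rng.binomial(n_users, baseline * (1 + lift))  # a real effect exists
    _, p_value = proportions_ztest([conv_a, conv_b], [n_users, n_users])
    if p_value >= alpha:
        missed += 1  # Type II error: a real effect was missed

# With these settings the test is badly underpowered, so it misses the effect most
# of the time; increase n_users and watch the power (1 - beta) climb.
print(f"Type II error rate (beta): {missed / n_tests:.3f}")
print(f"Power (1 - beta):          {1 - missed / n_tests:.3f}")
```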
Memorable Analogy: Pregnancy Test
Type I Error (False Positive)
Test says pregnant, but you're not. You start preparing for a baby that doesn't exist.
Type II Error (False Negative)
Test says not pregnant, but you are. You miss important preparations and health care.
Statistical Power: The Missing Metric
What Is Power?
Statistical power is the probability of detecting a real effect when it exists. It's the complement of the Type II error rate: Power = 1 - β.
Standard target: 80% power (β = 0.20). This means if there's a real effect, you have an 80% chance of detecting it.
Here's the critical insight: most teams obsess over significance (α) but ignore power (β). They set α = 0.05 to avoid false positives but don't check if their test has enough power to detect real effects.
Result? Tests that end "inconclusive" not because there's no effect, but because the test wasn't sensitive enough to detect it.
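The fix is to compute power before the test starts rather than guessing. A minimal sketch with statsmodels (the baseline rate, target rate, and sample size are assumptions for illustration):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10        # current conversion rate (assumed)
target = 0.11          # rate you hope to detect, i.e. baseline + MDE (assumed)
n_per_variant = 3_000  # planned sample size per variant (assumed)
alpha = 0.05

# Cohen's h: standardized effect size for comparing two proportions.
effect_size = proportion_effectsize(target, baseline)

# Probability of detecting this effect with the planned sample size.
power = NormalIndPower().power(effect_size=effect_size,
                               nobs1=n_per_variant,
                               alpha=alpha,
                               ratio=1.0,
                               alternative='two-sided')
print(f"Power = {power:.2f}  (target is usually 0.80)")
```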
What Affects Power?
Four factors control your test's power. Understanding these helps you design better experiments:
1. Sample Size (n) ⬆️ Power ⬆️
More users = more power: a larger sample shrinks the standard error, so the same effect stands out more clearly from the noise. Doubling your sample size can raise power substantially (though not linearly).
Example: A test with 1,000 users per variant might have only ~40% power to detect a 2% lift, while 5,000 users per variant pushes power to ~95% (the exact figures depend on your baseline conversion rate and whether the lift is absolute or relative).
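A sketch of that relationship, assuming a 5% baseline conversion rate and an absolute 2-point lift (assumptions chosen for illustration, since the example doesn't state a baseline):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                    # assumed baseline conversion rate
lifted = 0.07                      # assumed +2 percentage-point lift
effect_size = proportion_effectsize(lifted, baseline)
analysis = NormalIndPower()

for n in (500, 1_000, 2_500, 5_000, 10_000):
    power = analysis.power(effect_size=effect_size, nobs1=n, alpha=0.05)
    # Power climbs steeply with sample size for a fixed effect.
    print(f"n = {n:>6,} per variant -> power = {power:.2f}")
```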
2. Effect Size (Δ) ⬆️ Power ⬆️
Bigger effects are easier to detect. A 20% lift is easier to spot than a 2% lift.
Problem: You don't control effect size; your product changes do. But you can focus on high-impact tests.
3. Significance Level (α) ⬆️ Power ⬆️
Relaxing the significance threshold increases power. If you use α = 0.10 instead of 0.05, the bar for declaring significance is lower, so real effects are detected more often.
Trade-off: Higher α means more false positives. Most teams keep α = 0.05 as the standard.
4. Baseline Variance (σ) ⬇️ Power ⬆️
Lower variance = more power. If your metric has less noise, effects are easier to detect.
Solution: Use variance reduction techniques (stratification, CUPED) or choose less noisy metrics.
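As a rough illustration of the variance-reduction idea, here's a minimal CUPED-style sketch in numpy. The pre-experiment covariate and toy data are hypothetical; production CUPED implementations handle missing covariates and per-variant details more carefully.

```python
import numpy as np

def cuped_adjust(metric: np.ndarray, pre_metric: np.ndarray) -> np.ndarray:
    """Return a CUPED-adjusted metric with the same mean but lower variance.

    metric:     in-experiment measurement per user
    pre_metric: pre-experiment covariate per user (correlated with metric)
    """
    theta = np.cov(metric, pre_metric)[0, 1] / np.var(pre_metric, ddof=1)
    return metric - theta * (pre_metric - pre_metric.mean())

# Toy data: the pre-experiment covariate explains much of the noise.
rng = np.random.default_rng(0)
pre = rng.normal(100, 20, size=10_000)
post = pre * 0.8 + rng.normal(0, 10, size=10_000)

adjusted = cuped_adjust(post, pre)
print(f"Variance before CUPED: {post.var():.1f}")
print(f"Variance after CUPED:  {adjusted.var():.1f}")  # lower variance -> more power
```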
The α vs β Trade-Off
You Can't Minimize Both
There's an inherent trade-off: lowering α (fewer false positives) increases β (more false negatives) unless you increase sample size. This is why sample size calculators ask for both a significance level (typically α = 0.05) and a desired power (typically 80%).
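A quick sketch of that trade-off with the sample size held fixed (baseline, lift, and n below are illustrative assumptions): tightening α costs power, and only a larger sample buys both back.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.115, 0.10)  # assumed lift: 10% -> 11.5%
analysis = NormalIndPower()

for alpha in (0.01, 0.05, 0.10):
    power = analysis.power(effect_size=effect_size, nobs1=4_000, alpha=alpha)
    # Stricter alpha (fewer false positives) -> lower power (more false negatives).
    print(f"alpha = {alpha:.2f} -> power = {power:.2f}")
```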
Industry Standards
- α = 0.05 (5% false positive rate)
- Power = 0.80 (80% detection rate)
- β = 0.20 (20% false negative rate)
These are arbitrary conventions, but they're well-established and expected.
Adjust For Your Context
- High-risk changes: Lower α (0.01), higher power (0.90)
- Early exploration: Higher α (0.10), lower power (0.70)
- Standard tests: Stick with 0.05 / 0.80
Document your choices upfront in your test plan.
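For a feel of what those presets cost, here's a sketch comparing required sample sizes, assuming a 10% baseline conversion rate and a 1-point absolute MDE (both illustrative):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.11, 0.10)   # assumed MDE: 10% -> 11%
analysis = NormalIndPower()

presets = {
    "High-risk change":  {"alpha": 0.01, "power": 0.90},
    "Early exploration": {"alpha": 0.10, "power": 0.70},
    "Standard test":     {"alpha": 0.05, "power": 0.80},
}

for name, cfg in presets.items():
    n = analysis.solve_power(effect_size=effect_size, alpha=cfg["alpha"],
                             power=cfg["power"], ratio=1.0,
                             alternative='two-sided')
    print(f"{name:<18} alpha={cfg['alpha']:.2f}, power={cfg['power']:.2f} "
          f"-> {int(round(n)):,} users per variant")
```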
How to Use This Knowledge
Before Running a Test:
1. Run a power analysis (see the planning sketch after this list). Use a sample size calculator to determine how many users you need for 80% power.
2. Set your MDE (Minimum Detectable Effect). What's the smallest lift worth detecting? Don't test for a 0.5% lift if detecting it would take 10,000,000 users.
3. Check if you have enough traffic. If you can't hit 80% power in a reasonable timeframe, reconsider the test.
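A minimal planning sketch that rolls steps 1-3 together; the baseline, MDE, and daily traffic figures are hypothetical, not recommendations:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.08          # current conversion rate (assumed)
mde_abs = 0.008          # smallest absolute lift worth detecting (assumed)
daily_users = 6_000      # users entering the experiment per day (assumed)
n_variants = 2

effect_size = proportion_effectsize(baseline + mde_abs, baseline)
n_per_variant = NormalIndPower().solve_power(effect_size=effect_size,
                                             alpha=0.05, power=0.80)

total_users = n_per_variant * n_variants
days_needed = total_users / daily_users
print(f"Required: {int(round(n_per_variant)):,} users per variant")
print(f"Estimated runtime: {days_needed:.1f} days at {daily_users:,} users/day")
# If the runtime is unreasonable, raise the MDE or reconsider the test.
```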
After an Inconclusive Test:
1. Check your power (see the sensitivity sketch after this list). Did the test actually have 80% power for your pre-specified MDE? If not, you might have missed a real effect.
2. Consider running longer. If you're at 50% power, doubling the runtime might get you to roughly 80%.
3. Accept the reality. If the effect is smaller than your MDE, it might not be worth detecting anyway.
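One concrete way to run that check is a sensitivity calculation: given the sample you actually collected, what lift could you have detected at 80% power? Compare that against your planned MDE, not the effect you happened to observe. A sketch with hypothetical numbers:

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

n_collected = 2_200      # users per variant you actually got (hypothetical)
baseline = 0.10          # control conversion rate (hypothetical)

# Smallest standardized effect (Cohen's h) detectable at 80% power with this sample.
h = NormalIndPower().solve_power(effect_size=None, nobs1=n_collected,
                                 alpha=0.05, power=0.80)

# Convert Cohen's h back to an absolute conversion-rate lift from the baseline.
detectable_rate = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
print(f"Minimum detectable lift at 80% power: "
      f"{(detectable_rate - baseline) * 100:.2f} percentage points")
# If your MDE was smaller than this, the test was underpowered, not "negative".
```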
Key Takeaways
- ✓ Type I error (α): False positive, declaring a winner when there isn't one.
- ✓ Type II error (β): False negative, missing a real effect.
- ✓ Statistical power = 1 - β, the probability of detecting a real effect. Target 80%.
- ✓ Power increases with sample size, effect size, higher α, and lower metric variance.
- ✓ Always run a power analysis before starting a test to determine the required sample size.
- ✓ Inconclusive tests often mean low power, not necessarily no effect.