Common Pitfalls: Avoiding Test Failures
SRM, peeking problems, multiple comparisons, and other mistakes that invalidate results.
What Separates Experts from Beginners
You've learned the fundamentals, the statistics, and the methodology. Now let's cover the mistakes that invalidate even well-designed tests. Knowing these pitfalls is what separates experts from beginners.
These aren't theoretical concerns; they're real issues that plague production A/B tests every day. Recognizing and avoiding them will save you from wasted time and bad decisions.
The Big Three Pitfalls
Sample Ratio Mismatch (SRM)
What it is: When the actual split doesn't match the intended split. If you wanted 50/50 but got 52/48, something's wrong with your randomization.
Example:
Expected: 10,000 users per variant
Actual: Control 10,523 / Variant 9,477
SRM detected!
Why it matters:
It points to a bug in randomization or assignment: browser redirect issues, bot filtering, or other technical errors. Whatever the cause, it invalidates your test.
How to detect:
Run a chi-squared goodness-of-fit test on the observed counts against the intended split. A common convention is to flag SRM when the p-value is below 0.001.
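To make the check concrete, here is a minimal sketch using scipy.stats.chisquare with the counts from the example above; the 0.001 threshold follows the convention mentioned here.

```python
# Minimal SRM check: chi-squared goodness-of-fit test against the intended 50/50 split.
from scipy.stats import chisquare

observed = [10_523, 9_477]           # actual users per variant (from the example above)
expected = [sum(observed) / 2] * 2   # intended 50/50 allocation

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}")

# Conventional threshold: flag SRM when p < 0.001.
if p_value < 0.001:
    print("SRM detected: stop the test and investigate the randomization bug.")
```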
What to do:
Stop the test immediately. Investigate the bug, fix it, and restart. Never ship results from a test with SRM.
The Peeking Problem
What it is: Checking results repeatedly during a test and stopping early when you see significance. This inflates your false positive rate from 5% to 20-30%.
Why it's tempting:
"The test hit significance on Day 3! Let's stop early and ship it." Sounds efficient, but it's wrong.
Why it's wrong:
Early in a test, metrics bounce around. What looks significant on Day 3 often regresses by Day 7.
Solutions:
1. Fixed horizon: Decide sample size upfront, don't peek. Run to completion.
2. Sequential testing: Use Bayesian methods or group sequential designs that allow peeking with corrections.
3. Always Valid Inference (AVI): Use confidence sequences that remain valid at any stopping point.
Best practice: If using Frequentist methods, commit to a fixed sample size and don't peek.
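A short simulation makes the inflation visible. The sketch below runs many A/A comparisons (no true difference between arms) and contrasts a single fixed-horizon test with checking at 20 interim points and stopping at the first p < 0.05. Every parameter in it (conversion rate, sample size, number of peeks) is an illustrative assumption.

```python
# Simulation sketch: two identical arms (an A/A test), so every "significant" result is a false positive.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_sims, n_per_arm, n_peeks, base_rate = 2_000, 10_000, 20, 0.10
checkpoints = np.linspace(n_per_arm // n_peeks, n_per_arm, n_peeks, dtype=int)

def two_proportion_p(conv_a, conv_b, n):
    """Two-sided p-value for a pooled z-test on conversion counts from n users per arm."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    if se == 0:
        return 1.0
    z = (conv_b / n - conv_a / n) / se
    return 2 * norm.sf(abs(z))

peeking_fp = fixed_fp = 0
for _ in range(n_sims):
    a = rng.random(n_per_arm) < base_rate
    b = rng.random(n_per_arm) < base_rate
    # Peeking: stop at the first checkpoint where p < 0.05.
    if any(two_proportion_p(a[:n].sum(), b[:n].sum(), n) < 0.05 for n in checkpoints):
        peeking_fp += 1
    # Fixed horizon: analyze once, at the planned sample size.
    if two_proportion_p(a.sum(), b.sum(), n_per_arm) < 0.05:
        fixed_fp += 1

print(f"Fixed-horizon false positive rate: {fixed_fp / n_sims:.1%}")   # close to the nominal 5%
print(f"Peeking false positive rate:       {peeking_fp / n_sims:.1%}") # substantially higher
```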
Multiple Comparisons Problem
What it is: Testing multiple metrics or variants without adjusting significance thresholds. Each test adds more chances for a false positive.
Example:
You test 10 metrics, each with a 5% false positive rate. The chance of at least one false positive? About 40%.
The math (assuming independent tests):
P(at least one false positive) = 1 - (1 - α)^n
For α = 0.05 and n = 10: 1 - (0.95)^10 ≈ 0.40, or about 40%.
Solutions:
1. Pre-specify primary metric: Choose ONE decision metric upfront. Others are exploratory only.
2. Bonferroni correction: Divide α by the number of tests (e.g., 0.05/10 = 0.005 threshold).
3. False Discovery Rate (FDR): Control the expected share of false positives among your significant results using the Benjamini-Hochberg procedure.
Best practice: Define your success metric before the test. If you must check multiple metrics, use corrections.
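If you prefer a library, statsmodels.stats.multitest.multipletests implements both corrections. The manual sketch below shows the logic on hypothetical p-values (not real data) so you can see how Bonferroni and Benjamini-Hochberg differ.

```python
# Sketch: family-wise error rate, Bonferroni, and Benjamini-Hochberg on hypothetical p-values.
import numpy as np

alpha, n_tests = 0.05, 10
print(f"P(at least one false positive) = {1 - (1 - alpha) ** n_tests:.0%}")  # about 40%

# Hypothetical p-values for 10 metrics (illustrative, not real data).
p_values = np.array([0.001, 0.008, 0.020, 0.041, 0.049, 0.13, 0.24, 0.41, 0.62, 0.87])

# Bonferroni: compare every p-value to alpha / n_tests.
print("Bonferroni rejections:", int((p_values < alpha / n_tests).sum()))

# Benjamini-Hochberg: sort p-values, find the largest rank k with p_(k) <= (k / n) * alpha,
# then reject every hypothesis at or below that rank.
order = np.argsort(p_values)
thresholds = np.arange(1, n_tests + 1) / n_tests * alpha
passing = np.nonzero(p_values[order] <= thresholds)[0]
n_bh = passing[-1] + 1 if passing.size else 0
print("Benjamini-Hochberg rejections:", n_bh)
```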
Other Common Mistakes
Novelty Effect
Users engage more with new features initially, then behavior normalizes. Run tests for 1-2 weeks minimum to account for this.
Primacy Effect
Returning users may initially prefer the old, familiar experience and need time to adapt. Consider separating new vs returning users in analysis.
Selection Bias
Testing only on engaged users (e.g., logged-in, desktop-only) limits generalizability. Define your population carefully.
Ignoring Seasonality
Testing during Black Friday, holidays, or other unusual periods skews results. Run tests during "normal" periods when possible.
Carryover Effects
Users exposed to variant A and then variant B carry over effects from the first experience. Use a between-subjects design (each user sees only one variant).
A Checklist to Avoid Pitfalls
Before You Start:
- Define one primary metric for decision-making
- Run a power analysis to determine sample size (see the sketch after this list)
- Commit to a fixed horizon (no peeking) or use sequential methods
- Document your test plan with hypothesis, metrics, and stopping criteria
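For the power-analysis item above, here is a minimal sketch of the standard normal-approximation formula for comparing two proportions. The baseline rate and minimum detectable effect are made-up inputs; plug in your own.

```python
# Sketch: required sample size per variant for a two-proportion test (normal approximation).
import math
from scipy.stats import norm

def sample_size_per_variant(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Users needed in each arm to detect an absolute lift of mde_abs with the given power."""
    p_variant = p_baseline + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Example: 10% baseline conversion, detect a 1-point absolute lift -> about 14,700 users per arm.
print(sample_size_per_variant(p_baseline=0.10, mde_abs=0.01))
```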
While Running:
- Check for SRM daily. If detected, stop and investigate immediately
- Monitor implementation logs for errors or unusual patterns
- Don't peek at results if using fixed-horizon Frequentist approach
- Let the test run for at least 1-2 weeks to account for weekly cycles
After Results:
- Verify no SRM in final data
- Check if result is practically significant, not just statistically
- Look for segment differences (desktop vs mobile, new vs returning)
- Document learnings and share results, even if inconclusive
Key Takeaways
- ✓SRM (Sample Ratio Mismatch) indicates randomization bugs. Always check. Never ship with SRM.
- ✓Peeking inflates false positives. Use fixed-horizon or proper sequential testing methods.
- ✓Multiple comparisons increase false positive rate. Pre-specify one primary metric.
- ✓Novelty and primacy effects require longer test durations (1-2 weeks minimum).
- ✓Selection bias, seasonality, and carryover effects can all invalidate results.
- ✓Document everything upfront: hypothesis, metrics, stopping criteria, expected power.