Common Pitfalls: Avoiding Test Failures
SRM, peeking problems, multiple comparisons, and other mistakes that invalidate results.
What Separates Experts from Beginners
You've learned the fundamentals, the statistics, and the methodology. Now let's cover the mistakes that invalidate even well-designed tests. Knowing these pitfalls is what separates experts from beginners.
These aren't theoretical concerns; they're real issues that plague production A/B tests every day. Recognizing and avoiding them will save you from wasted time and bad decisions.
The Big Three Pitfalls
Sample Ratio Mismatch (SRM)
What it is: When the actual split doesn't match the intended split. If you wanted 50/50 but got 52/48, something's wrong with your randomization.
Example:
Expected: 10,000 users per variant
Actual: Control 10,523 / Variant 9,477
SRM detected!
Why it matters:
It points to a bug in randomization or assignment: browser redirect issues, bot filtering, or other technical errors. Whatever the cause, it invalidates your test.
How to detect:
Run a chi-squared goodness-of-fit test on the observed counts against the intended split. A common convention is to flag SRM when the p-value is below 0.001.
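To make the check concrete, here is a minimal sketch using scipy.stats.chisquare with the counts from the example above; the 0.001 threshold follows the convention mentioned here.

```python
# Minimal SRM check: chi-squared goodness-of-fit test against the intended 50/50 split.
from scipy.stats import chisquare

observed = [10_523, 9_477]           # actual users per variant (from the example above)
expected = [sum(observed) / 2] * 2   # intended 50/50 allocation

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}")

# Conventional threshold: flag SRM when p < 0.001.
if p_value < 0.001:
    print("SRM detected: stop the test and investigate the randomization bug.")
```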
What to do:
Stop the test immediately. Investigate the bug, fix it, and restart. Never ship results from a test with SRM.
The Peeking Problem
What it is: Checking results repeatedly during a test and stopping early when you see significance. This inflates your false positive rate from 5% to 20-30%.
Why it's tempting:
"The test hit significance on Day 3! Let's stop early and ship it." Sounds efficient, but it's wrong.
Why it's wrong:
Early in a test, metrics bounce around. What looks significant on Day 3 often regresses by Day 7.
Solutions:
1. Fixed horizon: Decide sample size upfront, don't peek. Run to completion.
2. Sequential testing: Use Bayesian methods or group sequential designs that allow peeking with corrections.
3. Always Valid Inference (AVI): Use confidence sequences that remain valid at any stopping point.
Best practice: If using Frequentist methods, commit to a fixed sample size and don't peek.
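A short simulation makes the inflation visible. The sketch below runs many A/A comparisons (no true difference between arms) and contrasts a single fixed-horizon test with checking at 20 interim points and stopping at the first p < 0.05. Every parameter in it (conversion rate, sample size, number of peeks) is an illustrative assumption.

```python
# Simulation sketch: two identical arms (an A/A test), so every "significant" result is a false positive.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_sims, n_per_arm, n_peeks, base_rate = 2_000, 10_000, 20, 0.10
checkpoints = np.linspace(n_per_arm // n_peeks, n_per_arm, n_peeks, dtype=int)

def two_proportion_p(conv_a, conv_b, n):
    """Two-sided p-value for a pooled z-test on conversion counts from n users per arm."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    if se == 0:
        return 1.0
    z = (conv_b / n - conv_a / n) / se
    return 2 * norm.sf(abs(z))

peeking_fp = fixed_fp = 0
for _ in range(n_sims):
    a = rng.random(n_per_arm) < base_rate
    b = rng.random(n_per_arm) < base_rate
    # Peeking: stop at the first checkpoint where p < 0.05.
    if any(two_proportion_p(a[:n].sum(), b[:n].sum(), n) < 0.05 for n in checkpoints):
        peeking_fp += 1
    # Fixed horizon: analyze once, at the planned sample size.
    if two_proportion_p(a.sum(), b.sum(), n_per_arm) < 0.05:
        fixed_fp += 1

print(f"Fixed-horizon false positive rate: {fixed_fp / n_sims:.1%}")   # close to the nominal 5%
print(f"Peeking false positive rate:       {peeking_fp / n_sims:.1%}") # substantially higher
```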
Multiple Comparisons Problem
What it is: Testing multiple metrics or variants without adjusting significance thresholds. Each test adds more chances for a false positive.
Example:
You test 10 metrics, each with a 5% false positive rate. The chance of at least one false positive? About 40%.
The math (assuming independent tests):
P(at least one false positive) = 1 - (1 - α)^n
For α = 0.05 and n = 10: 1 - (0.95)^10 ≈ 0.40, or about 40%.
Solutions:
1. Pre-specify primary metric: Choose ONE decision metric upfront. Others are exploratory only.
2. Bonferroni correction: Divide α by the number of tests (e.g., 0.05/10 = 0.005 threshold).
3. False Discovery Rate (FDR): Control the expected share of false positives among your significant results using the Benjamini-Hochberg procedure.
Best practice: Define your success metric before the test. If you must check multiple metrics, use corrections.
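If you prefer a library, statsmodels.stats.multitest.multipletests implements both corrections. The manual sketch below shows the logic on hypothetical p-values (not real data) so you can see how Bonferroni and Benjamini-Hochberg differ.

```python
# Sketch: family-wise error rate, Bonferroni, and Benjamini-Hochberg on hypothetical p-values.
import numpy as np

alpha, n_tests = 0.05, 10
print(f"P(at least one false positive) = {1 - (1 - alpha) ** n_tests:.0%}")  # about 40%

# Hypothetical p-values for 10 metrics (illustrative, not real data).
p_values = np.array([0.001, 0.008, 0.020, 0.041, 0.049, 0.13, 0.24, 0.41, 0.62, 0.87])

# Bonferroni: compare every p-value to alpha / n_tests.
print("Bonferroni rejections:", int((p_values < alpha / n_tests).sum()))

# Benjamini-Hochberg: sort p-values, find the largest rank k with p_(k) <= (k / n) * alpha,
# then reject every hypothesis at or below that rank.
order = np.argsort(p_values)
thresholds = np.arange(1, n_tests + 1) / n_tests * alpha
passing = np.nonzero(p_values[order] <= thresholds)[0]
n_bh = passing[-1] + 1 if passing.size else 0
print("Benjamini-Hochberg rejections:", n_bh)
```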
Other Common Mistakes
Novelty Effect
Users engage more with new features initially, then behavior normalizes. Run tests for 1-2 weeks minimum to account for this.
Primacy Effect
Returning users may initially prefer the old, familiar experience and need time to adapt. Consider separating new vs returning users in analysis.
Selection Bias
Testing only on engaged users (e.g., logged-in, desktop-only) limits generalizability. Define your population carefully.
Ignoring Seasonality
Testing during Black Friday, holidays, or other unusual periods skews results. Run tests during "normal" periods when possible.
Carryover Effects
Users exposed to variant A and then variant B carry over effects from the first experience. Use a between-subjects design (each user sees only one variant).
A Checklist to Avoid Pitfalls
Before You Start:
- Define one primary metric for decision-making
- Run a power analysis to determine sample size (see the sketch after this list)
- Commit to a fixed horizon (no peeking) or use sequential methods
- Document your test plan with hypothesis, metrics, and stopping criteria
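For the power-analysis item above, here is a minimal sketch of the standard normal-approximation formula for comparing two proportions. The baseline rate and minimum detectable effect are made-up inputs; plug in your own.

```python
# Sketch: required sample size per variant for a two-proportion test (normal approximation).
import math
from scipy.stats import norm

def sample_size_per_variant(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Users needed in each arm to detect an absolute lift of mde_abs with the given power."""
    p_variant = p_baseline + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Example: 10% baseline conversion, detect a 1-point absolute lift -> about 14,700 users per arm.
print(sample_size_per_variant(p_baseline=0.10, mde_abs=0.01))
```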
While Running:
- Check for SRM daily. If detected, stop and investigate immediately
- Monitor implementation logs for errors or unusual patterns
- Don't peek at results if using fixed-horizon Frequentist approach
- Let the test run for at least 1-2 weeks to account for weekly cycles
After Results:
- Verify no SRM in final data
- Check if result is practically significant, not just statistically
- Look for segment differences (desktop vs mobile, new vs returning)
- Document learnings and share results, even if inconclusive
Key Takeaways
- ✓SRM (Sample Ratio Mismatch) indicates randomization bugs. Always check. Never ship with SRM.
- ✓Peeking inflates false positives. Use fixed-horizon or proper sequential testing methods.
- ✓Multiple comparisons increase false positive rate. Pre-specify one primary metric.
- ✓Novelty and primacy effects require longer test durations (1-2 weeks minimum).
- ✓Selection bias, seasonality, and carryover effects can all invalidate results.
- ✓Document everything upfront: hypothesis, metrics, stopping criteria, expected power.