The 10 Most Common A/B Testing Mistakes
Most A/B tests fail not because the idea was bad, but because the test itself was flawed. Learn the mistakes that invalidate results and waste months of work.
You've launched your first A/B test. The variant shows a 15% conversion lift with p < 0.05 after just two days. You ship it to 100% of users. Three weeks later, conversions are flat. What happened?
The answer lies in one or more of the ten mistakes that invalidate A/B tests. These errors are not obscure edge cases. They happen every day, even in mature experimentation programs, and they turn rigorous statistical methods into expensive coin flips.
Bad tests are worse than no tests. A flawed experiment gives you false confidence in the wrong direction, leading to shipping changes that hurt metrics, abandoning ideas that would have worked, and eroding trust in experimentation itself.
This guide walks through the ten most common mistakes beginners make, explains why each one invalidates your results, and shows you how to avoid them. At the end, you will find a pre-test checklist to run through before launching any experiment.
Peeking at Results Too Early
What it is: Checking test results before reaching your planned sample size and stopping the test when you see statistical significance.
Why it happens: Experimentation platforms show live p-values and confidence levels. The dashboard turns green at p < 0.05, and the temptation to declare victory and move on is overwhelming.
The damage: This practice, called optional stopping or the peeking problem, inflates your false positive rate dramatically. If you peek five times during a test, your actual false positive rate jumps from the intended 5% to over 14%. Continuous peeking can push it above 40%.
[Chart: false positive rate vs. number of peeks]
Real example: Statistician Evan Miller demonstrated this with a simulation. He generated random data with no real difference between groups, then "peeked" at the results 100 times. Despite there being no true effect, the test showed p < 0.05 at some point 40.1% of the time.
Every time you peek, you are essentially running a new statistical test. Each peek gives you another chance to observe a false positive by random chance.
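You can see the inflation for yourself with a short simulation. The sketch below is illustrative, not Miller's original code: it assumes a continuous metric compared with a standard t-test, no true difference between the groups, and a team that stops the moment the dashboard turns green.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_simulations = 2000    # simulated experiments with NO real difference
n_peeks = 20            # interim checks per experiment
users_per_peek = 500    # new users per variant between checks

false_positives = 0
for _ in range(n_simulations):
    control, variant = np.array([]), np.array([])
    stopped_on_green = False
    for _ in range(n_peeks):
        # Both groups come from the SAME distribution: any "lift" is noise.
        control = np.append(control, rng.normal(0, 1, users_per_peek))
        variant = np.append(variant, rng.normal(0, 1, users_per_peek))
        _, p = stats.ttest_ind(control, variant)
        if p < 0.05:
            stopped_on_green = True   # the dashboard turned green, so we "ship"
            break
    false_positives += stopped_on_green

print(f"False positive rate with up to {n_peeks} peeks: "
      f"{false_positives / n_simulations:.1%} (intended: 5.0%)")
```

With 20 peeks, the simulated false positive rate typically lands in the low-to-mid 20% range, even though every apparent winner is pure noise.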
How to avoid it
Calculate your required sample size before starting the test using power analysis. Commit to a fixed stopping rule. If you must monitor progress, use sequential testing methods designed for continuous monitoring, or simply hide the results until the planned end date.
Running Tests Without a Hypothesis
What it is: Launching a test with "Let's try this button color and see what happens" rather than a clear, testable hypothesis.
Why it happens: Excitement to start testing combined with a lack of research foundation. Teams want to move fast and feel that writing hypotheses slows them down.
The damage: Without a hypothesis, you have no learning framework. When the test comes back flat (which most tests do), you have learned nothing about user psychology or behavior. You cannot distinguish between "this change does not matter" and "we tested the wrong thing."
Real example: A team tests five button colors with no hypothesis about why color would affect behavior. The test returns a flat result. They learn that none of the colors they tried worked, but they still do not know whether color matters, whether they tested the wrong colors, or whether their traffic was too low to detect a real effect.
Hypothesis template
Use this structure: "Because we observed [data or user research finding], we believe that [specific change] will cause [measurable outcome] for [target audience]."
"Because we observed in session recordings that 40% of mobile users tap the CTA twice (indicating uncertainty about the action), we believe that changing the button text from 'Submit' to 'Get My Free Report' will increase conversion rate by at least 5% for mobile users."
How to avoid it
Ground every hypothesis in data. Use analytics to identify friction points, session recordings to observe behavior, user interviews to understand motivations, or prior experiment results. A good hypothesis is falsifiable and includes the expected mechanism of change.
Insufficient Sample Size (Underpowered Tests)
What it is: Running tests with too few users to reliably detect your target effect size.
Why it happens: Misunderstanding statistical power, traffic constraints, or simply launching tests without any sample size calculation.
The damage: Underpowered tests have high false negative rates. You miss real improvements and abandon good ideas. Research shows that 27% of randomized controlled trials published in prestigious medical journals were underpowered, unable to detect even a 50% difference in primary outcomes. In neuroscience, the median statistical power across published studies is estimated at around 20%, leading to an inconsistent and misleading literature.
[Chart: statistical power by sample size and effect size]
Real consequence: A study with 50% power has a coin-flip chance of detecting a real effect. If you run ten experiments with 50% power, half of your winning ideas will show up as "no effect," leading you to abandon them.
How to avoid it
Run a power analysis before every test. Input your baseline conversion rate, the minimum detectable effect you care about, and your desired power (80% is standard). The calculator tells you exactly how many users you need per variant. If you lack sufficient traffic, either accept a larger MDE or run the test longer.
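If you prefer to script the calculation, here is a minimal sketch using Python's statsmodels library. The baseline conversion rate and relative MDE below are placeholders; substitute your own numbers.

```python
from math import ceil

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.05     # current conversion rate (placeholder)
mde_relative = 0.10      # smallest relative lift you care about (placeholder)
target_rate = baseline_rate * (1 + mde_relative)

# Cohen's h effect size for the difference between two proportions
effect_size = proportion_effectsize(target_rate, baseline_rate)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # significance level
    power=0.80,            # standard target power
    alternative="two-sided",
)
print(f"Users needed per variant: {ceil(n_per_variant):,}")
```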
Testing Too Many Things at Once
What it is: Changing multiple elements simultaneously in a single variant, such as updating the headline, hero image, CTA text, and page layout all at once.
Why it happens: Desire to "move fast," or conflating proper multivariate testing with simply bundling several changes into one variant.
The damage: When conversions move, you cannot isolate which change caused the outcome. You cannot build institutional knowledge about what works. If the test loses, you cannot tell which element was the culprit. If the test wins, you cannot replicate the insight elsewhere.
Real example: An e-commerce site changes the hero image, headline, and CTA button simultaneously. Conversions drop 8%. The team is forced to revert all three changes at once. They never learn whether one element performed well while another tanked, or whether all three failed. Months of planning and engineering effort produced zero institutional knowledge.
The multivariate trap
True multivariate testing (testing multiple elements with factorial designs) can work, but it requires 4x to 16x more traffic depending on the number of combinations. Most teams do not have the traffic to run proper MVTs. For beginners, stick to single-variable tests.
How to avoid it
Isolate one variable per test. If you want to test headline and image, run them as separate experiments. Yes, this takes longer, but you build a library of validated insights. Each test teaches you something you can apply to future work. Speed without learning is just motion.
Choosing the Wrong Primary Metric
What it is: Optimizing for clicks when you care about revenue, tracking vanity metrics like page views instead of outcomes, or using leading indicators that do not correlate with business results.
Why it happens: Choosing metrics that move easily rather than metrics that matter. Clicks are easier to shift than purchases, so teams optimize click-through rate.
The damage: Local optimization that hurts global goals. The classic example is boosting email open rates with clickbait subject lines, only to find that downstream conversions collapse because users feel deceived.
Real example: A SaaS company optimizes for trial signups and achieves a 30% lift. Six months later, they notice trial-to-paid conversion has dropped from 12% to 7%. The signup optimization attracted less-qualified leads. They optimized a vanity metric at the expense of revenue.
Leading vs. lagging indicators
Leading indicators (clicks, signups, time on site) can be useful if they reliably predict lagging indicators (revenue, retention, LTV). Validate the correlation before optimizing leading indicators. If your leading indicator does not predict business outcomes, you are optimizing the wrong thing.
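One lightweight way to run that validation is to correlate each user's leading metric with the lagging outcome measured later. The sketch below assumes a per-user export with hypothetical file and column names; replace them with your own data.

```python
import pandas as pd
from scipy import stats

# Per-user export joining the leading indicator with the lagging outcome
# measured 90 days later. File and column names are assumptions.
df = pd.read_csv("user_metrics.csv")  # columns: user_id, started_trial, revenue_90d

r, p_value = stats.pearsonr(df["started_trial"], df["revenue_90d"])
print(f"Leading vs. lagging correlation: r = {r:.2f} (p = {p_value:.3f})")

# A weak correlation (say, |r| well below 0.3) is a warning that moving the
# leading metric may not move the business outcome at all.
```

A strong correlation does not guarantee the leading metric is safe to optimize, but a weak one tells you to pick a different primary metric before you test anything.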
How to avoid it
Align your primary metric to the business objective. If the goal is revenue, track revenue per user. If the goal is engagement, track return visits or session depth. Ask "If this metric improves by 20% but revenue stays flat, would we still care?" If the answer is no, you have the wrong metric. See practical significance for more on choosing meaningful metrics.
Ignoring Guardrail Metrics
What it is: Only monitoring your primary metric while secondary metrics degrade unnoticed. Guardrail metrics, also called counter metrics, are KPIs you watch to ensure your experiment does not cause unintended harm.
Why it happens: Tunnel vision on "winning" the primary metric. Dashboards often emphasize the primary metric and relegate everything else to secondary tabs that nobody checks.
The damage: You ship changes that boost one KPI at the expense of user experience, revenue, or retention. Research from Spotify shows that without guardrails, teams optimize local metrics while degrading global product health.
Real example: A content site tests an aggressive paywall that increases subscription rate by 18%. Nobody checks bounce rate, which has spiked from 45% to 71%. Three months later, organic search traffic has collapsed because Google downranked the site for poor user experience. The subscription win was dwarfed by the SEO loss.
Common guardrail categories:
- User Experience: bounce rate, time on site, pages per session
- Business Health: revenue, profit margin, cart abandonment
- Long-Term Value: retention, repeat purchase, LTV
How to avoid it
Define 3 to 5 guardrail metrics before launching any test. Use non-inferiority testing to ensure guardrails do not degrade beyond an acceptable threshold. Read our full guide on guardrail metrics for implementation details.
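If your platform does not offer non-inferiority tests out of the box, the check is straightforward to sketch by hand. The example below assumes a proportion-style guardrail (say, seven-day retention) and a tolerance of 0.5 percentage points; both the margin and the counts are illustrative.

```python
from math import sqrt
from scipy.stats import norm

def guardrail_holds(conv_c, n_c, conv_v, n_v, margin=0.005, alpha=0.05):
    """One-sided non-inferiority z-test for a proportion guardrail.

    H0: p_variant - p_control <= -margin  (guardrail degraded beyond tolerance)
    H1: p_variant - p_control >  -margin  (any degradation is within tolerance)
    """
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = ((p_v - p_c) + margin) / se
    p_value = 1 - norm.cdf(z)
    return p_value < alpha   # True => non-inferiority demonstrated

# Example: 45,000 of 100,000 control users retained vs. 44,900 of 100,000 in the variant
print(guardrail_holds(conv_c=45_000, n_c=100_000, conv_v=44_900, n_v=100_000))  # True
```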
Including Users Who Are Not Affected by the Change
What it is: Testing a checkout button color change on all site visitors, including the 90% who never reach the checkout page.
Why it happens: Simpler implementation. Bucketing all users into variant A or B at entry is easier than conditional assignment when they reach the affected page.
The damage: Massive dilution of effect size. If only 10% of users see the change, you need 10x more traffic to detect the same effect. Most of your sample contributes only noise.
Real example: A team tests a pricing page headline on all website traffic. Only 5% of visitors ever view the pricing page. The test runs for six weeks and shows no significant difference. Post-hoc analysis on the 5% who viewed pricing shows a 12% conversion lift, but the signal was buried in noise from the 95% who never saw the change.
[Chart: sample size required vs. share of users exposed (dilution effect)]
How to avoid it
Trigger experiment assignment only when users reach the affected page or feature. If testing a checkout flow element, assign users to variants when they add items to cart, not at site entry. Filter your analysis to users who were exposed to the change. Most experimentation platforms support conditional bucketing.
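If you are rolling your own assignment, the usual pattern is deterministic hashing triggered at the moment of exposure. The sketch below is a generic illustration; the experiment name, the add-to-cart hook, and the `track` helper are assumptions, not any specific platform's API.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministic assignment: the same user always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def track(event: str, **properties) -> None:
    """Stand-in for your analytics client's event call (assumption)."""
    print(event, properties)

def on_add_to_cart(user_id: str) -> str:
    # Assignment and the exposure event fire only when the user reaches the
    # affected step, so unexposed visitors never dilute the analysis.
    variant = assign_variant(user_id, "checkout-button-copy")
    track("experiment_exposure", user_id=user_id,
          experiment="checkout-button-copy", variant=variant)
    return variant

print(on_add_to_cart("user-42"))
```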
Not Accounting for Seasonality and External Factors
What it is: Running a test during Black Friday and assuming results will hold in February. Launching experiments during holidays, major news events, or other periods of abnormal behavior.
Why it happens: Urgency to launch combined with not thinking about calendar effects or external validity.
The damage: Your test results do not replicate. What worked during the holiday spike fails during normal periods. External validity breaks down.
Real examples: A B2B SaaS company tests during the December holidays, when decision-makers are on vacation. An e-commerce site tests during promotional events, when users behave differently than usual. An education product tests during summer, when student traffic collapses.
High-risk periods to avoid
- Major shopping holidays (Black Friday, Christmas)
- Industry events (conferences, product launches)
- School calendars (for education products)
- End of quarter, end of year (B2B)
Best practices
- Run tests for full weekly cycles (7, 14, 21 days)
- Document known external events in test notes
- Compare year-over-year for seasonal products
- Validate results with holdout groups post-launch
How to avoid it
Avoid high-variance periods, or run tests long enough to capture a full cycle. For B2B products, run tests for at least 2 to 4 weeks to cover weekday and weekend behavior. Document the date range and any known external events. When results surprise you, check whether external factors explain the outcome before acting.
Stopping When You See a Winner (Confirmation Bias)
What it is: Running a test until you get the result you hoped for, then stopping. The inverse of peeking: continuing past your planned stopping point until the result looks favorable.
Why it happens: Outcome bias combined with organizational pressure to ship. Teams misunderstand "95% confidence" to mean "5% chance I am wrong," leading them to believe one more week might tip the scales.
The damage: Creates publication bias in your experiment log. Your documented win rate looks artificially high because you only record tests that eventually turned positive. Future decisions get made on false data.
Real pattern: A team runs a test for two weeks and sees a negative 3% result. They decide to "wait for more data" and keep the test running. Week four shows a positive 2% result (by random chance). They declare victory and ship. The negative signal was real; the late positive was noise.
The sister problem to peeking
Both peeking (stopping early on positive) and cherry-picking (continuing until positive) inflate false positive rates. They are two sides of the same coin: letting the data determine your stopping rule rather than pre-committing to a plan.
How to avoid it
Pre-commit to your stopping rule before launching. Document the planned sample size or duration. When you reach that point, stop and evaluate, regardless of whether the result is positive, negative, or flat. Treat negative and null results as valuable learning. Use the duration calculator to set realistic timelines upfront.
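A stopping rule can be as simple as a pre-committed duration derived from your power analysis and your actual traffic. The numbers below are placeholders; the point is to write the duration down before launch and hold to it.

```python
from math import ceil

required_per_variant = 31_000     # from your power analysis (placeholder)
n_variants = 2
exposed_users_per_day = 9_000     # daily traffic that will actually see the change (placeholder)

days_needed = ceil(required_per_variant * n_variants / exposed_users_per_day)
# Round up to full weekly cycles so weekday and weekend behavior are balanced.
planned_duration_days = ceil(days_needed / 7) * 7
print(f"Pre-committed duration: {planned_duration_days} days "
      f"({required_per_variant:,} users per variant)")
```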
Poor or No Documentation
What it is: Not recording the hypothesis, rationale, target audience, or results. Treating experiments as one-offs rather than as contributions to institutional knowledge.
Why it happens: Moving fast and treating documentation as bureaucracy. Teams focus on launching the next test rather than synthesizing learnings from the last one.
The damage: You repeat past mistakes. New team members lack context. You cannot synthesize patterns across experiments. Three different people test the same losing idea over 18 months because nobody documented it the first time.
Real consequence: A company runs 200 experiments over two years. When leadership asks "What have we learned?", nobody can answer. The experiments exist as disconnected data points with no narrative. The organizational learning rate is zero despite high experimentation velocity.
What to document for every test
- Hypothesis: What you believed would happen and why
- Design: What changed, sample size calculation, stopping rule
- Audience: Who was included, any bucketing rules or filters
- Results: Primary metric, guardrails, statistical analysis
- Decision: What you shipped and why (or why you did not ship)
- Learning: What this taught you about users, independent of outcome
How to avoid it
Use an experiment brief template. Store all experiments in a centralized log (Notion, Confluence, or a dedicated tool). Run quarterly review sessions where the team synthesizes patterns. Treat documentation as part of the experiment, not an afterthought. Read more about building experimentation teams that prioritize institutional learning.
Your Pre-Test Checklist: Avoid These Mistakes Before You Launch
Before launching your next A/B test, run through this checklist. Each item maps to one of the mistakes above. If you can check all twelve boxes, your test is set up for success.
Written hypothesis with expected outcome and rationale
Primary metric defined and aligned to business goal
3 to 5 guardrail metrics identified
Sample size calculated via power analysis
Minimum detectable effect (MDE) is practically significant
Statistical power ≥ 80%
Test duration accounts for weekly cycles (7+ days)
No major holidays or external events during test window
Bucketing limited to affected users only
Stopping rule pre-defined (time-based or sequential)
Documentation template ready (hypothesis, design, audience, success criteria)
Team aligned on decision framework (what happens with flat/negative/positive result)
Running through these twelve items before each launch takes five minutes and can save weeks of wasted effort on flawed tests.
References
- Miller, E. How Not To Run an A/B Test
- Georgiev, G. Underpowered A/B Tests – Confusions, Myths, and Reality
- Spotify Engineering. Risk-Aware Product Decisions in A/B Tests with Multiple Metrics
- PostHog. A/B testing mistakes I learned the hard way
- Reinhart, A. Statistical power and underpowered statistics
- Button, K. et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience
- CXL. 12 A/B Split Testing Mistakes I See All the Time