The 10 Most Common A/B Testing Mistakes
Most A/B tests fail not because the idea was bad, but because the test itself was flawed. Learn the mistakes that invalidate results and waste months of work.
You've launched your first A/B test. The variant shows a 15% conversion lift with p < 0.05 after just two days. You ship it to 100% of users. Three weeks later, conversions are flat. What happened?
The answer lies in one or more of the ten mistakes that invalidate A/B tests. These errors are not obscure edge cases. They happen every day, even in mature experimentation programs, and they turn rigorous statistical methods into expensive coin flips.
Bad tests are worse than no tests. A flawed experiment gives you false confidence in the wrong direction, leading to shipping changes that hurt metrics, abandoning ideas that would have worked, and eroding trust in experimentation itself.
This guide walks through the ten most common mistakes beginners make, explains why each one invalidates your results, and shows you how to avoid them. At the end, you will find a pre-test checklist to run through before launching any experiment.
Peeking at Results Too Early
What it is: Checking test results before reaching your planned sample size and stopping the test when you see statistical significance.
Why it happens: Experimentation platforms show live p-values and confidence levels. The dashboard turns green at p < 0.05, and the temptation to declare victory and move on is overwhelming.
The damage: This practice, called optional stopping or the peeking problem, inflates your false positive rate dramatically. If you peek five times during a test, your actual false positive rate jumps from the intended 5% to over 14%. Continuous peeking can push it above 40%.
[Chart: false positive rate vs. number of peeks]
Real example: Statistician Evan Miller demonstrated this with a simulation. He generated random data with no real difference between groups, then "peeked" at the results 100 times. Despite there being no true effect, the test showed p < 0.05 at some point 40.1% of the time.
Every time you peek, you are essentially running a new statistical test. Each peek gives you another chance to observe a false positive by random chance.
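You can see the inflation for yourself with a short simulation. The sketch below is illustrative, not Miller's original code: it assumes a continuous metric compared with a standard t-test, no true difference between the groups, and a team that stops the moment the dashboard turns green.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_simulations = 2000    # simulated experiments with NO real difference
n_peeks = 20            # interim checks per experiment
users_per_peek = 500    # new users per variant between checks

false_positives = 0
for _ in range(n_simulations):
    control, variant = np.array([]), np.array([])
    stopped_on_green = False
    for _ in range(n_peeks):
        # Both groups come from the SAME distribution: any "lift" is noise.
        control = np.append(control, rng.normal(0, 1, users_per_peek))
        variant = np.append(variant, rng.normal(0, 1, users_per_peek))
        _, p = stats.ttest_ind(control, variant)
        if p < 0.05:
            stopped_on_green = True   # the dashboard turned green, so we "ship"
            break
    false_positives += stopped_on_green

print(f"False positive rate with up to {n_peeks} peeks: "
      f"{false_positives / n_simulations:.1%} (intended: 5.0%)")
```

With 20 peeks, the simulated false positive rate typically lands in the low-to-mid 20% range, even though every apparent winner is pure noise.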
How to avoid it
Calculate your required sample size before starting the test using power analysis. Commit to a fixed stopping rule. If you must monitor progress, use sequential testing methods designed for continuous monitoring, or simply hide the results until the planned end date.
Running Tests Without a Hypothesis
What it is: Launching a test with "Let's try this button color and see what happens" rather than a clear, testable hypothesis.
Why it happens: Excitement to start testing combined with a lack of research foundation. Teams want to move fast and feel that writing hypotheses slows them down.
The damage: Without a hypothesis, you have no learning framework. When the test comes back flat (which most tests do), you have learned nothing about user psychology or behavior. You cannot distinguish between "this change does not matter" and "we tested the wrong thing."
Real example: A team tests five button colors with no hypothesis about why color would affect behavior. The test returns a flat result. They learn that none of the colors they tried worked, but they still do not know whether color matters, whether they tested the wrong colors, or whether their traffic was too low to detect a real effect.
Hypothesis template
Use this structure: "Because we observed [data or user research finding], we believe that [specific change] will cause [measurable outcome] for [target audience]."
"Because we observed in session recordings that 40% of mobile users tap the CTA twice (indicating uncertainty about the action), we believe that changing the button text from 'Submit' to 'Get My Free Report' will increase conversion rate by at least 5% for mobile users."
How to avoid it
Ground every hypothesis in data. Use analytics to identify friction points, session recordings to observe behavior, user interviews to understand motivations, or prior experiment results. A good hypothesis is falsifiable and includes the expected mechanism of change.
Insufficient Sample Size (Underpowered Tests)
What it is: Running tests with too few users to reliably detect your target effect size.
Why it happens: Misunderstanding statistical power, traffic constraints, or simply launching tests without any sample size calculation.
The damage: Underpowered tests have high false negative rates. You miss real improvements and abandon good ideas. Research shows that 27% of randomized controlled trials published in prestigious medical journals were underpowered, unable to detect even a 50% difference in primary outcomes. In neuroscience, the median statistical power across published studies is estimated at around 20%, leading to an inconsistent and misleading literature.
[Chart: statistical power by sample size and effect size]
Real consequence: A study with 50% power has a coin-flip chance of detecting a real effect. If you run ten experiments with 50% power, half of your winning ideas will show up as "no effect," leading you to abandon them.
How to avoid it
Run a power analysis before every test. Input your baseline conversion rate, the minimum detectable effect you care about, and your desired power (80% is standard). The calculator tells you exactly how many users you need per variant. If you lack sufficient traffic, either accept a larger MDE or run the test longer.
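If you prefer to script the calculation, here is a minimal sketch using Python's statsmodels library. The baseline conversion rate and relative MDE below are placeholders; substitute your own numbers.

```python
from math import ceil

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.05     # current conversion rate (placeholder)
mde_relative = 0.10      # smallest relative lift you care about (placeholder)
target_rate = baseline_rate * (1 + mde_relative)

# Cohen's h effect size for the difference between two proportions
effect_size = proportion_effectsize(target_rate, baseline_rate)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # significance level
    power=0.80,            # standard target power
    alternative="two-sided",
)
print(f"Users needed per variant: {ceil(n_per_variant):,}")
```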
Testing Too Many Things at Once
What it is: Changing multiple elements simultaneously in a single variant, such as updating the headline, hero image, CTA text, and page layout all at once.
Why it happens: Desire to "move fast," or conflating proper multivariate testing with simply bundling several changes into one variant.
The damage: When conversions move, you cannot isolate which change caused the outcome. You cannot build institutional knowledge about what works. If the test loses, you cannot tell which element was the culprit. If the test wins, you cannot replicate the insight elsewhere.
Real example: An e-commerce site changes the hero image, headline, and CTA button simultaneously. Conversions drop 8%. The team is forced to revert all three changes at once. They never learn whether one element performed well while another tanked, or whether all three failed. Months of planning and engineering effort produced zero institutional knowledge.
The multivariate trap
True multivariate testing (testing multiple elements with factorial designs) can work, but it requires 4x to 16x more traffic depending on the number of combinations. Most teams do not have the traffic to run proper MVTs. For beginners, stick to single-variable tests.
How to avoid it
Isolate one variable per test. If you want to test headline and image, run them as separate experiments. Yes, this takes longer, but you build a library of validated insights. Each test teaches you something you can apply to future work. Speed without learning is just motion.
Choosing the Wrong Primary Metric
What it is: Optimizing for clicks when you care about revenue, tracking vanity metrics like page views instead of outcomes, or using leading indicators that do not correlate with business results.
Why it happens: Choosing metrics that move easily rather than metrics that matter. Clicks are easier to shift than purchases, so teams optimize click-through rate.
The damage: Local optimization that hurts global goals. The classic example is boosting email open rates with clickbait subject lines, only to find that downstream conversions collapse because users feel deceived.
Real example: A SaaS company optimizes for trial signups and achieves a 30% lift. Six months later, they notice trial-to-paid conversion has dropped from 12% to 7%. The signup optimization attracted less-qualified leads. They optimized a vanity metric at the expense of revenue.
Leading vs. lagging indicators
Leading indicators (clicks, signups, time on site) can be useful if they reliably predict lagging indicators (revenue, retention, LTV). Validate the correlation before optimizing leading indicators. If your leading indicator does not predict business outcomes, you are optimizing the wrong thing.
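One lightweight way to run that validation is to correlate each user's leading metric with the lagging outcome measured later. The sketch below assumes a per-user export with hypothetical file and column names; replace them with your own data.

```python
import pandas as pd
from scipy import stats

# Per-user export joining the leading indicator with the lagging outcome
# measured 90 days later. File and column names are assumptions.
df = pd.read_csv("user_metrics.csv")  # columns: user_id, started_trial, revenue_90d

r, p_value = stats.pearsonr(df["started_trial"], df["revenue_90d"])
print(f"Leading vs. lagging correlation: r = {r:.2f} (p = {p_value:.3f})")

# A weak correlation (say, |r| well below 0.3) is a warning that moving the
# leading metric may not move the business outcome at all.
```

A strong correlation does not guarantee the leading metric is safe to optimize, but a weak one tells you to pick a different primary metric before you test anything.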
How to avoid it
Align your primary metric to the business objective. If the goal is revenue, track revenue per user. If the goal is engagement, track return visits or session depth. Ask "If this metric improves by 20% but revenue stays flat, would we still care?" If the answer is no, you have the wrong metric. See practical significance for more on choosing meaningful metrics.
Ignoring Guardrail Metrics
What it is: Only monitoring your primary metric while secondary metrics degrade unnoticed. Guardrail metrics, also called counter metrics, are KPIs you watch to ensure your experiment does not cause unintended harm.
Why it happens: Tunnel vision on "winning" the primary metric. Dashboards often emphasize the primary metric and relegate everything else to secondary tabs that nobody checks.
The damage: You ship changes that boost one KPI at the expense of user experience, revenue, or retention. Research from Spotify shows that without guardrails, teams optimize local metrics while degrading global product health.
Real example: A content site tests an aggressive paywall that increases subscription rate by 18%. Nobody checks bounce rate, which has spiked from 45% to 71%. Three months later, organic search traffic has collapsed because Google downranked the site for poor user experience. The subscription win was dwarfed by the SEO loss.
Common guardrail categories:
- User Experience: bounce rate, time on site, pages per session
- Business Health: revenue, profit margin, cart abandonment
- Long-Term Value: retention, repeat purchase, LTV
How to avoid it
Define 3 to 5 guardrail metrics before launching any test. Use non-inferiority testing to ensure guardrails do not degrade beyond an acceptable threshold. Read our full guide on guardrail metrics for implementation details.
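If your platform does not offer non-inferiority tests out of the box, the check is straightforward to sketch by hand. The example below assumes a proportion-style guardrail (say, seven-day retention) and a tolerance of 0.5 percentage points; both the margin and the counts are illustrative.

```python
from math import sqrt
from scipy.stats import norm

def guardrail_holds(conv_c, n_c, conv_v, n_v, margin=0.005, alpha=0.05):
    """One-sided non-inferiority z-test for a proportion guardrail.

    H0: p_variant - p_control <= -margin  (guardrail degraded beyond tolerance)
    H1: p_variant - p_control >  -margin  (any degradation is within tolerance)
    """
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = ((p_v - p_c) + margin) / se
    p_value = 1 - norm.cdf(z)
    return p_value < alpha   # True => non-inferiority demonstrated

# Example: 45,000 of 100,000 control users retained vs. 44,900 of 100,000 in the variant
print(guardrail_holds(conv_c=45_000, n_c=100_000, conv_v=44_900, n_v=100_000))  # True
```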
Including Users Who Are Not Affected by the Change
What it is: Testing a checkout button color change on all site visitors, including the 90% who never reach the checkout page.
Why it happens: Simpler implementation. Bucketing all users into variant A or B at entry is easier than conditional assignment when they reach the affected page.
The damage: Massive dilution of effect size. If only 10% of users see the change, you need 10x more traffic to detect the same effect. Most of your sample contributes only noise.
Real example: A team tests a pricing page headline on all website traffic. Only 5% of visitors ever view the pricing page. The test runs for six weeks and shows no significant difference. Post-hoc analysis on the 5% who viewed pricing shows a 12% conversion lift, but the signal was buried in noise from the 95% who never saw the change.
[Chart: sample size required vs. share of users exposed (dilution effect)]
How to avoid it
Trigger experiment assignment only when users reach the affected page or feature. If testing a checkout flow element, assign users to variants when they add items to cart, not at site entry. Filter your analysis to users who were exposed to the change. Most experimentation platforms support conditional bucketing.
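If you are rolling your own assignment, the usual pattern is deterministic hashing triggered at the moment of exposure. The sketch below is a generic illustration; the experiment name, the add-to-cart hook, and the `track` helper are assumptions, not any specific platform's API.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministic assignment: the same user always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def track(event: str, **properties) -> None:
    """Stand-in for your analytics client's event call (assumption)."""
    print(event, properties)

def on_add_to_cart(user_id: str) -> str:
    # Assignment and the exposure event fire only when the user reaches the
    # affected step, so unexposed visitors never dilute the analysis.
    variant = assign_variant(user_id, "checkout-button-copy")
    track("experiment_exposure", user_id=user_id,
          experiment="checkout-button-copy", variant=variant)
    return variant

print(on_add_to_cart("user-42"))
```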
Not Accounting for Seasonality and External Factors
What it is: Running a test during Black Friday and assuming results will hold in February. Launching experiments during holidays, major news events, or other periods of abnormal behavior.
Why it happens: Urgency to launch combined with not thinking about calendar effects or external validity.
The damage: Your test results do not replicate. What worked during the holiday spike fails during normal periods. External validity breaks down.
Real examples: A B2B SaaS company tests during the December holidays, when decision-makers are on vacation. An e-commerce site tests during promotional events, when users behave differently than usual. An education product tests during summer, when student traffic collapses.
High-risk periods to avoid
- Major shopping holidays (Black Friday, Christmas)
- Industry events (conferences, product launches)
- School calendars (for education products)
- End of quarter, end of year (B2B)
Best practices
- Run tests for full weekly cycles (7, 14, 21 days)
- Document known external events in test notes
- Compare year-over-year for seasonal products
- Validate results with holdout groups post-launch
How to avoid it
Avoid high-variance periods, or run tests long enough to capture a full cycle. For B2B products, run tests for at least 2 to 4 weeks to cover weekday and weekend behavior. Document the date range and any known external events. When results surprise you, check whether external factors explain the outcome before acting.
Stopping When You See a Winner (Confirmation Bias)
What it is: Running a test until you get the result you hoped for, then stopping. The inverse of peeking: continuing past your planned stopping point until the result looks favorable.
Why it happens: Outcome bias combined with organizational pressure to ship. Teams misunderstand "95% confidence" to mean "5% chance I am wrong," leading them to believe one more week might tip the scales.
The damage: Creates publication bias in your experiment log. Your documented win rate looks artificially high because you only record tests that eventually turned positive. Future decisions get made on false data.
Real pattern: A team runs a test for two weeks and sees a negative 3% result. They decide to "wait for more data" and keep the test running. Week four shows a positive 2% result (by random chance). They declare victory and ship. The negative signal was real; the late positive was noise.
The sister problem to peeking
Both peeking (stopping early on positive) and cherry-picking (continuing until positive) inflate false positive rates. They are two sides of the same coin: letting the data determine your stopping rule rather than pre-committing to a plan.
How to avoid it
Pre-commit to your stopping rule before launching. Document the planned sample size or duration. When you reach that point, stop and evaluate, regardless of whether the result is positive, negative, or flat. Treat negative and null results as valuable learning. Use the duration calculator to set realistic timelines upfront.
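A stopping rule can be as simple as a pre-committed duration derived from your power analysis and your actual traffic. The numbers below are placeholders; the point is to write the duration down before launch and hold to it.

```python
from math import ceil

required_per_variant = 31_000     # from your power analysis (placeholder)
n_variants = 2
exposed_users_per_day = 9_000     # daily traffic that will actually see the change (placeholder)

days_needed = ceil(required_per_variant * n_variants / exposed_users_per_day)
# Round up to full weekly cycles so weekday and weekend behavior are balanced.
planned_duration_days = ceil(days_needed / 7) * 7
print(f"Pre-committed duration: {planned_duration_days} days "
      f"({required_per_variant:,} users per variant)")
```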
Poor or No Documentation
What it is: Not recording the hypothesis, rationale, target audience, or results. Treating experiments as one-offs rather than as contributions to institutional knowledge.
Why it happens: Moving fast and treating documentation as bureaucracy. Teams focus on launching the next test rather than synthesizing learnings from the last one.
The damage: You repeat past mistakes. New team members lack context. You cannot synthesize patterns across experiments. Three different people test the same losing idea over 18 months because nobody documented it the first time.
Real consequence: A company runs 200 experiments over two years. When leadership asks "What have we learned?", nobody can answer. The experiments exist as disconnected data points with no narrative. The organizational learning rate is zero despite high experimentation velocity.
What to document for every test
- Hypothesis: What you believed would happen and why
- Design: What changed, sample size calculation, stopping rule
- Audience: Who was included, any bucketing rules or filters
- Results: Primary metric, guardrails, statistical analysis
- Decision: What you shipped and why (or why you did not ship)
- Learning: What this taught you about users, independent of outcome
How to avoid it
Use an experiment brief template. Store all experiments in a centralized log (Notion, Confluence, or a dedicated tool). Run quarterly review sessions where the team synthesizes patterns. Treat documentation as part of the experiment, not an afterthought. Read more about building experimentation teams that prioritize institutional learning.
Your Pre-Test Checklist: Avoid These Mistakes Before You Launch
Before launching your next A/B test, run through this checklist. Each item maps to one of the mistakes above. If you can check all twelve boxes, your test is set up for success.
Written hypothesis with expected outcome and rationale
Primary metric defined and aligned to business goal
3 to 5 guardrail metrics identified
Sample size calculated via power analysis
Minimum detectable effect (MDE) is practically significant
Statistical power ≥ 80%
Test duration accounts for weekly cycles (7+ days)
No major holidays or external events during test window
Bucketing limited to affected users only
Stopping rule pre-defined (time-based or sequential)
Documentation template ready (hypothesis, design, audience, success criteria)
Team aligned on decision framework (what happens with flat/negative/positive result)
Running through these twelve items before each launch takes five minutes and can save weeks of wasted effort on flawed tests.
References
- Miller, E. How Not To Run an A/B Test
- Georgiev, G. Underpowered A/B Tests – Confusions, Myths, and Reality
- Spotify Engineering. Risk-Aware Product Decisions in A/B Tests with Multiple Metrics
- PostHog. A/B testing mistakes I learned the hard way
- Reinhart, A. Statistical power and underpowered statistics
- Button, K. et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience
- CXL. 12 A/B Split Testing Mistakes I See All the Time