Expert Articles
Our own guides and curated articles from industry experts on A/B testing methodology, statistical testing, sample sizing, and experimentation frameworks.
A curated collection of articles on A/B testing, statistical analysis, and experimentation strategy. These resources cover everything from sample size calculations and test design to advanced topics like multiple comparisons and interaction effects. Each article includes a summary and key takeaways to help you apply these concepts to your own experiments.
Our Articles
In-depth articles written by our team on experimentation methodology and decision making.
From the Community
Deep dives into specific topics and methodologies from industry experts.
What You Will Learn From These Articles
This collection covers the core topics that experimentation practitioners encounter regularly. Each article addresses a specific problem with concrete guidance rather than general advice. Below you will find an overview by topic, followed by individual summaries and key takeaways for every article.
Experiment Design & Frameworks
How to decide what to test, how to document experiments so the reasoning survives team changes, and how to balance safe incremental tests with higher-risk exploratory ones. The articles on experiment briefs and explore vs exploit provide structured frameworks for portfolio-level test planning, while the scaling case study shows how these practices evolve as organizations grow.
Statistical Testing Methods
The choice of test type has direct consequences for required sample size, error rates, and what conclusions you can draw. These articles cover the one-tailed vs two-tailed debate with quantified tradeoffs, Kohavi's case for two-sided tests and TOST equivalence testing, and non-inferiority testing for situations where “not worse” is the goal.
Sample Size & Statistical Power
Running a test without adequate power is the most common and most expensive mistake in experimentation. These articles cover the four factors that determine sample size, debunk persistent myths about post-hoc power analysis that lead teams to reject valid results, and provide formulas for computing required samples. Use our sample size calculator to apply these concepts.
Advanced Topics
Topics that matter once you move beyond basic conversion rate testing. Multiple comparisons explains why testing many metrics inflates false positives and how corrections like Bonferroni and Benjamini-Hochberg work. The non-binomial metrics article shows how to apply t-tests to revenue and other continuous metrics. Interaction effects covers what happens when overlapping experiments influence each other.
Article Summaries & Key Takeaways
What each article covers and the most important points to take away from it. For additional context on the statistical concepts discussed here, see our significance calculator and A/B testing glossary.
Explore vs Exploit: Finding the Balance in CRO
by David Sanchez
Read the full article
What this article covers
Applies the explore/exploit tradeoff from probability theory to CRO portfolios. Proposes frameworks like the 70-20-10 rule for balancing incremental optimization with bold hypothesis testing, with ratios shifting based on organizational maturity.
Key takeaways
1. Exploitation keeps revenue flowing but risks competitive obsolescence; exploration creates future growth but has higher failure rates. Neither alone is a viable strategy.
2. The 70-20-10 model (70% core optimization, 20% adjacent innovation, 10% transformational bets) provides a practical starting ratio, adapted from Google's resource allocation approach.
3. The right balance depends on company maturity: pre-product-market-fit companies should explore almost exclusively, while mature organizations shift toward exploitation with structured exploration.
Use Experiment Briefs to Design Better Experiments
by Bhavik Patel
Read the full article
What this article covers
Introduces experiment briefs as structured templates covering four phases: Plan, Configure, Monitor, and Analyze. Forces teams to document hypotheses, predetermined actions for each outcome, and sample size requirements before running any test.
Key takeaways
1. A "good experiment" is a well-designed experiment, not one that produces a winning result. Briefs enforce this distinction by separating design quality from outcome.
2. Predetermined actions prevent post-hoc rationalization. Teams decide what they will do with a win, loss, or flat result before seeing data.
3. Experiment briefs create institutional knowledge that survives team turnover, making the reasoning behind past decisions accessible to future team members.
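To show what the Plan and Analyze phases might capture in practice, here is a minimal sketch of a brief as a data structure. The field names and example values are our own illustration, not the template from the article.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentBrief:
    """Illustrative experiment brief; field names are hypothetical, not the article's template."""
    # Plan
    hypothesis: str
    primary_metric: str
    minimum_detectable_effect: float   # relative lift the test must be able to detect
    required_sample_per_arm: int       # from a power calculation done before launch
    # Configure
    variants: list = field(default_factory=list)
    traffic_allocation: float = 0.5    # share of eligible traffic per arm
    # Monitor
    guardrail_metrics: list = field(default_factory=list)
    # Analyze: predetermined actions, decided before any data is seen
    action_if_win: str = ""
    action_if_loss: str = ""
    action_if_flat: str = ""

brief = ExperimentBrief(
    hypothesis="Shorter checkout form increases completed purchases",
    primary_metric="purchase_conversion_rate",
    minimum_detectable_effect=0.03,
    required_sample_per_arm=52_000,
    variants=["control", "short_form"],
    guardrail_metrics=["average_order_value", "refund_rate"],
    action_if_win="Roll out to 100% and archive the long form",
    action_if_loss="Keep control; investigate drop-off step before retesting",
    action_if_flat="Keep control; deprioritize form-length hypotheses",
)
```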
Facts and Fictions About Multiple Comparisons
by Gail M. Sullivan
Read the full article
What this article covers
Examines how running multiple statistical tests inflates false positive rates. With just 13 independent comparisons at alpha 0.05, the probability of at least one spurious significant result approaches 50%. Covers correction methods from conservative (Bonferroni) to more nuanced (Benjamini-Hochberg).
Key takeaways
1. The probability of at least one false positive grows rapidly with each additional test: 10 comparisons at alpha 0.05 produce a 40% chance of a spurious significant result.
2. Bonferroni correction (dividing alpha by the number of comparisons) is simple but conservative. Benjamini-Hochberg controls the false discovery rate and is often more practical for experimentation.
3. Prespecifying which comparisons you plan to run, and how many, is the single most effective defense against multiple comparison problems.
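To make these numbers concrete, here is a minimal sketch of the family-wise false positive calculation together with Bonferroni and Benjamini-Hochberg adjustments. The p-values are made up for illustration and are not taken from the article.

```python
import numpy as np

alpha = 0.05

# Family-wise false positive probability for k independent tests at alpha:
# P(at least one false positive) = 1 - (1 - alpha)^k
for k in (1, 5, 10, 13, 20):
    print(k, round(1 - (1 - alpha) ** k, 3))   # 10 -> ~0.40, 13 -> ~0.49

# Illustrative p-values from one experiment with several metrics (made up).
p = np.array([0.003, 0.012, 0.034, 0.041, 0.220, 0.510])
m = len(p)

# Bonferroni: compare each p-value to alpha / m.
bonferroni_significant = p < alpha / m

# Benjamini-Hochberg: find the largest i with p_(i) <= (i/m) * alpha, then
# reject that hypothesis and all with smaller p-values. Controls the false
# discovery rate rather than the family-wise error rate.
order = np.argsort(p)
thresholds = (np.arange(1, m + 1) / m) * alpha
passed = p[order] <= thresholds
k_max = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
bh_significant = np.zeros(m, dtype=bool)
bh_significant[order[:k_max]] = True

print("Bonferroni:", bonferroni_significant)
print("Benjamini-Hochberg:", bh_significant)
```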
Experimentation: From startup to multiple agile teams
by Positive John
Read the full article
What this article covers
A case study tracing how an organization scaled its experimentation practice from ad-hoc testing in a startup environment to systematic experimentation across multiple agile teams. Covers the organizational and cultural shifts required at each growth stage.
Key takeaways
1. Early-stage experimentation is about building the habit: any test is better than no test, even if the methodology is imperfect.
2. Scaling experimentation requires dedicated ownership. Without someone responsible for test quality and prioritization, experimentation degrades as teams grow.
3. Cross-team experimentation needs shared infrastructure and standards to prevent conflicting tests and ensure consistent statistical rigor.
One-tailed vs Two-tailed Tests of Significance in A/B Testing
by Georgi Georgiev
Read the full article
What this article covers
Argues that one-tailed tests are preferable for most A/B testing scenarios because practitioners act differently depending on whether a result is positive or negative. One-tailed tests require 20-60% smaller sample sizes for the same statistical power and eliminate directional sign errors.
Key takeaways
1. The choice between one-tailed and two-tailed tests depends on your decision framework, not on statistical accuracy. Both are equally valid; they answer different questions.
2. One-tailed tests require substantially smaller samples (20-60% less traffic) to achieve the same power, directly reducing test duration.
3. Two-tailed tests introduce the possibility of Type III errors (getting the direction of the effect wrong), which one-tailed tests avoid by design.
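The sample size saving can be checked with the standard normal-approximation formula, under which the required sample per arm scales with (z_alpha + z_beta)^2. The sketch below is our own illustration, not the article's code; at alpha 0.05 and 80% power the saving works out to roughly 21%, with larger savings at other settings.

```python
from scipy.stats import norm

def n_scale(alpha: float, power: float, two_tailed: bool) -> float:
    """Relative sample size per arm: n is proportional to (z_alpha + z_beta)^2."""
    z_alpha = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) ** 2

alpha, power = 0.05, 0.80
two = n_scale(alpha, power, two_tailed=True)
one = n_scale(alpha, power, two_tailed=False)
print(f"one-tailed needs {1 - one / two:.0%} less traffic")  # ~21% at these settings
```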
Use Two-Sided Tests or Two One-Sided Tests (TOST) in A/B Testing
by Ron Kohavi
Read the full article
What this article covers
Kohavi argues that organizations sharing experiment results for learning or accountability should use two-sided tests or the TOST equivalence testing procedure. TOST provides a rigorous framework for declaring that a treatment has no meaningful effect, which standard null hypothesis testing cannot do.
Key takeaways
1. Two-sided tests protect against confirmation bias by treating both directions equally, making them appropriate when results are shared publicly or used for organizational learning.
2. TOST (Two One-Sided Tests) can rigorously declare "no meaningful effect," filling a gap that standard null hypothesis testing leaves open.
3. The choice between one-sided and two-sided/TOST depends on whether you need to convince others (use TOST/two-sided) or make internal go/no-go decisions (one-sided is acceptable).
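Here is a minimal sketch of the TOST logic for a difference in conversion rates, using a normal approximation. The counts and the equivalence margin are made-up illustrations, not Kohavi's example: equivalence is declared only when the difference is significantly above -margin and significantly below +margin.

```python
import numpy as np
from scipy.stats import norm

def tost_proportions(x_c, n_c, x_t, n_t, margin, alpha=0.05):
    """Two one-sided z-tests for equivalence of two proportions within +/- margin."""
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    # H0_lower: diff <= -margin   vs   H1: diff > -margin
    p_lower = norm.sf((diff + margin) / se)
    # H0_upper: diff >= +margin   vs   H1: diff < +margin
    p_upper = norm.cdf((diff - margin) / se)
    equivalent = (p_lower < alpha) and (p_upper < alpha)
    return diff, p_lower, p_upper, equivalent

# Made-up data: 50,000 users per arm, conversion rates ~4.00% vs ~4.05%,
# equivalence margin of 0.3 percentage points.
print(tost_proportions(x_c=2000, n_c=50_000, x_t=2025, n_t=50_000, margin=0.003))
```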
Underpowered A/B Tests: Confusions, Myths, and Reality
by Georgi Georgiev
Read the full article
What this article covers
Debunks three common myths about underpowered tests. Distinguishes between hypothetical effect sizes (used in planning), true effect sizes (unknown), and observed effect sizes (from data). Demonstrates that requiring post-hoc power verification of significant results leads to rejecting valid findings.
Key takeaways
1. A statistically significant result does not need retroactive power verification. If the null hypothesis is rejected at the planned alpha level, the result stands regardless of post-hoc power calculations.
2. Conflating observed lift with the minimum detectable effect leads to unnecessarily conservative decisions, sometimes making error control 50x more stringent than intended.
3. Proper power analysis belongs in the planning phase. Using it to evaluate completed tests is a category error that confuses planning parameters with inferential outcomes.
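One way to see why post-hoc ("observed") power adds nothing is that, for a simple z-test, it is a one-to-one function of the p-value. The sketch below is a generic statistical illustration rather than code from the article; note that a one-sided p-value of exactly 0.05 always maps to observed power of exactly 50%.

```python
from scipy.stats import norm

def observed_power_one_sided(p_value: float, alpha: float = 0.05) -> float:
    """Post-hoc power of a one-sided z-test, computed from nothing but the p-value.

    The observed z-score is norm.ppf(1 - p_value); plugging the observed effect
    back into the power formula gives P(Z > z_crit - z_obs).
    """
    z_obs = norm.ppf(1 - p_value)
    z_crit = norm.ppf(1 - alpha)
    return norm.sf(z_crit - z_obs)

for p in (0.001, 0.01, 0.05, 0.10, 0.30):
    print(f"p = {p:.3f} -> observed power = {observed_power_one_sided(p):.2f}")
# p = 0.05 always maps to observed power 0.50: the calculation adds no new information.
```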
Statistical Significance for Non-Binomial Metrics
by Georgi Georgiev
Read the full article
What this article covers
Explains how to test statistical significance for continuous metrics like revenue per user, average order value, and pages per session. The key insight is that the sample statistic distribution (not the raw data distribution) determines test validity, and the Central Limit Theorem makes standard t-tests appropriate even for highly skewed metrics.
Key takeaways
1. Non-normal data distributions do not invalidate t-tests. The Central Limit Theorem ensures the sampling distribution of means is approximately normal regardless of the underlying data shape.
2. Testing non-binomial metrics requires extracting user-level or session-level data to estimate variance, then applying standard t-test methodology with the pooled standard error of the mean.
3. For multiple metric comparisons, Dunnett's correction provides tighter confidence intervals than Bonferroni while still controlling the family-wise error rate.
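As a minimal sketch of the CLT point, the example below simulates heavily skewed revenue per user and applies Welch's t-test to the user-level values. The data and parameters are invented for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
n = 100_000  # users per arm

def simulate_revenue(n_users, conversion_rate, mean_order_value):
    """Heavily skewed revenue per user: most users spend 0, buyers follow an exponential."""
    converted = rng.random(n_users) < conversion_rate
    order_value = rng.exponential(scale=mean_order_value, size=n_users)
    return converted * order_value

control = simulate_revenue(n, conversion_rate=0.05, mean_order_value=60.0)
treatment = simulate_revenue(n, conversion_rate=0.05, mean_order_value=63.0)  # ~5% higher AOV

# Welch's t-test on user-level revenue; valid despite the skew because the
# sampling distribution of the mean is approximately normal at this sample size.
stat, p_value = ttest_ind(treatment, control, equal_var=False)
print(f"mean diff = {treatment.mean() - control.mean():.3f}, p = {p_value:.3f}")
```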
Interaction effects in online experimentation
by Lukas Vermeer
Read the full article
What this article covers
Distinguishes two types of overlapping experiment interactions: traffic interactions (where one experiment alters user flow to another's pages, creating sampling bias) and metric interactions (where combined effects differ from the sum of isolated effects). Argues that interaction detection is more valuable than blanket avoidance of overlapping tests.
Key takeaways
1. Not all overlapping experiments interact. Traffic interactions create sampling bias, while metric interactions produce genuinely different combined effects. The distinction matters for diagnosis and resolution.
2. Detected interactions often reveal functional conflicts degrading user experience, making them diagnostic signals rather than merely statistical nuisances.
3. Organizations should focus on detecting interactions rather than preventing all experiment overlap, because detection leads to new insights while avoidance limits experimentation velocity.
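Here is a minimal sketch of a metric-interaction check on simulated data: estimate experiment B's lift separately inside A's control and treatment groups, then test whether the two lifts differ. The assignment scheme and effect sizes are our own assumptions, not taken from the article.

```python
import numpy as np
from scipy.stats import norm

def lift_and_se(conversions, assignment_b):
    """Conversion-rate difference (B treatment minus B control) and its standard error."""
    treat = conversions[assignment_b == 1]
    ctrl = conversions[assignment_b == 0]
    p_t, p_c = treat.mean(), ctrl.mean()
    se = np.sqrt(p_t * (1 - p_t) / len(treat) + p_c * (1 - p_c) / len(ctrl))
    return p_t - p_c, se

rng = np.random.default_rng(11)
n = 200_000
a = rng.integers(0, 2, n)  # experiment A arm per user
b = rng.integers(0, 2, n)  # experiment B arm per user (independent assignment)

# Simulated conversions with an interaction: B helps on its own,
# but helps less when A's treatment is also shown.
rate = 0.050 + 0.004 * b + 0.004 * a - 0.006 * (a & b)
converted = rng.random(n) < rate

lift_in_a_ctrl, se_ctrl = lift_and_se(converted[a == 0], b[a == 0])
lift_in_a_treat, se_treat = lift_and_se(converted[a == 1], b[a == 1])

# z-test on the difference between the two lifts; a small p-value flags an interaction.
z = (lift_in_a_treat - lift_in_a_ctrl) / np.sqrt(se_ctrl**2 + se_treat**2)
p_value = 2 * norm.sf(abs(z))
print(f"B lift in A-control: {lift_in_a_ctrl:.4f}, in A-treatment: {lift_in_a_treat:.4f}, p = {p_value:.3f}")
```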
Is non-inferiority on par with superiority?
by Keith Goldfeld
Read the full article
What this article covers
Examines the statistical mechanics of non-inferiority testing compared to superiority testing. The core challenge is the non-inferiority margin (delta): unlike superiority tests that compare against zero, non-inferiority requires a subjective threshold that directly influences conclusions.
Key takeaways
1. Non-inferiority testing shifts the null hypothesis from "no difference" to "the new treatment is worse by at least delta," making the choice of delta the most consequential design decision.
2. The same data can yield different non-inferiority conclusions depending on the chosen margin, while superiority conclusions remain stable. This makes margin selection a point of vulnerability.
3. For making a compelling non-inferiority claim, principled delta selection based on clinical or business relevance matters more than simply increasing the sample size.
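A minimal sketch of how the margin drives the conclusion, using made-up conversion counts and a normal approximation: the same data clears a 1.0-percentage-point margin but fails a 0.3-point margin.

```python
import numpy as np
from scipy.stats import norm

def non_inferior(x_c, n_c, x_t, n_t, margin, alpha=0.05):
    """One-sided test of H0: the new variant is worse than control by at least `margin`."""
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = (diff + margin) / se          # H1: diff > -margin
    p_value = norm.sf(z)
    return diff, p_value, p_value < alpha

# Same made-up data (new variant converts slightly worse), two different margins.
data = dict(x_c=2100, n_c=40_000, x_t=2060, n_t=40_000)
print("margin = 1.0 pp:", non_inferior(**data, margin=0.010))  # non-inferior
print("margin = 0.3 pp:", non_inferior(**data, margin=0.003))  # not shown non-inferior
```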
Power and Sample Size Determination
by Lisa Sullivan, PhD
Read the full article
What this article covers
A comprehensive academic module covering the factors that determine required sample sizes for hypothesis testing. Walks through formulas for comparing means, proportions, and testing associations, emphasizing the interplay between significance level, power, effect size, and variability.
Key takeaways
1. Four factors determine sample size: significance level (alpha), desired power (1 minus beta), the minimum effect size of interest, and the variability of the outcome. Changing any one directly affects the required sample.
2. Power and sample size have a nonlinear relationship. Increasing power from 80% to 90% requires substantially more than a 12.5% increase in sample size.
3. Effect size estimation is the most difficult and most important input. Overly optimistic effect size assumptions lead to underpowered tests, while conservative estimates lead to unnecessarily long tests.
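The nonlinearity in the second takeaway can be checked with the textbook formula for comparing two means, n per group = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2. The inputs below are made up for illustration; going from 80% to 90% power raises the required sample by roughly 34%.

```python
from scipy.stats import norm

def n_per_group(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Required sample size per group for comparing two means (normal approximation):
    n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * (sigma**2) * (z_alpha + z_beta) ** 2 / delta**2

# Made-up inputs: detect a $1.50 difference in revenue per user with a standard deviation of $25.
n80 = n_per_group(sigma=25.0, delta=1.5, power=0.80)
n90 = n_per_group(sigma=25.0, delta=1.5, power=0.90)
print(f"80% power: {n80:,.0f} per group; 90% power: {n90:,.0f} per group "
      f"({n90 / n80 - 1:.0%} more, not 12.5%)")
```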
Put these ideas into practice
Use these free tools and guides alongside your reading to apply statistical concepts and run better experiments.
A/B Test Significance Calculator
Analyze your experiment results
Sample Size Calculator
Plan experiments with proper power analysis
A/B Testing Books
Essential reading on experimentation
A/B Testing Glossary
100+ experimentation terms explained
A/B Testing Courses
Structured learning paths for experimentation
A/B Testing Tools
Compare experimentation platforms