20 A/B Testing Interview Questions for 2026
Scenario-driven questions covering SRM, sequential testing, variance reduction (CUPED), guardrails, and modern experimentation practices. Real answers, practical drills, and what interviewers actually test for.
A/B testing interviews in 2026 demand more than memorizing p-value definitions. Interviewers expect you to design tests from scratch, debug flawed experiments, make ship/no-ship decisions under uncertainty, and communicate trade-offs to non-technical stakeholders. Whether you're interviewing for a data scientist, product manager, or experimentation analyst role, you need scenario-based judgment, not just textbook knowledge.
This guide covers the 20 questions that separate strong candidates from weak ones. Each question includes a one-minute answer, what the interviewer is really testing, common mistakes, a mini drill to practice, likely follow-ups, and when to escalate. After the questions, you'll find scenario drills to simulate real interview challenges and cheat sheets for formulas and red flags.
How to use this guide
Read each question → answer aloud without looking → do the mini drill → review follow-ups → check when to escalate. Practice the scenario drills at the end like real interview case studies. Use the cheat sheets as quick reference material.
Questions 1-5: Fundamentals
What makes an A/B test causal, and what breaks causality?
One-minute answer
An A/B test is causal because of proper randomization: users are randomly assigned to control or treatment, creating statistically equivalent groups. This means any observed difference in outcomes is attributable to the intervention, not pre-existing differences. Causality breaks when: (1) randomization fails (e.g., assignment based on browser, location), (2) sample ratio mismatch (SRM) indicates bucketing bugs, (3) interference between units (users in one group affect others), or (4) selection bias (users self-select into groups).
What the interviewer is testing
Understanding of causal inference fundamentals and ability to diagnose when experiments lose validity.
Common mistakes
- Confusing correlation with causation
- Not checking for SRM
- Ignoring network effects and interference
- Forgetting that time-based splits aren't randomized
Mini drill
Your test shows control: 50.1% of users, treatment: 49.9%. Is this a problem?
When to escalate / pause
Pause the test if SRM is detected (p < 0.01 on chi-square test). Results are not trustworthy.
Explain p-values vs confidence intervals (and what each does NOT mean)
One-minute answer
A p-value is the probability of seeing your data (or more extreme) if the null hypothesis were true. It does NOT tell you the probability that the null is true. A 95% confidence interval gives a range where the true effect likely falls. It does NOT mean there's a 95% chance the true value is in that range for THIS specific interval—it means 95% of such intervals would contain the true value. Common trap: p < 0.05 does not mean a 95% chance of being right.
What the interviewer is testing
Deep understanding of frequentist inference and ability to avoid common misinterpretations that plague stakeholders.
Common mistakes
- Saying 'p = 0.03 means 97% chance it's real'
- Treating 95% CI as a probability statement about this interval
- Forgetting that statistical significance ≠ practical significance
- Not reporting effect sizes alongside p-values
Mini drill
Your test shows p = 0.04 with CI [+0.2%, +3.1%]. Stakeholder asks: 'So there's a 96% chance this works?' How do you respond?
When to escalate / pause
If p-value is significant but CI overlaps with zero or your minimum effect of interest, investigate sample size and practical significance before shipping.
Type I/II errors, power, MDE: how do you pick targets in practice?
One-minute answer
Type I error (false positive, α) is rejecting a true null—typically set at 5%. Type II error (false negative, β) is missing a real effect—power = 1-β, typically 80%. MDE (Minimum Detectable Effect) is the smallest effect your test can reliably find. In practice: set α = 0.05 (or 0.01 for high-risk), power = 80%, then choose MDE based on business needs and traffic constraints. Halving the MDE roughly quadruples the required sample size (n scales with 1/MDE²).
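If the interviewer asks you to sanity-check the numbers, the standard two-proportion sample-size formula (the same one that appears in the cheat sheet at the end of this guide) is quick to code. This is a minimal Python sketch with made-up inputs, not a substitute for a proper power-analysis tool:

```python
from scipy.stats import norm

def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per variant for a two-sided test of two proportions:
    n = 2 * (z_{alpha/2} + z_{beta})^2 * p(1-p) / MDE^2, with p taken at baseline."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)           # power = 1 - beta
    p = baseline
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde_abs ** 2

# Illustrative inputs: 2% baseline conversion, 0.5pp absolute MDE.
print(f"{sample_size_per_variant(0.02, 0.005):,.0f} users per variant")  # ~12,300
```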
What the interviewer is testing
Practical understanding of the power analysis trade-off triangle and ability to make business-grounded test design decisions.
Common mistakes
- Confusing MDE with MEI (Minimum Effect of Interest)
- Not running power analysis before launch
- Using default settings without understanding trade-offs
- Ignoring that smaller MDE needs 4x traffic for 2x sensitivity
Mini drill
You have 50K weekly visitors and a 2% baseline conversion rate. Marketing wants to detect a 0.5% absolute lift. What do you say?
When to escalate / pause
Refuse to launch tests with <50% power unless it's explicitly an exploratory learning experiment with documented low-power caveat.
Practical vs statistical significance: how do you decide to ship?
One-minute answer
Statistical significance (p < 0.05) tells you an effect is likely real. Practical significance asks: is it large enough to matter? A test can be statistically significant but practically irrelevant (0.01% lift with millions of users) or vice versa (10% lift in an underpowered pilot). Decision rule: check if the entire confidence interval exceeds your Minimum Effect of Interest (MEI). If the lower bound of the CI is above MEI, ship. If it overlaps, consider trade-offs.
What the interviewer is testing
Ability to bridge statistics and business impact, and avoid shipping changes that are 'statistically significant but who cares.'
Common mistakes
- Shipping any p < 0.05 result without checking effect size
- Not defining MEI before the test
- Ignoring confidence intervals in decision-making
- Forgetting to check guardrail metrics
Mini drill
Test shows +1.2% conversion lift, p = 0.02, CI [+0.3%, +2.1%]. Your MEI is +1%. Ship?
When to escalate / pause
If guardrails show harm or lower CI bound is negative, do not ship even if average effect looks good.
What assumptions sit behind t-tests/z-tests, and how do you check them?
One-minute answer
Z-test/t-test assumptions: (1) independence of observations, (2) approximate normality of sampling distribution (Central Limit Theorem helps with large n), (3) equal variance (for two-sample tests). Check: (1) no carryover effects between users, (2) sample size >30 per group for CLT, (3) variance ratio <3:1, or use Welch's t-test for unequal variance. For conversion rates, binomial proportions test is more appropriate than t-test.
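If the metric is skewed, it helps to show you know the robust alternatives. A rough sketch (simulated data only) comparing Welch's t-test with a rank-based test using scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated revenue-per-user: mostly zeros with a heavy right tail (illustrative only).
control = rng.lognormal(2.0, 1.5, 5000) * rng.binomial(1, 0.1, 5000)
treatment = rng.lognormal(2.05, 1.5, 5000) * rng.binomial(1, 0.1, 5000)

# Welch's t-test drops the equal-variance assumption (still leans on the CLT).
t_stat, p_welch = stats.ttest_ind(treatment, control, equal_var=False)

# Mann-Whitney U is rank-based and robust to skew, but it tests a different
# hypothesis (stochastic ordering), not the difference in mean revenue.
u_stat, p_mw = stats.mannwhitneyu(treatment, control, alternative="two-sided")

print(f"Welch p = {p_welch:.3f}, Mann-Whitney p = {p_mw:.3f}")
```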
What the interviewer is testing
Technical depth: knowing when standard tests apply and when to reach for alternatives.
Common mistakes
- Applying t-test to heavily skewed metrics (revenue)
- Not accounting for user-level clustering
- Ignoring variance differences between groups
- Using t-test for conversion rates instead of proportions test
Mini drill
You're testing a pricing change. Revenue per user is highly skewed (long tail). Standard t-test shows p = 0.09. What do you do?
When to escalate / pause
If metric is heavily skewed or has frequent zeros, flag that standard tests may be unreliable and propose robust alternatives.
Questions 6-10: Test Design
Choosing metrics: north star vs guardrails; what's a good guardrail strategy?
One-minute answer
North star metric aligns to long-term business goal (revenue, retention, active users). Primary metric for the test is the most sensitive proxy you can measure in the test window. Guardrails protect against unintended harm: track 3-5 counter metrics like engagement, quality, latency, or downstream conversion. Use non-inferiority testing for guardrails—don't just check 'not significant,' prove they're not worse by >X%.
What the interviewer is testing
Product sense and understanding of metric trade-offs, plus ability to design defensive experiments.
Common mistakes
- Optimizing a local metric that doesn't ladder up to business goals
- Not defining guardrails upfront
- Treating 'not statistically significant harm' as proof of no harm
- Picking too many primary metrics (multiple testing problem)
Mini drill
You're testing an aggressive paywall. Primary metric: subscriptions. What guardrails would you choose and why?
When to escalate / pause
If guardrails show statistically significant harm beyond your non-inferiority margin, halt or iterate even if primary metric looks good.
Randomization unit (user vs session vs device) and interference/spillovers
One-minute answer
Randomization unit depends on the intervention and metric. User-level for persistent changes (UI redesign, pricing). Session-level for ephemeral tests (search ranking). Device-level if users share devices. Interference happens when treatment affects control: network effects (social features), two-sided markets (Uber drivers see both rider groups), or SEO (organic traffic shifts). Violates SUTVA; requires cluster randomization or switchback designs.
What the interviewer is testing
Experimental design sophistication and awareness of when standard A/B tests break down.
Common mistakes
- Using session-level randomization for features that need consistency
- Ignoring spillover effects in marketplace or social products
- Not recognizing when network effects invalidate user-level tests
- Forgetting cookie deletion and device switching
Mini drill
You're testing a new driver incentive in a rideshare app. User-level randomization shows +8% driver hours. Is this causal?
When to escalate / pause
If you suspect spillover effects (marketplace, social, shared resources), flag that standard A/B tests may overestimate impact and propose alternative designs.
Stratification and balance: when do you stratify, and on what?
One-minute answer
Stratified randomization ensures balance on important covariates by randomizing within strata (e.g., new vs returning users). Reduces variance and protects against bad randomization luck in small samples. Stratify on: (1) high-correlation predictors of your outcome, (2) variables you'll analyze by (segments), (3) time (day of week). Don't stratify on post-treatment variables. Check balance with covariate balance tests; imbalance >5% on key vars is a red flag.
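A minimal sketch of stratified assignment plus a post-randomization balance check, assuming a simple two-segment user table (all data here is simulated):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative user table; stratify on a strong predictor of the outcome.
users = pd.DataFrame({
    "user_id": np.arange(10_000),
    "segment": rng.choice(["new", "returning"], size=10_000, p=[0.8, 0.2]),
})

# Randomize 50/50 *within* each stratum so the arms stay balanced on segment.
users["arm"] = "control"
for segment, idx in users.groupby("segment").groups.items():
    shuffled = rng.permutation(np.asarray(idx))
    users.loc[shuffled[: len(shuffled) // 2], "arm"] = "treatment"

# Balance check: the segment mix should be nearly identical across arms.
print(users.groupby("arm")["segment"].value_counts(normalize=True))
```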
What the interviewer is testing
Understanding of variance reduction techniques and how to improve experimental precision without increasing sample size.
Common mistakes
- Not stratifying when you have strong predictors available
- Stratifying on too many variables (creating tiny strata)
- Forgetting to check balance post-randomization
- Not adjusting analysis for stratification variables (losing efficiency gains)
Mini drill
You have 10K users, 80% new, 20% returning. Returning users convert 5x higher. You randomize and get 51% returning in treatment, 49% in control. Problem?
When to escalate / pause
If balance check shows >5% difference in key predictors, adjust analysis or re-randomize if still early in test.
Variance reduction: what is CUPED, when does it help, and what can go wrong?
One-minute answer
CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance by adjusting for pre-treatment covariates, making tests more sensitive. Increases power 10-50% depending on correlation. Works by subtracting predicted outcome based on pre-period behavior. Helps most when: (1) strong correlation between pre/post metrics, (2) large samples. Can go wrong: (1) using post-treatment covariates (biased), (2) overfitting adjustment model, (3) computation errors.
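The core adjustment is only a few lines. This sketch uses simulated data and a pooled theta to show the mechanics; it is not a production implementation:

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), where x is a pre-experiment
    covariate (often the same metric in the pre-period) and
    theta = cov(x, y) / var(x), estimated on pooled data from both arms."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Simulated example: pre-period metric correlated with the in-experiment metric.
rng = np.random.default_rng(0)
pre = rng.normal(10, 3, size=20_000)
post = 0.6 * pre + rng.normal(0, 3, size=20_000)

adjusted = cuped_adjust(post, pre)
print("variance reduction:", 1 - adjusted.var(ddof=1) / post.var(ddof=1))  # about corr^2
```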
What the interviewer is testing
Knowledge of modern variance reduction techniques used at scale-up companies (Airbnb, Netflix, Uber).
Common mistakes
- Using post-treatment variables in CUPED (e.g., conditioning on post-treatment engagement)
- Not validating that CUPED reduced variance in practice
- Forgetting that CUPED requires pre-period data
- Not explaining the method to stakeholders (trust issues)
Mini drill
Your test is underpowered (50% power). Can CUPED help? Your metric is purchase conversion; you have 30 days of pre-period conversion data with 0.6 correlation.
When to escalate / pause
If implementing CUPED for the first time, validate on A/A tests and past experiments to ensure no bias is introduced.
Duration planning: ramp-ups, seasonality, novelty effects, and why 'run 1 week' is often wrong
One-minute answer
Test duration needs: (1) sufficient sample size (power analysis), (2) full business cycles (1-2 weeks to smooth day-of-week effects), (3) time for novelty to wear off (users react differently to new things initially), (4) avoidance of seasonality. Ramp-up (5% → 50%) helps catch bugs but delays learning. Novelty effects mean week 1 ≠ steady state. Always run at least 1-2 full weeks; for behavior changes, 2-4 weeks is safer.
What the interviewer is testing
Practical judgment about test validity over time and awareness of temporal threats to validity.
Common mistakes
- Stopping at 1 week because you hit significance
- Not accounting for day-of-week effects
- Ignoring novelty effects (especially for UI changes)
- Running during holiday periods without adjustment
Mini drill
Your redesign shows +12% conversion after 3 days with p < 0.01. Product wants to ship Monday. What do you say?
When to escalate / pause
If week 1 and week 2+ show opposite directions, suspect novelty effect. Run longer or use week 2+ data only.
Questions 11-15: Analysis & Interpretation
Sample Ratio Mismatch (SRM): what it is, how you detect it, and what you do next
One-minute answer
SRM occurs when your observed traffic split differs from expected (e.g., expecting 50/50, getting 48/52). Indicates randomization bugs, bot traffic, or tracking failures. Detect: chi-square test on user counts (p < 0.01 = problem). Do NOT trust results if SRM exists. Next steps: (1) investigate bucketing logic, (2) check for browser/device differences, (3) look for bots, (4) verify tracking fires equally. Fix and re-run; don't try to 'adjust' for SRM.
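The check itself is a one-line chi-square test; the counts below are illustrative:

```python
from scipy.stats import chisquare

# Observed user counts per arm vs the expected 50/50 split.
observed = [248_000, 252_000]
expected = [sum(observed) / 2, sum(observed) / 2]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.2e}")
# A common convention: p < 0.01 means SRM; stop the analysis and debug bucketing.
```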
What the interviewer is testing
Operational rigor and ability to recognize when an experiment is fundamentally broken before investing in interpretation.
Common mistakes
- Ignoring small SRM because 'it's close to 50/50'
- Trying to adjust for SRM in analysis instead of fixing root cause
- Not checking SRM routinely
- Shipping results despite SRM
Mini drill
You have 500K users: 248K control, 252K treatment. Is this SRM? If so, how bad?
When to escalate / pause
If SRM is detected, immediately halt test analysis, investigate root cause, fix, and relaunch. Do not proceed with analysis.
Multiple comparisons: many metrics / many segments / many experiments—how do you control false positives?
One-minute answer
Testing multiple hypotheses inflates false positive rate. With 20 comparisons at α=0.05, you expect one false positive. Solutions: (1) pre-specify one primary metric (others are secondary), (2) Bonferroni correction (divide α by # comparisons—conservative), (3) Holm-Bonferroni (less conservative), (4) Benjamini-Hochberg FDR control (allows some false positives, controls rate), (5) hierarchical testing (test primary first; if sig, test secondaries).
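With statsmodels, the standard corrections are one call; the p-values below are made up to show how a lone p = 0.04 among five metrics does not survive correction:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.04, 0.21, 0.33, 0.48, 0.72])  # five secondary metrics (illustrative)

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, p_adjusted.round(3), reject)
# Under any of these corrections, the 0.04 result no longer clears alpha = 0.05.
```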
What the interviewer is testing
Statistical sophistication in managing experiment portfolios and avoiding false discovery.
Common mistakes
- Testing 20 metrics and highlighting the 'winner' without correction
- Not pre-specifying primary metric
- Using Bonferroni when it's too conservative (low power)
- Slicing into 50 segments post-hoc and reporting significant ones
Mini drill
You test 5 metrics: conversion, revenue, engagement, retention, NPS. One shows p=0.04, others >0.2. Can you claim a win?
When to escalate / pause
If stakeholders are cherry-picking significant metrics post-hoc, educate on multiple comparisons and require pre-specified primary metric for future tests.
Peeking and sequential testing: why repeated checks inflate false alarms, and what methods let you monitor safely
One-minute answer
Peeking (checking results repeatedly and stopping when significant) inflates false positive rate from 5% to 20-40%. Problem: each peek is a new opportunity to see random noise. Sequential testing methods allow continuous monitoring: (1) Sequential Probability Ratio Test (SPRT), (2) group sequential designs with alpha spending, (3) Bayesian approaches, (4) always-valid inference. These maintain valid error rates but require different math and boundaries.
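An A/A simulation makes the inflation concrete. This sketch (simulated data, up to 14 daily peeks) is illustrative; the exact inflation depends on how often you look and when:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_days, users_per_day = 1000, 14, 1000
false_positives = 0

for _ in range(n_sims):
    # A/A test: both arms drawn from the same distribution, so the true effect is zero.
    a = rng.normal(0, 1, size=(n_days, users_per_day))
    b = rng.normal(0, 1, size=(n_days, users_per_day))
    for day in range(1, n_days + 1):
        # "Peek": test all data accumulated so far; stop at the first p < 0.05.
        _, p = stats.ttest_ind(a[:day].ravel(), b[:day].ravel())
        if p < 0.05:
            false_positives += 1
            break

print(f"false positive rate with daily peeking: {false_positives / n_sims:.1%}")
# Typically lands around 20%, far above the nominal 5%.
```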
What the interviewer is testing
Deep understanding of optional stopping problem and knowledge of modern solutions used at data-driven companies.
Common mistakes
- Peeking at results daily and stopping when p<0.05
- Not pre-committing to sample size or stopping rule
- Using sequential methods without understanding alpha spending
- Claiming 'we only looked once' when everyone peeked
Mini drill
Your experimentation platform shows a real-time p-value that crosses 0.05 on day 3 of a planned 14-day test. What do you do?
When to escalate / pause
If your org has a culture of daily peeking, propose implementing sequential testing framework or strict 'no peeking' norms with blinded dashboards.
Ratio metrics (CTR, conversion rate, revenue/user): pitfalls, variance estimation choices, and when to bootstrap
One-minute answer
Ratio metrics (numerator/denominator) have complex variance: variance of ratio ≠ ratio of variances. Delta method or bootstrap needed for CIs. Pitfalls: (1) correlation between numerator and denominator, (2) heavy skew, (3) small denominators (instability). Bootstrap resamples your data to estimate CI non-parametrically—good for skewed data, small samples, or complex metrics. Use when standard formulas break down.
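A percentile bootstrap for a skewed per-user metric can be sketched as below; the data is simulated and the helper name is illustrative, not a library function:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated skewed revenue-per-user: most users spend nothing, a few spend a lot.
control = rng.pareto(2.5, size=20_000) * rng.binomial(1, 0.05, 20_000) * 50
treatment = rng.pareto(2.5, size=20_000) * rng.binomial(1, 0.05, 20_000) * 52

def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the difference in means (treatment - control)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(b, size=b.size, replace=True).mean()
                    - rng.choice(a, size=a.size, replace=True).mean())
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

low, high = bootstrap_diff_ci(control, treatment)
print(f"95% bootstrap CI for the lift: [{low:.3f}, {high:.3f}]")
```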
What the interviewer is testing
Technical depth on metric construction and awareness of when simple methods fail.
Common mistakes
- Using normal approximation for revenue/user (heavy skew)
- Ignoring correlation structure in ratio metrics
- Not capping extreme values in revenue metrics
- Using wrong denominator (impressions vs users vs sessions)
Mini drill
You're testing a pricing change. Revenue/user is very skewed (1% of users = 80% of revenue). Standard t-test shows p=0.12. What do you do?
When to escalate / pause
If your metric is a ratio with heavy skew and standard tests give borderline results, propose robust alternatives before making ship decision.
Heterogeneous treatment effects: when you can trust segment lifts, and how you avoid 'storytime analytics'
One-minute answer
HTE means treatment effect varies by subgroup (mobile vs desktop, new vs returning). Pre-specified HTE analysis is valid; post-hoc hunting for 'who it worked for' causes false discovery (many segments × multiple testing). Trust segment lifts when: (1) pre-specified in hypothesis, (2) powered for subgroup analysis, (3) interaction test is significant. Avoid storytime: don't slice 50 ways and report the one significant segment without correction.
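The interaction test can be run as a regression with a treatment-by-segment term. This is a sketch on simulated data using a linear probability model with robust standard errors; it is one reasonable approach, not the only one:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 20_000

# Simulated data where the treatment only helps mobile users (illustrative).
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "mobile": rng.integers(0, 2, n),
})
p_convert = 0.10 + 0.02 * df["mobile"] + 0.01 * df["treated"] * df["mobile"]
df["converted"] = rng.binomial(1, p_convert)

# The interaction coefficient tests whether the treatment effect differs by segment,
# which is the right question, rather than "p < 0.05 in one segment but not the other".
model = smf.ols("converted ~ treated * mobile", data=df).fit(cov_type="HC1")
print(model.summary().tables[1])
```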
What the interviewer is testing
Discipline in distinguishing confirmatory analysis from exploratory data mining, and understanding of interaction effects.
Common mistakes
- Slicing by every available dimension and reporting 'it worked for mobile users in California'
- Not testing for interaction effects (just comparing p-values across segments)
- Not adjusting for multiple comparisons in subgroup analysis
- Claiming causality for exploratory findings
Mini drill
Your test shows no overall effect (p=0.6), but when you slice by device, mobile shows +8% (p=0.03). Can you ship mobile-only?
When to escalate / pause
If you find strong subgroup effects post-hoc, label as exploratory and require replication before shipping segment-specific changes.
Questions 16-20: Advanced Topics
Non-inferiority/equivalence tests for guardrails: how to prove 'not worse' rather than 'better'
One-minute answer
Non-inferiority tests prove a new treatment is not meaningfully worse than control. Set a non-inferiority margin (δ): the maximum acceptable decline. Conclude non-inferiority if the confidence interval for the difference lies entirely above -δ. Equivalence testing is two-sided: prove the effect is within [-δ, +δ]. Used for guardrails when simplifying features or switching vendors. Because the test is one-sided, it typically needs somewhat less sample than a two-sided superiority test powered for the same margin.
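The decision rule reduces to comparing a one-sided lower confidence bound against -δ. A minimal sketch with made-up numbers (the helper name is illustrative), framed for a metric where a decrease would be harmful:

```python
from scipy.stats import norm

def non_inferior(diff, se, margin, alpha=0.05):
    """Non-inferiority check for a 'higher is better' metric: conclude non-inferior
    if the one-sided (1 - alpha) lower bound on (treatment - control) exceeds -margin."""
    lower_bound = diff - norm.ppf(1 - alpha) * se
    return lower_bound, lower_bound > -margin

# Illustrative numbers: observed -0.1pp difference, SE of 0.15pp, margin of 0.5pp.
lb, ok = non_inferior(diff=-0.001, se=0.0015, margin=0.005)
print(f"one-sided lower bound = {lb:.4f}, non-inferior: {ok}")
```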
What the interviewer is testing
Knowledge of alternative testing frameworks appropriate for defensive or guardrail metrics.
Common mistakes
- Using standard null hypothesis test and concluding 'not significant = not worse'
- Not pre-specifying the non-inferiority margin
- Setting margin too wide (anything goes) or too narrow (impossible to prove)
- Confusing non-inferiority with equivalence
Mini drill
You're migrating to a cheaper CDN. Page load time is a guardrail. Current avg: 2.0s. Acceptable degradation: +0.2s. After test: new CDN = 2.1s, 95% CI [-0.05s, +0.25s]. Ship?
When to escalate / pause
If rolling out a cost-saving or simplification change, require non-inferiority tests on key guardrails—don't rely on 'not statistically significant harm.'
Bayesian vs frequentist A/B: decision rules you'd actually use (and how you'd explain them)
One-minute answer
Frequentist: p-values, fixed sample size, binary decisions (reject/fail to reject). Bayesian: posterior probability that B beats A, credible intervals, allows sequential stopping without penalty. Bayesian decision rule: ship if P(B > A) > 95% AND expected loss if wrong < $X. Easier to explain to stakeholders ('87% chance of winning') but requires specifying prior beliefs. Both are valid; choice depends on organizational culture and infrastructure.
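For conversion rates, the Beta-Binomial posterior makes these quantities easy to simulate. The counts and the flat Beta(1, 1) prior below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(11)

# Illustrative conversion counts for each arm.
conv_a, n_a = 1_150, 50_000
conv_b, n_b = 1_230, 50_000

# Posterior draws of each arm's conversion rate under a Beta(1, 1) prior.
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

prob_b_beats_a = (post_b > post_a).mean()
# Expected loss (in conversion-rate points) if we ship B but A is actually better.
expected_loss = np.maximum(post_a - post_b, 0).mean()

print(f"P(B > A) = {prob_b_beats_a:.1%}, expected loss = {expected_loss:.5f}")
```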
What the interviewer is testing
Conceptual understanding of both paradigms and pragmatic judgment about when each is more appropriate.
Common mistakes
- Claiming Bayesian 'solves' the peeking problem (it shifts the burden to prior and stopping rule)
- Using uninformative priors and claiming they're objective
- Not explaining prior choice to stakeholders
- Mixing frequentist and Bayesian inference (e.g., Bayesian CI with frequentist α)
Mini drill
Your Bayesian test shows P(B > A) = 94%, expected lift = +2.1%, expected loss if wrong = $500. Your prior was flat. Ship?
When to escalate / pause
If stakeholders are confused by p-values and CIs, consider Bayesian reporting ('X% chance of improvement') but ensure proper infrastructure and prior specification.
Cluster/geo experiments: why independence fails and how you'd design around it
One-minute answer
Cluster experiments randomize groups of users (cities, schools, time periods) instead of individuals. Needed when individual randomization causes spillover (marketplace effects, cannibalization, infrastructure changes). Problem: observations within clusters are correlated, violating independence. Solution: (1) randomize enough clusters (>20-30), (2) use cluster-robust standard errors, (3) account for cluster size in analysis. Power is driven by number of clusters, not individuals.
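To see why the clustering matters at analysis time, compare naive and cluster-robust standard errors on simulated geo data; this sketch uses statsmodels and made-up cluster effects:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n_clusters, users_per_cluster = 40, 500

# Simulated geo experiment: outcomes share a within-cluster shock, violating independence.
cluster = np.repeat(np.arange(n_clusters), users_per_cluster)
treated = (cluster % 2).astype(float)                    # half the clusters treated
cluster_shock = rng.normal(0, 1, n_clusters)[cluster]    # shared within each cluster
y = 0.1 * treated + cluster_shock + rng.normal(0, 1, cluster.size)
df = pd.DataFrame({"y": y, "treated": treated, "cluster": cluster})

naive = smf.ols("y ~ treated", data=df).fit()
robust = smf.ols("y ~ treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["cluster"]}
)
# The cluster-robust SE is much larger; the naive SE badly overstates precision.
print(f"naive SE: {naive.bse['treated']:.4f}, cluster-robust SE: {robust.bse['treated']:.4f}")
```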
What the interviewer is testing
Sophisticated experimental design for settings where standard A/B tests don't work.
Common mistakes
- Randomizing a few large clusters (e.g., 5 cities) and claiming valid inference
- Not accounting for clustering in analysis (standard errors too small)
- Ignoring that power depends on # clusters, not # users
- Confusing cluster randomization with stratified randomization
Mini drill
You're testing a TV ad campaign's effect on app installs. You randomize 8 DMAs (4 treatment, 4 control). 10M people total. Is this well-powered?
When to escalate / pause
If you have <15-20 clusters, flag low power and propose alternative designs (synthetic control, panel regression, or wait for more clusters).
Overlapping experiments and interaction risk: how platforms and analysts mitigate collisions
One-minute answer
When running multiple concurrent tests, users may be in several experiments at once. Risk: interactions between experiments (Test A affects Test B's results). Mitigation: (1) orthogonal bucketing (independent randomization), (2) interaction detection (compare test results between users in 0, 1, or 2+ tests), (3) reserving % of traffic for single-test users, (4) experiment monitoring for collisions. Platforms like Google and Meta run thousands of concurrent tests with careful orchestration.
What the interviewer is testing
Operational sophistication and understanding of scaled experimentation programs.
Common mistakes
- Assuming orthogonal randomization eliminates all interaction risk (it doesn't)
- Not monitoring for interactions between high-traffic tests
- Running related tests (e.g., two homepage tests) concurrently without coordination
- Not documenting overlapping tests
Mini drill
You're running Test A (homepage CTA color) and Test B (pricing page headline). User X is in both. Test A shows +5% lift, B shows +3%. Can you estimate the combined effect as +8%?
When to escalate / pause
If two related experiments (same surface area, same metric) are running concurrently and show unexpected results, check for interaction before shipping either.
Program-level impact: how you estimate experimentation returns over many launches, and why holdouts matter
One-minute answer
Not all significant tests hold up post-launch. Selection bias (only winners ship), regression to mean, and interaction effects mean sum of test lifts > actual combined lift. Holdout: keep a small % of users (1-5%) in a 'no experiments' bucket long-term. Compare holdout vs fully experimented users to estimate true program-level impact. Quantifies experimentation ROI and catches when tests mislead. Companies like Netflix and Spotify use this to validate their experimentation systems.
What the interviewer is testing
Strategic thinking about experimentation programs as a whole, not just individual tests.
Common mistakes
- Summing all test lifts and claiming that's the total value generated
- Not running holdouts (can't validate program impact)
- Making holdout too small (<1%) or too large (>10%)
- Not accounting for opportunity cost of holdout
Mini drill
You ran 50 tests last year, 30 showed significant positive effects (avg +2% conversion each). You sum to +60% total lift. Your actual annual conversion growth: +8%. What happened?
When to escalate / pause
If your org claims 'experiments drove 50% growth' but has no holdout validation, propose implementing a long-term holdout to measure true program-level impact.
Scenario Drills: Practice Like Real Interviews
Interviews test your ability to work through ambiguous, multi-step problems under time pressure. These scenario drills simulate real interview questions. Try answering each one aloud before reading the ideal answer.
Design: Pricing page test
You're a PM at a SaaS company. The pricing page shows 3 tiers: Free, Pro ($29/mo), Enterprise (custom). Hypothesis: Adding a mid-tier 'Team' plan at $19/mo will increase overall revenue. Design the A/B test: units, randomization, primary metric, guardrails, MDE, sample size.
Debug: Why did my test fail?
Your homepage redesign test ran for 2 weeks. Treatment shows -2% conversion (p=0.08, not significant), but your hypothesis was strong and user research supported the change. Week 1 showed +5%, week 2 showed -8%. What might explain this? How do you debug?
Decide: Conflicting signals
Your aggressive onboarding flow test shows: Primary metric (activation): +12% (p<0.001), Guardrail 1 (time to first action): -18% (p<0.001, faster is good), Guardrail 2 (day 7 retention): -4% (p=0.06, not significant but trending negative). Cost to implement: 2 eng-weeks. Do you ship? Why or why not?
Communicate: Explain to stakeholders
Your CEO sees that a test hit statistical significance after 3 days and wants to ship immediately to 'capture the win before competitors copy.' The test was planned for 14 days. You know peeking inflates false positives. How do you explain this to a non-technical executive in 30 seconds?
Design: Marketplace experiment
You're at a rideshare company testing a new driver surge pricing algorithm. Standard user-level A/B test isn't appropriate because drivers and riders interact across groups. How would you design this experiment?
Analyze: Segment deep-dive
Your test shows no overall effect (p=0.5), but post-hoc you notice: Mobile iOS: +15% (p=0.02), Mobile Android: +2% (p=0.7), Desktop: -8% (p=0.04). What do you conclude? Can you ship iOS-only?
Debug: Sample Ratio Mismatch
Your experiment shows 53% of users in treatment, 47% in control (n=100K total). Conversion rate shows treatment +8% (p=0.01). Do you trust this result? What do you check?
Decide: Underpowered result
You ran a test for 2 weeks (all the traffic you could get). Result: +3.2% lift, p=0.12 (not significant), 95% CI [-0.8%, +7.2%]. Your MEI is +2%. Implementing costs 1 eng-week. Ship or not?
Cheat Sheets: Quick Reference
Red Flags: When to Pause or Reject Results
- SRM detected (chi-square p < 0.01 on the traffic split): halt analysis, fix bucketing, relaunch.
- Guardrail harm beyond the non-inferiority margin: do not ship, even if the primary metric wins.
- Significant result found by daily peeking without a sequential testing framework.
- Power below ~50% for the stated MDE, unless the test is explicitly labeled exploratory.
- Week 1 and week 2+ trending in opposite directions (likely novelty effect).
- Suspected spillover between arms (marketplace, social, shared resources).
- Post-hoc segment wins without pre-specification or multiple-comparison correction.
Formula Cheat Sheet
Statistical Significance (Z-test for proportions)
z = (p₁ - p₂) / √[p(1-p)(1/n₁ + 1/n₂)], where p is the pooled conversion rate
When: Comparing two conversion rates with large samples
Interpret: If |z| > 1.96, reject null at α=0.05 (two-tailed)
Sample Size (per variant)
n = 2(Z_α/2 + Z_β)² × p(1-p) / MDE²
When: Planning test before launch (power analysis)
Interpret: Halving the MDE roughly quadruples the required traffic (n ∝ 1/MDE²)
Confidence Interval for difference in proportions
CI = (p₁ - p₂) ± Z × √[p₁(1-p₁)/n₁ + p₂(1-p₂)/n₂]
When: Estimating uncertainty around lift size
Interpret: If CI excludes 0, effect is statistically significant
Minimum Detectable Effect (MDE)
MDE = (Z_α/2 + Z_β) × √[2p(1-p)/n]
When: Understanding sensitivity of your test design
Interpret: The smallest true effect you have a good chance (your chosen power) of detecting
Sample Ratio Mismatch (Chi-square)
χ² = Σ(observed - expected)² / expected
When: Checking if traffic split matches intended randomization
Interpret: If p < 0.01, reject null—you have SRM, investigate
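To tie the cheat-sheet formulas together, here is a small sketch that computes the z-statistic, the CI for the difference, and the SRM check on one set of made-up results:

```python
from math import sqrt
from scipy.stats import norm, chisquare

# Illustrative two-proportion results.
x1, n1 = 1_000, 50_000   # control conversions, users
x2, n2 = 1_100, 50_000   # treatment conversions, users
p1, p2 = x1 / n1, x2 / n2

# Z-test for two proportions (pooled variance).
p_pool = (x1 + x2) / (n1 + n2)
z = (p2 - p1) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_value = 2 * norm.sf(abs(z))

# 95% CI for the difference (unpooled variance).
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (p2 - p1 - 1.96 * se, p2 - p1 + 1.96 * se)

# SRM check on the user counts.
_, srm_p = chisquare([n1, n2], f_exp=[(n1 + n2) / 2] * 2)

print(f"z = {z:.2f}, p = {p_value:.4f}, CI = ({ci[0]:.4%}, {ci[1]:.4%}), SRM p = {srm_p:.2f}")
```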
Key Takeaways for Interview Success
Think like a practitioner, not a textbook
Interviewers want to see practical judgment: 'When would you escalate?', 'How would you debug this?', 'What would you check first?' Study the red flags and scenario drills, not just formulas.
Master the modern topics
SRM, sequential testing, CUPED, guardrails, and holdouts separate 2026 candidates from those stuck in 2015. If you don't know these, you'll struggle at scale-up companies (Airbnb, Netflix, Uber, Stripe).
Bridge stats and business
Every technical answer should connect to impact: 'Why does this matter?' Practice explaining p-values, confidence intervals, and power to non-technical stakeholders in under 30 seconds.
Know when you don't know
If asked about cluster randomization or CUPED and you're unfamiliar, say so and show how you'd learn: 'I haven't used CUPED, but I understand variance reduction concepts. I'd start with...' Honesty > BS.
References
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. The definitive reference for modern experimentation platforms and practices.
- Fabijan, A., et al. (2019). Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners. Proceedings of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
- Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM). CUPED introduction and variance reduction techniques.
- Johari, R., Li, L., Weintraub, G., & Ramdas, A. (2022). Sequential Testing for A/B Tests. arXiv:2206.09090. Modern sequential testing methods that allow peeking without inflating error rates.
- Demyr, S. (2024). Estimating the Returns to Experimentation: Evidence from Holdback Tests. Program-level impact and holdout methodology.
- Spotify Engineering. (2020). Guardrail Metrics: How to Protect Your Experiments from Hidden Harm. Practical guidance on counter metrics and non-inferiority testing.
- Statsig Blog. (2024). Introducing Stratified Sampling. Modern implementation of stratification in experimentation platforms.
- Nubank Engineering. (2022). 3 Lessons from Implementing CUPED at Nubank. Real-world variance reduction case study.