20 A/B Testing Interview Questions for 2026
Scenario-driven questions covering SRM, sequential testing, variance reduction (CUPED), guardrails, and modern experimentation practices. Real answers, practical drills, and what interviewers actually test for.
A/B testing interviews in 2026 demand more than memorizing p-value definitions. Interviewers expect you to design tests from scratch, debug flawed experiments, make ship/no-ship decisions under uncertainty, and communicate trade-offs to non-technical stakeholders. Whether you're interviewing for a data scientist, product manager, or experimentation analyst role, you need scenario-based judgment, not just textbook knowledge.
This guide covers the 20 questions that separate strong candidates from weak ones. Each question includes a one-minute answer, what the interviewer is really testing, common mistakes, a mini drill to practice, likely follow-ups, and when to escalate. After the questions, you'll find scenario drills to simulate real interview challenges and cheat sheets for formulas and red flags.
How to use this guide
Read each question → answer aloud without looking → do the mini drill → review follow-ups → check when to escalate. Practice the scenario drills at the end like real interview case studies. Use the cheat sheets as quick reference material.
Questions 1-5: Fundamentals
What makes an A/B test causal, and what breaks causality?
One-minute answer
An A/B test is causal because of proper randomization: users are randomly assigned to control or treatment, creating statistically equivalent groups. This means any observed difference in outcomes is attributable to the intervention, not pre-existing differences. Causality breaks when: (1) randomization fails (e.g., assignment based on browser, location), (2) sample ratio mismatch (SRM) indicates bucketing bugs, (3) interference between units (users in one group affect others), or (4) selection bias (users self-select into groups).
What the interviewer is testing
Understanding of causal inference fundamentals and ability to diagnose when experiments lose validity.
Common mistakes
- Confusing correlation with causation
- Not checking for SRM
- Ignoring network effects and interference
- Forgetting that time-based splits aren't randomized
Mini drill
Your test shows control: 50.1% of users, treatment: 49.9%. Is this a problem?
When to escalate / pause
Pause the test if SRM is detected (p < 0.01 on chi-square test). Results are not trustworthy.
Explain p-values vs confidence intervals (and what each does NOT mean)
One-minute answer
A p-value is the probability of seeing your data (or more extreme) if the null hypothesis were true. It does NOT tell you the probability that the null is true. A 95% confidence interval gives a range where the true effect likely falls. It does NOT mean there's a 95% chance the true value is in that range for THIS specific interval—it means 95% of such intervals would contain the true value. Common trap: p < 0.05 does not mean a 95% chance of being right.
What the interviewer is testing
Deep understanding of frequentist inference and ability to avoid common misinterpretations that plague stakeholders.
Common mistakes
- Saying 'p = 0.03 means 97% chance it's real'
- Treating 95% CI as a probability statement about this interval
- Forgetting that statistical significance ≠ practical significance
- Not reporting effect sizes alongside p-values
Mini drill
Your test shows p = 0.04 with CI [+0.2%, +3.1%]. Stakeholder asks: 'So there's a 96% chance this works?' How do you respond?
When to escalate / pause
If p-value is significant but CI overlaps with zero or your minimum effect of interest, investigate sample size and practical significance before shipping.
Type I/II errors, power, MDE: how do you pick targets in practice?
One-minute answer
Type I error (false positive, α) is rejecting a true null—typically set at 5%. Type II error (false negative, β) is missing a real effect—power = 1-β, typically 80%. MDE (Minimum Detectable Effect) is the smallest effect your test can reliably find. In practice: set α = 0.05 (or 0.01 for high-risk), power = 80%, then choose MDE based on business needs and traffic constraints. Halving the MDE roughly quadruples the required sample size (n scales with 1/MDE²).
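If the interviewer asks you to sanity-check the numbers, the standard two-proportion sample-size formula (the same one that appears in the cheat sheet at the end of this guide) is quick to code. This is a minimal Python sketch with made-up inputs, not a substitute for a proper power-analysis tool:

```python
from scipy.stats import norm

def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per variant for a two-sided test of two proportions:
    n = 2 * (z_{alpha/2} + z_{beta})^2 * p(1-p) / MDE^2, with p taken at baseline."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)           # power = 1 - beta
    p = baseline
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde_abs ** 2

# Illustrative inputs: 2% baseline conversion, 0.5pp absolute MDE.
print(f"{sample_size_per_variant(0.02, 0.005):,.0f} users per variant")  # ~12,300
```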
What the interviewer is testing
Practical understanding of the power analysis trade-off triangle and ability to make business-grounded test design decisions.
Common mistakes
- Confusing MDE with MEI (Minimum Effect of Interest)
- Not running power analysis before launch
- Using default settings without understanding trade-offs
- Ignoring that smaller MDE needs 4x traffic for 2x sensitivity
Mini drill
You have 50K weekly visitors and a 2% baseline conversion rate. Marketing wants to detect a 0.5% absolute lift. What do you say?
When to escalate / pause
Refuse to launch tests with <50% power unless it's explicitly an exploratory learning experiment with documented low-power caveat.
Practical vs statistical significance: how do you decide to ship?
One-minute answer
Statistical significance (p < 0.05) tells you an effect is likely real. Practical significance asks: is it large enough to matter? A test can be statistically significant but practically irrelevant (0.01% lift with millions of users) or vice versa (10% lift in an underpowered pilot). Decision rule: check if the entire confidence interval exceeds your Minimum Effect of Interest (MEI). If the lower bound of the CI is above MEI, ship. If it overlaps, consider trade-offs.
What the interviewer is testing
Ability to bridge statistics and business impact, and avoid shipping changes that are 'statistically significant but who cares.'
Common mistakes
- Shipping any p < 0.05 result without checking effect size
- Not defining MEI before the test
- Ignoring confidence intervals in decision-making
- Forgetting to check guardrail metrics
Mini drill
Test shows +1.2% conversion lift, p = 0.02, CI [+0.3%, +2.1%]. Your MEI is +1%. Ship?
When to escalate / pause
If guardrails show harm or lower CI bound is negative, do not ship even if average effect looks good.
What assumptions sit behind t-tests/z-tests, and how do you check them?
One-minute answer
Z-test/t-test assumptions: (1) independence of observations, (2) approximate normality of sampling distribution (Central Limit Theorem helps with large n), (3) equal variance (for two-sample tests). Check: (1) no carryover effects between users, (2) sample size >30 per group for CLT, (3) variance ratio <3:1, or use Welch's t-test for unequal variance. For conversion rates, binomial proportions test is more appropriate than t-test.
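If the metric is skewed, it helps to show you know the robust alternatives. A rough sketch (simulated data only) comparing Welch's t-test with a rank-based test using scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated revenue-per-user: mostly zeros with a heavy right tail (illustrative only).
control = rng.lognormal(2.0, 1.5, 5000) * rng.binomial(1, 0.1, 5000)
treatment = rng.lognormal(2.05, 1.5, 5000) * rng.binomial(1, 0.1, 5000)

# Welch's t-test drops the equal-variance assumption (still leans on the CLT).
t_stat, p_welch = stats.ttest_ind(treatment, control, equal_var=False)

# Mann-Whitney U is rank-based and robust to skew, but it tests a different
# hypothesis (stochastic ordering), not the difference in mean revenue.
u_stat, p_mw = stats.mannwhitneyu(treatment, control, alternative="two-sided")

print(f"Welch p = {p_welch:.3f}, Mann-Whitney p = {p_mw:.3f}")
```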
What the interviewer is testing
Technical depth: knowing when standard tests apply and when to reach for alternatives.
Common mistakes
- Applying t-test to heavily skewed metrics (revenue)
- Not accounting for user-level clustering
- Ignoring variance differences between groups
- Using t-test for conversion rates instead of proportions test
Mini drill
You're testing a pricing change. Revenue per user is highly skewed (long tail). Standard t-test shows p = 0.09. What do you do?
When to escalate / pause
If metric is heavily skewed or has frequent zeros, flag that standard tests may be unreliable and propose robust alternatives.
Questions 6-10: Test Design
Choosing metrics: north star vs guardrails; what's a good guardrail strategy?
One-minute answer
North star metric aligns to long-term business goal (revenue, retention, active users). Primary metric for the test is the most sensitive proxy you can measure in the test window. Guardrails protect against unintended harm: track 3-5 counter metrics like engagement, quality, latency, or downstream conversion. Use non-inferiority testing for guardrails—don't just check 'not significant,' prove they're not worse by >X%.
What the interviewer is testing
Product sense and understanding of metric trade-offs, plus ability to design defensive experiments.
Common mistakes
- Optimizing a local metric that doesn't ladder up to business goals
- Not defining guardrails upfront
- Treating 'not statistically significant harm' as proof of no harm
- Picking too many primary metrics (multiple testing problem)
Mini drill
You're testing an aggressive paywall. Primary metric: subscriptions. What guardrails would you choose and why?
When to escalate / pause
If guardrails show statistically significant harm beyond your non-inferiority margin, halt or iterate even if primary metric looks good.
Randomization unit (user vs session vs device) and interference/spillovers
One-minute answer
Randomization unit depends on the intervention and metric. User-level for persistent changes (UI redesign, pricing). Session-level for ephemeral tests (search ranking). Device-level if users share devices. Interference happens when treatment affects control: network effects (social features), two-sided markets (Uber drivers see both rider groups), or SEO (organic traffic shifts). Violates SUTVA; requires cluster randomization or switchback designs.
What the interviewer is testing
Experimental design sophistication and awareness of when standard A/B tests break down.
Common mistakes
- Using session-level randomization for features that need consistency
- Ignoring spillover effects in marketplace or social products
- Not recognizing when network effects invalidate user-level tests
- Forgetting cookie deletion and device switching
Mini drill
You're testing a new driver incentive in a rideshare app. User-level randomization shows +8% driver hours. Is this causal?
When to escalate / pause
If you suspect spillover effects (marketplace, social, shared resources), flag that standard A/B tests may overestimate impact and propose alternative designs.
Stratification and balance: when do you stratify, and on what?
One-minute answer
Stratified randomization ensures balance on important covariates by randomizing within strata (e.g., new vs returning users). Reduces variance and protects against bad randomization luck in small samples. Stratify on: (1) high-correlation predictors of your outcome, (2) variables you'll analyze by (segments), (3) time (day of week). Don't stratify on post-treatment variables. Check balance with covariate balance tests; imbalance >5% on key vars is a red flag.
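A minimal sketch of stratified assignment plus a post-randomization balance check, assuming a simple two-segment user table (all data here is simulated):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative user table; stratify on a strong predictor of the outcome.
users = pd.DataFrame({
    "user_id": np.arange(10_000),
    "segment": rng.choice(["new", "returning"], size=10_000, p=[0.8, 0.2]),
})

# Randomize 50/50 *within* each stratum so the arms stay balanced on segment.
users["arm"] = "control"
for segment, idx in users.groupby("segment").groups.items():
    shuffled = rng.permutation(np.asarray(idx))
    users.loc[shuffled[: len(shuffled) // 2], "arm"] = "treatment"

# Balance check: the segment mix should be nearly identical across arms.
print(users.groupby("arm")["segment"].value_counts(normalize=True))
```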
What the interviewer is testing
Understanding of variance reduction techniques and how to improve experimental precision without increasing sample size.
Common mistakes
- Not stratifying when you have strong predictors available
- Stratifying on too many variables (creating tiny strata)
- Forgetting to check balance post-randomization
- Not adjusting analysis for stratification variables (losing efficiency gains)
Mini drill
You have 10K users, 80% new, 20% returning. Returning users convert 5x higher. You randomize and get 51% returning in treatment, 49% in control. Problem?
When to escalate / pause
If balance check shows >5% difference in key predictors, adjust analysis or re-randomize if still early in test.
Variance reduction: what is CUPED, when does it help, and what can go wrong?
One-minute answer
CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance by adjusting for pre-treatment covariates, making tests more sensitive. Increases power 10-50% depending on correlation. Works by subtracting predicted outcome based on pre-period behavior. Helps most when: (1) strong correlation between pre/post metrics, (2) large samples. Can go wrong: (1) using post-treatment covariates (biased), (2) overfitting adjustment model, (3) computation errors.
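The core adjustment is only a few lines. This sketch uses simulated data and a pooled theta to show the mechanics; it is not a production implementation:

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), where x is a pre-experiment
    covariate (often the same metric in the pre-period) and
    theta = cov(x, y) / var(x), estimated on pooled data from both arms."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Simulated example: pre-period metric correlated with the in-experiment metric.
rng = np.random.default_rng(0)
pre = rng.normal(10, 3, size=20_000)
post = 0.6 * pre + rng.normal(0, 3, size=20_000)

adjusted = cuped_adjust(post, pre)
print("variance reduction:", 1 - adjusted.var(ddof=1) / post.var(ddof=1))  # about corr^2
```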
What the interviewer is testing
Knowledge of modern variance reduction techniques used at scale-up companies (Airbnb, Netflix, Uber).
Common mistakes
- Using post-treatment variables in CUPED (e.g., conditioning on post-treatment engagement)
- Not validating that CUPED reduced variance in practice
- Forgetting that CUPED requires pre-period data
- Not explaining the method to stakeholders (trust issues)
Mini drill
Your test is underpowered (50% power). Can CUPED help? Your metric is purchase conversion; you have 30 days of pre-period conversion data with 0.6 correlation.
When to escalate / pause
If implementing CUPED for the first time, validate on A/A tests and past experiments to ensure no bias is introduced.
Duration planning: ramp-ups, seasonality, novelty effects, and why 'run 1 week' is often wrong
One-minute answer
Test duration needs: (1) sufficient sample size (power analysis), (2) full business cycles (1-2 weeks to smooth day-of-week effects), (3) time for novelty to wear off (users react differently to new things initially), (4) avoidance of seasonality. Ramp-up (5% → 50%) helps catch bugs but delays learning. Novelty effects mean week 1 ≠ steady state. Always run at least 1-2 full weeks; for behavior changes, 2-4 weeks is safer.
What the interviewer is testing
Practical judgment about test validity over time and awareness of temporal threats to validity.
Common mistakes
- Stopping at 1 week because you hit significance
- Not accounting for day-of-week effects
- Ignoring novelty effects (especially for UI changes)
- Running during holiday periods without adjustment
Mini drill
Your redesign shows +12% conversion after 3 days with p < 0.01. Product wants to ship Monday. What do you say?
When to escalate / pause
If week 1 and week 2+ show opposite directions, suspect novelty effect. Run longer or use week 2+ data only.
Questions 11-15: Analysis & Interpretation
Sample Ratio Mismatch (SRM): what it is, how you detect it, and what you do next
One-minute answer
SRM occurs when your observed traffic split differs from expected (e.g., expecting 50/50, getting 48/52). Indicates randomization bugs, bot traffic, or tracking failures. Detect: chi-square test on user counts (p < 0.01 = problem). Do NOT trust results if SRM exists. Next steps: (1) investigate bucketing logic, (2) check for browser/device differences, (3) look for bots, (4) verify tracking fires equally. Fix and re-run; don't try to 'adjust' for SRM.
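The check itself is a one-line chi-square test; the counts below are illustrative:

```python
from scipy.stats import chisquare

# Observed user counts per arm vs the expected 50/50 split.
observed = [248_000, 252_000]
expected = [sum(observed) / 2, sum(observed) / 2]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.2e}")
# A common convention: p < 0.01 means SRM; stop the analysis and debug bucketing.
```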
What the interviewer is testing
Operational rigor and ability to recognize when an experiment is fundamentally broken before investing in interpretation.
Common mistakes
- Ignoring small SRM because 'it's close to 50/50'
- Trying to adjust for SRM in analysis instead of fixing root cause
- Not checking SRM routinely
- Shipping results despite SRM
Mini drill
You have 500K users: 248K control, 252K treatment. Is this SRM? If so, how bad?
When to escalate / pause
If SRM is detected, immediately halt test analysis, investigate root cause, fix, and relaunch. Do not proceed with analysis.
Multiple comparisons: many metrics / many segments / many experiments—how do you control false positives?
One-minute answer
Testing multiple hypotheses inflates false positive rate. With 20 comparisons at α=0.05, you expect one false positive. Solutions: (1) pre-specify one primary metric (others are secondary), (2) Bonferroni correction (divide α by # comparisons—conservative), (3) Holm-Bonferroni (less conservative), (4) Benjamini-Hochberg FDR control (allows some false positives, controls rate), (5) hierarchical testing (test primary first; if sig, test secondaries).
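With statsmodels, the standard corrections are one call; the p-values below are made up to show how a lone p = 0.04 among five metrics does not survive correction:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.04, 0.21, 0.33, 0.48, 0.72])  # five secondary metrics (illustrative)

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, p_adjusted.round(3), reject)
# Under any of these corrections, the 0.04 result no longer clears alpha = 0.05.
```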
What the interviewer is testing
Statistical sophistication in managing experiment portfolios and avoiding false discovery.
Common mistakes
- Testing 20 metrics and highlighting the 'winner' without correction
- Not pre-specifying primary metric
- Using Bonferroni when it's too conservative (low power)
- Slicing into 50 segments post-hoc and reporting significant ones
Mini drill
You test 5 metrics: conversion, revenue, engagement, retention, NPS. One shows p=0.04, others >0.2. Can you claim a win?
When to escalate / pause
If stakeholders are cherry-picking significant metrics post-hoc, educate on multiple comparisons and require pre-specified primary metric for future tests.
Peeking and sequential testing: why repeated checks inflate false alarms, and what methods let you monitor safely
One-minute answer
Peeking (checking results repeatedly and stopping when significant) inflates false positive rate from 5% to 20-40%. Problem: each peek is a new opportunity to see random noise. Sequential testing methods allow continuous monitoring: (1) Sequential Probability Ratio Test (SPRT), (2) group sequential designs with alpha spending, (3) Bayesian approaches, (4) always-valid inference. These maintain valid error rates but require different math and boundaries.
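An A/A simulation makes the inflation concrete. This sketch (simulated data, up to 14 daily peeks) is illustrative; the exact inflation depends on how often you look and when:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_days, users_per_day = 1000, 14, 1000
false_positives = 0

for _ in range(n_sims):
    # A/A test: both arms drawn from the same distribution, so the true effect is zero.
    a = rng.normal(0, 1, size=(n_days, users_per_day))
    b = rng.normal(0, 1, size=(n_days, users_per_day))
    for day in range(1, n_days + 1):
        # "Peek": test all data accumulated so far; stop at the first p < 0.05.
        _, p = stats.ttest_ind(a[:day].ravel(), b[:day].ravel())
        if p < 0.05:
            false_positives += 1
            break

print(f"false positive rate with daily peeking: {false_positives / n_sims:.1%}")
# Typically lands around 20%, far above the nominal 5%.
```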
What the interviewer is testing
Deep understanding of optional stopping problem and knowledge of modern solutions used at data-driven companies.
Common mistakes
- Peeking at results daily and stopping when p<0.05
- Not pre-committing to sample size or stopping rule
- Using sequential methods without understanding alpha spending
- Claiming 'we only looked once' when everyone peeked
Mini drill
Your experimentation platform shows a real-time p-value that crosses 0.05 on day 3 of a planned 14-day test. What do you do?
When to escalate / pause
If your org has a culture of daily peeking, propose implementing sequential testing framework or strict 'no peeking' norms with blinded dashboards.
Ratio metrics (CTR, conversion rate, revenue/user): pitfalls, variance estimation choices, and when to bootstrap
One-minute answer
Ratio metrics (numerator/denominator) have complex variance: variance of ratio ≠ ratio of variances. Delta method or bootstrap needed for CIs. Pitfalls: (1) correlation between numerator and denominator, (2) heavy skew, (3) small denominators (instability). Bootstrap resamples your data to estimate CI non-parametrically—good for skewed data, small samples, or complex metrics. Use when standard formulas break down.
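A percentile bootstrap for a skewed per-user metric can be sketched as below; the data is simulated and the helper name is illustrative, not a library function:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated skewed revenue-per-user: most users spend nothing, a few spend a lot.
control = rng.pareto(2.5, size=20_000) * rng.binomial(1, 0.05, 20_000) * 50
treatment = rng.pareto(2.5, size=20_000) * rng.binomial(1, 0.05, 20_000) * 52

def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the difference in means (treatment - control)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(b, size=b.size, replace=True).mean()
                    - rng.choice(a, size=a.size, replace=True).mean())
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

low, high = bootstrap_diff_ci(control, treatment)
print(f"95% bootstrap CI for the lift: [{low:.3f}, {high:.3f}]")
```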
What the interviewer is testing
Technical depth on metric construction and awareness of when simple methods fail.
Common mistakes
- Using normal approximation for revenue/user (heavy skew)
- Ignoring correlation structure in ratio metrics
- Not capping extreme values in revenue metrics
- Using wrong denominator (impressions vs users vs sessions)
Mini drill
You're testing a pricing change. Revenue/user is very skewed (1% of users = 80% of revenue). Standard t-test shows p=0.12. What do you do?
When to escalate / pause
If your metric is a ratio with heavy skew and standard tests give borderline results, propose robust alternatives before making ship decision.
Heterogeneous treatment effects: when you can trust segment lifts, and how you avoid 'storytime analytics'
One-minute answer
HTE means treatment effect varies by subgroup (mobile vs desktop, new vs returning). Pre-specified HTE analysis is valid; post-hoc hunting for 'who it worked for' causes false discovery (many segments × multiple testing). Trust segment lifts when: (1) pre-specified in hypothesis, (2) powered for subgroup analysis, (3) interaction test is significant. Avoid storytime: don't slice 50 ways and report the one significant segment without correction.
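The interaction test can be run as a regression with a treatment-by-segment term. This is a sketch on simulated data using a linear probability model with robust standard errors; it is one reasonable approach, not the only one:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 20_000

# Simulated data where the treatment only helps mobile users (illustrative).
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "mobile": rng.integers(0, 2, n),
})
p_convert = 0.10 + 0.02 * df["mobile"] + 0.01 * df["treated"] * df["mobile"]
df["converted"] = rng.binomial(1, p_convert)

# The interaction coefficient tests whether the treatment effect differs by segment,
# which is the right question, rather than "p < 0.05 in one segment but not the other".
model = smf.ols("converted ~ treated * mobile", data=df).fit(cov_type="HC1")
print(model.summary().tables[1])
```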
What the interviewer is testing
Discipline in distinguishing confirmatory analysis from exploratory data mining, and understanding of interaction effects.
Common mistakes
- Slicing by every available dimension and reporting 'it worked for mobile users in California'
- Not testing for interaction effects (just comparing p-values across segments)
- Not adjusting for multiple comparisons in subgroup analysis
- Claiming causality for exploratory findings
Mini drill
Your test shows no overall effect (p=0.6), but when you slice by device, mobile shows +8% (p=0.03). Can you ship mobile-only?
When to escalate / pause
If you find strong subgroup effects post-hoc, label as exploratory and require replication before shipping segment-specific changes.
Questions 16-20: Advanced Topics
Non-inferiority/equivalence tests for guardrails: how to prove 'not worse' rather than 'better'
One-minute answer
Non-inferiority tests prove a new treatment is not meaningfully worse than control. Set a non-inferiority margin (δ): the maximum acceptable decline. Conclude non-inferiority if the confidence interval for the difference lies entirely above -δ. Equivalence testing is two-sided: prove the effect is within [-δ, +δ]. Used for guardrails when simplifying features or switching vendors. Because the test is one-sided, it typically needs somewhat less sample than a two-sided superiority test powered for the same margin.
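The decision rule reduces to comparing a one-sided lower confidence bound against -δ. A minimal sketch with made-up numbers (the helper name is illustrative), framed for a metric where a decrease would be harmful:

```python
from scipy.stats import norm

def non_inferior(diff, se, margin, alpha=0.05):
    """Non-inferiority check for a 'higher is better' metric: conclude non-inferior
    if the one-sided (1 - alpha) lower bound on (treatment - control) exceeds -margin."""
    lower_bound = diff - norm.ppf(1 - alpha) * se
    return lower_bound, lower_bound > -margin

# Illustrative numbers: observed -0.1pp difference, SE of 0.15pp, margin of 0.5pp.
lb, ok = non_inferior(diff=-0.001, se=0.0015, margin=0.005)
print(f"one-sided lower bound = {lb:.4f}, non-inferior: {ok}")
```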
What the interviewer is testing
Knowledge of alternative testing frameworks appropriate for defensive or guardrail metrics.
Common mistakes
- Using standard null hypothesis test and concluding 'not significant = not worse'
- Not pre-specifying the non-inferiority margin
- Setting margin too wide (anything goes) or too narrow (impossible to prove)
- Confusing non-inferiority with equivalence
Mini drill
You're migrating to a cheaper CDN. Page load time is a guardrail. Current avg: 2.0s. Acceptable degradation: +0.2s. After test: new CDN = 2.1s, 95% CI [-0.05s, +0.25s]. Ship?
When to escalate / pause
If rolling out a cost-saving or simplification change, require non-inferiority tests on key guardrails—don't rely on 'not statistically significant harm.'
Bayesian vs frequentist A/B: decision rules you'd actually use (and how you'd explain them)
One-minute answer
Frequentist: p-values, fixed sample size, binary decisions (reject/fail to reject). Bayesian: posterior probability that B beats A, credible intervals, allows sequential stopping without penalty. Bayesian decision rule: ship if P(B > A) > 95% AND expected loss if wrong < $X. Easier to explain to stakeholders ('87% chance of winning') but requires specifying prior beliefs. Both are valid; choice depends on organizational culture and infrastructure.
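For conversion rates, the Beta-Binomial posterior makes these quantities easy to simulate. The counts and the flat Beta(1, 1) prior below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(11)

# Illustrative conversion counts for each arm.
conv_a, n_a = 1_150, 50_000
conv_b, n_b = 1_230, 50_000

# Posterior draws of each arm's conversion rate under a Beta(1, 1) prior.
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

prob_b_beats_a = (post_b > post_a).mean()
# Expected loss (in conversion-rate points) if we ship B but A is actually better.
expected_loss = np.maximum(post_a - post_b, 0).mean()

print(f"P(B > A) = {prob_b_beats_a:.1%}, expected loss = {expected_loss:.5f}")
```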
What the interviewer is testing
Conceptual understanding of both paradigms and pragmatic judgment about when each is more appropriate.
Common mistakes
- Claiming Bayesian 'solves' the peeking problem (it shifts the burden to prior and stopping rule)
- Using uninformative priors and claiming they're objective
- Not explaining prior choice to stakeholders
- Mixing frequentist and Bayesian inference (e.g., Bayesian CI with frequentist α)
Mini drill
Your Bayesian test shows P(B > A) = 94%, expected lift = +2.1%, expected loss if wrong = $500. Your prior was flat. Ship?
When to escalate / pause
If stakeholders are confused by p-values and CIs, consider Bayesian reporting ('X% chance of improvement') but ensure proper infrastructure and prior specification.
Cluster/geo experiments: why independence fails and how you'd design around it
One-minute answer
Cluster experiments randomize groups of users (cities, schools, time periods) instead of individuals. Needed when individual randomization causes spillover (marketplace effects, cannibalization, infrastructure changes). Problem: observations within clusters are correlated, violating independence. Solution: (1) randomize enough clusters (>20-30), (2) use cluster-robust standard errors, (3) account for cluster size in analysis. Power is driven by number of clusters, not individuals.
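To see why the clustering matters at analysis time, compare naive and cluster-robust standard errors on simulated geo data; this sketch uses statsmodels and made-up cluster effects:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n_clusters, users_per_cluster = 40, 500

# Simulated geo experiment: outcomes share a within-cluster shock, violating independence.
cluster = np.repeat(np.arange(n_clusters), users_per_cluster)
treated = (cluster % 2).astype(float)                    # half the clusters treated
cluster_shock = rng.normal(0, 1, n_clusters)[cluster]    # shared within each cluster
y = 0.1 * treated + cluster_shock + rng.normal(0, 1, cluster.size)
df = pd.DataFrame({"y": y, "treated": treated, "cluster": cluster})

naive = smf.ols("y ~ treated", data=df).fit()
robust = smf.ols("y ~ treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["cluster"]}
)
# The cluster-robust SE is much larger; the naive SE badly overstates precision.
print(f"naive SE: {naive.bse['treated']:.4f}, cluster-robust SE: {robust.bse['treated']:.4f}")
```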
What the interviewer is testing
Sophisticated experimental design for settings where standard A/B tests don't work.
Common mistakes
- Randomizing a few large clusters (e.g., 5 cities) and claiming valid inference
- Not accounting for clustering in analysis (standard errors too small)
- Ignoring that power depends on # clusters, not # users
- Confusing cluster randomization with stratified randomization
Mini drill
You're testing a TV ad campaign's effect on app installs. You randomize 8 DMAs (4 treatment, 4 control). 10M people total. Is this well-powered?
When to escalate / pause
If you have <15-20 clusters, flag low power and propose alternative designs (synthetic control, panel regression, or wait for more clusters).
Overlapping experiments and interaction risk: how platforms and analysts mitigate collisions
One-minute answer
When running multiple concurrent tests, users may be in several experiments at once. Risk: interactions between experiments (Test A affects Test B's results). Mitigation: (1) orthogonal bucketing (independent randomization), (2) interaction detection (compare test results between users in 0, 1, or 2+ tests), (3) reserving % of traffic for single-test users, (4) experiment monitoring for collisions. Platforms like Google and Meta run thousands of concurrent tests with careful orchestration.
What the interviewer is testing
Operational sophistication and understanding of scaled experimentation programs.
Common mistakes
- Assuming orthogonal randomization eliminates all interaction risk (it doesn't)
- Not monitoring for interactions between high-traffic tests
- Running related tests (e.g., two homepage tests) concurrently without coordination
- Not documenting overlapping tests
Mini drill
You're running Test A (homepage CTA color) and Test B (pricing page headline). User X is in both. Test A shows +5% lift, B shows +3%. Can you estimate the combined effect as +8%?
When to escalate / pause
If two related experiments (same surface area, same metric) are running concurrently and show unexpected results, check for interaction before shipping either.
Program-level impact: how you estimate experimentation returns over many launches, and why holdouts matter
One-minute answer
Not all significant tests hold up post-launch. Selection bias (only winners ship), regression to mean, and interaction effects mean sum of test lifts > actual combined lift. Holdout: keep a small % of users (1-5%) in a 'no experiments' bucket long-term. Compare holdout vs fully experimented users to estimate true program-level impact. Quantifies experimentation ROI and catches when tests mislead. Companies like Netflix and Spotify use this to validate their experimentation systems.
What the interviewer is testing
Strategic thinking about experimentation programs as a whole, not just individual tests.
Common mistakes
- Summing all test lifts and claiming that's the total value generated
- Not running holdouts (can't validate program impact)
- Making holdout too small (<1%) or too large (>10%)
- Not accounting for opportunity cost of holdout
Mini drill
You ran 50 tests last year, 30 showed significant positive effects (avg +2% conversion each). You sum to +60% total lift. Your actual annual conversion growth: +8%. What happened?
When to escalate / pause
If your org claims 'experiments drove 50% growth' but has no holdout validation, propose implementing a long-term holdout to measure true program-level impact.
Scenario Drills: Practice Like Real Interviews
Interviews test your ability to work through ambiguous, multi-step problems under time pressure. These scenario drills simulate real interview questions. Try answering each one aloud before reading the ideal answer.
Design: Pricing page test
You're a PM at a SaaS company. The pricing page shows 3 tiers: Free, Pro ($29/mo), Enterprise (custom). Hypothesis: Adding a mid-tier 'Team' plan at $19/mo will increase overall revenue. Design the A/B test: units, randomization, primary metric, guardrails, MDE, sample size.
Debug: Why did my test fail?
Your homepage redesign test ran for 2 weeks. Treatment shows -2% conversion (p=0.08, not significant), but your hypothesis was strong and user research supported the change. Week 1 showed +5%, week 2 showed -8%. What might explain this? How do you debug?
Decide: Conflicting signals
Your aggressive onboarding flow test shows: Primary metric (activation): +12% (p<0.001), Guardrail 1 (time to first action): -18% (p<0.001, faster is good), Guardrail 2 (day 7 retention): -4% (p=0.06, not significant but trending negative). Cost to implement: 2 eng-weeks. Do you ship? Why or why not?
Communicate: Explain to stakeholders
Your CEO sees that a test hit statistical significance after 3 days and wants to ship immediately to 'capture the win before competitors copy.' The test was planned for 14 days. You know peeking inflates false positives. How do you explain this to a non-technical executive in 30 seconds?
Design: Marketplace experiment
You're at a rideshare company testing a new driver surge pricing algorithm. Standard user-level A/B test isn't appropriate because drivers and riders interact across groups. How would you design this experiment?
Analyze: Segment deep-dive
Your test shows no overall effect (p=0.5), but post-hoc you notice: Mobile iOS: +15% (p=0.02), Mobile Android: +2% (p=0.7), Desktop: -8% (p=0.04). What do you conclude? Can you ship iOS-only?
Debug: Sample Ratio Mismatch
Your experiment shows 53% of users in treatment, 47% in control (n=100K total). Conversion rate shows treatment +8% (p=0.01). Do you trust this result? What do you check?
Decide: Underpowered result
You ran a test for 2 weeks (all the traffic you could get). Result: +3.2% lift, p=0.12 (not significant), 95% CI [-0.8%, +7.2%]. Your MEI is +2%. Implementing costs 1 eng-week. Ship or not?
Cheat Sheets: Quick Reference
Red Flags: When to Pause or Reject Results
- SRM detected (chi-square p < 0.01 on the traffic split): halt analysis, fix bucketing, relaunch.
- Guardrail harm beyond the non-inferiority margin: do not ship, even if the primary metric wins.
- Significant result found by daily peeking without a sequential testing framework.
- Power below ~50% for the stated MDE, unless the test is explicitly labeled exploratory.
- Week 1 and week 2+ trending in opposite directions (likely novelty effect).
- Suspected spillover between arms (marketplace, social, shared resources).
- Post-hoc segment wins without pre-specification or multiple-comparison correction.
Formula Cheat Sheet
Statistical Significance (Z-test for proportions)
z = (p₁ - p₂) / √[p(1-p)(1/n₁ + 1/n₂)], where p is the pooled conversion rate
When: Comparing two conversion rates with large samples
Interpret: If |z| > 1.96, reject null at α=0.05 (two-tailed)
Sample Size (per variant)
n = 2(Z_α/2 + Z_β)² × p(1-p) / MDE²
When: Planning test before launch (power analysis)
Interpret: Halving the MDE roughly quadruples the required traffic (n ∝ 1/MDE²)
Confidence Interval for difference in proportions
CI = (p₁ - p₂) ± Z × √[p₁(1-p₁)/n₁ + p₂(1-p₂)/n₂]
When: Estimating uncertainty around lift size
Interpret: If CI excludes 0, effect is statistically significant
Minimum Detectable Effect (MDE)
MDE = (Z_α/2 + Z_β) × √[2p(1-p)/n]
When: Understanding sensitivity of your test design
Interpret: The smallest true effect you have a good chance (your chosen power) of detecting
Sample Ratio Mismatch (Chi-square)
χ² = Σ(observed - expected)² / expected
When: Checking if traffic split matches intended randomization
Interpret: If p < 0.01, reject null—you have SRM, investigate
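To tie the cheat-sheet formulas together, here is a small sketch that computes the z-statistic, the CI for the difference, and the SRM check on one set of made-up results:

```python
from math import sqrt
from scipy.stats import norm, chisquare

# Illustrative two-proportion results.
x1, n1 = 1_000, 50_000   # control conversions, users
x2, n2 = 1_100, 50_000   # treatment conversions, users
p1, p2 = x1 / n1, x2 / n2

# Z-test for two proportions (pooled variance).
p_pool = (x1 + x2) / (n1 + n2)
z = (p2 - p1) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_value = 2 * norm.sf(abs(z))

# 95% CI for the difference (unpooled variance).
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (p2 - p1 - 1.96 * se, p2 - p1 + 1.96 * se)

# SRM check on the user counts.
_, srm_p = chisquare([n1, n2], f_exp=[(n1 + n2) / 2] * 2)

print(f"z = {z:.2f}, p = {p_value:.4f}, CI = ({ci[0]:.4%}, {ci[1]:.4%}), SRM p = {srm_p:.2f}")
```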
Key Takeaways for Interview Success
Think like a practitioner, not a textbook
Interviewers want to see practical judgment: 'When would you escalate?', 'How would you debug this?', 'What would you check first?' Study the red flags and scenario drills, not just formulas.
Master the modern topics
SRM, sequential testing, CUPED, guardrails, and holdouts separate 2026 candidates from those stuck in 2015. If you don't know these, you'll struggle at scale-up companies (Airbnb, Netflix, Uber, Stripe).
Bridge stats and business
Every technical answer should connect to impact: 'Why does this matter?' Practice explaining p-values, confidence intervals, and power to non-technical stakeholders in under 30 seconds.
Know when you don't know
If asked about cluster randomization or CUPED and you're unfamiliar, say so and show how you'd learn: 'I haven't used CUPED, but I understand variance reduction concepts. I'd start with...' Honesty > BS.
References
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. The definitive reference for modern experimentation platforms and practices.
- Fabijan, A., et al. (2019). Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners. Proceedings of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
- Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM). CUPED introduction and variance reduction techniques.
- Johari, R., Li, L., Weintraub, G., & Ramdas, A. (2022). Sequential Testing for A/B Tests. arXiv:2206.09090. Modern sequential testing methods that allow peeking without inflating error rates.
- Demyr, S. (2024). Estimating the Returns to Experimentation: Evidence from Holdback Tests. Program-level impact and holdout methodology.
- Spotify Engineering. (2020). Guardrail Metrics: How to Protect Your Experiments from Hidden Harm. Practical guidance on counter metrics and non-inferiority testing.
- Statsig Blog. (2024). Introducing Stratified Sampling. Modern implementation of stratification in experimentation platforms.
- Nubank Engineering. (2022). 3 Lessons from Implementing CUPED at Nubank. Real-world variance reduction case study.