How to Create an A/B Test Design Template (and Why It Prevents Bad Decisions)
Turn tribal knowledge into institutional rigor.
The fastest way to reduce false wins, missed harms, and "can't trust the data" calls.
You launch an experiment, wait for statistical significance, and ship the "winning" variant. Six weeks later, revenue is flat. What went wrong?
Maybe the metric definition was ambiguous. Maybe you failed to detect a Sample Ratio Mismatch. Maybe the stopping rule was never documented, so the test ran until someone found a p-value they liked. These failures are not technical anomalies. They are systemic failures of process.
A structured test design template is the fastest way to prevent these mistakes. It forces you to define success criteria, document decision rules, configure trustworthy pre-flight checks, and specify exactly when to shut down a test before you launch. This article walks through what a production-ready template looks like and how to operationalise it.
Why You Need a Template
Research from Fabijan et al. (ICSE 2018, SEIP 2019) on large-scale experimentation platforms at Microsoft emphasises that most analysis work happens before the experiment starts. The decisions you make in the design phase determine whether the results will be trustworthy and actionable.
A design template enforces this discipline. It transforms implicit tribal knowledge into explicit institutional knowledge that survives team turnover and scales with the organisation.
Reduces False Wins
Explicit decision rules and MEI (minimum effect of interest) thresholds prevent you from shipping changes whose effects are statistically significant but practically irrelevant.
Catches Instrumentation Issues
Pre-flight checks catch SRM, broken logging, and randomisation errors before they corrupt weeks of data collection.
Prevents Ambiguous Outcomes
When you define "ship", "don't ship", and "iterate" criteria up front, there is no room for post-hoc rationalisation.
Checklists save lives
Atul Gawande's research on surgical checklists showed a 47% reduction in deaths and a 36% reduction in complications. Fabijan et al. applied the same principle to experimentation: explicit checklists prevent catastrophic failures even when experts are involved.
Core Principles
Three principles underpin a trustworthy test design template.
Metric Hierarchy: OEC, Guardrails, Diagnostics, Data Quality
Microsoft's experimentation platform guidance prescribes four types of metrics. Your OEC (Overall Evaluation Criterion) is the primary success metric. Guardrails are metrics you must not degrade (revenue, latency, crashes). Diagnostics help you understand mechanism (clickthrough rate, session length). Data quality metrics ensure trustworthiness: SRM checks, logging volume, join rates.
This hierarchy clarifies decision-making: you ship when the OEC improves beyond the MEI and guardrails are not harmed. You investigate diagnostics only to understand results, not to make decisions.
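One lightweight way to make the hierarchy explicit is to encode it alongside the design doc as a small, machine-readable spec that analysis and alerting code can read. The sketch below is illustrative only; the metric names and thresholds are placeholders, not recommendations.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricSpec:
    name: str
    role: str                # "oec", "guardrail", "diagnostic", or "data_quality"
    good_direction: str      # "increase" or "decrease"
    threshold: Optional[float] = None  # MEI for the OEC, max tolerated harm for guardrails

# Hypothetical metric hierarchy; names and thresholds are placeholders.
METRICS = [
    MetricSpec("day7_retention", "oec", "increase", threshold=0.02),
    MetricSpec("revenue_per_user", "guardrail", "increase", threshold=-0.01),
    MetricSpec("crash_rate", "guardrail", "decrease", threshold=0.01),
    MetricSpec("clickthrough_rate", "diagnostic", "increase"),
    MetricSpec("assignment_srm_p_value", "data_quality", "increase"),
]
```

Keeping the role and threshold next to each metric makes it obvious which numbers are allowed to drive the ship decision and which are merely explanatory.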
Pre-Flight Trust Checks
Fabijan et al. found that most experiment failures stem from instrumentation errors, not analysis mistakes. Pre-flight checks validate that your data pipeline is working before you collect production data.
Critical validations include: instrumentation tests (events fire with the correct schema), randomisation correctness (bucketing logic is deterministic), SRM alert configuration (a deviation from the configured traffic split is detected within 24 hours), overlap checks (no conflicts with other running tests), and bot filtering (automated traffic is excluded).
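The SRM check itself is simple: a chi-square goodness-of-fit test of observed assignment counts against the configured split. A minimal sketch using scipy, with made-up counts:

```python
from scipy.stats import chisquare

def srm_detected(control_users: int, treatment_users: int,
                 expected_split=(0.5, 0.5), alpha: float = 0.001) -> bool:
    """Return True if observed traffic deviates from the configured split."""
    total = control_users + treatment_users
    expected = [total * expected_split[0], total * expected_split[1]]
    stat, p_value = chisquare([control_users, treatment_users], f_exp=expected)
    return p_value < alpha  # p < 0.001 is a common SRM alert threshold

# Example: a 50/50 test that actually delivered 50,700 vs. 49,300 users.
print(srm_detected(50_700, 49_300))  # True -> investigate before trusting any results
```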
Explicit Decision Rules
Ambiguity is the enemy of trustworthiness. Your template must specify, before launch, the exact conditions under which you will ship, iterate, or kill the feature.
Example decision rule: "Ship if OEC improves by at least 2% (MEI) with p < 0.05 and no guardrails degrade by more than 1%. Iterate if OEC shows positive trend but CI lower bound is below MEI. Don't ship if OEC is flat or negative, or if any guardrail degrades significantly."
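Because the rule is fully specified, it can be applied mechanically at decision time. The sketch below mirrors the example rule above; the thresholds are the example's, and the inputs are whatever your analysis produces.

```python
def decide(oec_lift: float, oec_ci_lower: float, p_value: float,
           worst_guardrail_change: float,
           mei: float = 0.02, alpha: float = 0.05,
           guardrail_tolerance: float = -0.01) -> str:
    """Map results to a pre-registered decision.

    oec_lift / oec_ci_lower: relative OEC change and the lower bound of its CI.
    worst_guardrail_change: most negative relative change across guardrails
    (e.g. -0.015 means a 1.5% drop).
    """
    if worst_guardrail_change < guardrail_tolerance:
        return "don't ship (guardrail harm)"
    if oec_lift >= mei and p_value < alpha:
        return "ship"
    if oec_lift > 0 and oec_ci_lower < mei:
        return "iterate (positive trend, CI lower bound below MEI)"
    return "don't ship (flat or negative OEC)"
```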
The 9-Section Template Walkthrough
A production-ready test design template contains nine sections. Each section addresses a specific failure mode documented in experimentation research.
Summary
Basic metadata for searchability and accountability.
Hypothesis and Decision Rules
State the hypothesis in measurable terms and define exactly when you will ship, iterate, or kill the feature.
Hypothesis Template:
"If we [change], then [outcome with magnitude], because [reasoning]."
Decision Rules:
- Ship: OEC improves by MEI+, guardrails safe
- Iterate: Positive trend but CI below MEI
- Don't ship: Flat, negative, or guardrail harm
Experiment Setup
Specify how the experiment is configured.
Metrics
The most critical section. Ambiguous metric definitions are among the most common pitfalls catalogued in the "Dirty Dozen" paper; the sketch after this section shows what an unambiguous, numerator-over-denominator definition looks like in code.
OEC / Primary Metric
- Exact definition with numerator/denominator
- Direction of good (increase/decrease)
- Measurement window (7-day, per-session)
- Owner and calculation method
Guardrails (2-5 metrics)
Revenue, latency, crash rate, DAU retention. Metrics you must not harm.
Diagnostics
Click rate, page views, session length. For understanding mechanism.
Data Quality
SRM checks, logging volume, join rates. Ensures trustworthiness.
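To make "exact definition with numerator/denominator" concrete, here is a pandas sketch that computes a Day-7-retention-style OEC from an exposure table and an activity table. The column names and table shapes are assumptions for illustration.

```python
import pandas as pd

def day7_retention(exposures: pd.DataFrame, activity: pd.DataFrame) -> pd.Series:
    """OEC per variant: users active on day 7 / users exposed (user-level, deduplicated).

    exposures: one row per exposed user, columns [user_id, variant, exposure_date].
    activity:  one row per active user-day, columns [user_id, activity_date].
    Date columns are assumed to be datetime64.
    """
    exposed = exposures.drop_duplicates("user_id")
    merged = exposed.merge(activity, on="user_id", how="left")
    merged["days_since_exposure"] = (
        merged["activity_date"] - merged["exposure_date"]
    ).dt.days
    per_user = (
        merged.assign(active=lambda d: d["days_since_exposure"].eq(7))
        .groupby("user_id", as_index=False)
        .agg(variant=("variant", "first"), active=("active", "max"))
    )
    # Numerator: users with any activity exactly 7 days after exposure.
    # Denominator: all exposed users (left join keeps users with no activity).
    return per_user.groupby("variant")["active"].mean()
```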
Power and Duration Plan
Document the statistical inputs that determine test duration.
Variance Reduction: Document if using CUPED, stratification, or other techniques.
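If you plan to use CUPED, also document the pre-experiment covariate and the period it is computed over. The adjustment itself is only a few lines, as the numpy sketch below shows (it assumes every unit has a pre-experiment value of the covariate):

```python
import numpy as np

def cuped_adjust(metric: np.ndarray, pre_metric: np.ndarray) -> np.ndarray:
    """CUPED: remove the part of the metric explained by a pre-experiment covariate.

    theta is the regression coefficient of metric on pre_metric; the adjusted
    metric keeps the same mean but has lower variance, shrinking confidence intervals.
    """
    theta = np.cov(metric, pre_metric)[0, 1] / np.var(pre_metric, ddof=1)
    return metric - theta * (pre_metric - pre_metric.mean())
```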
Trustworthy Pre-Flight Checklist
Validate the data pipeline before production launch. Fabijan et al. emphasise this is where most failures can be caught.
Monitoring and Stop Rules
Define what triggers rollback or shutdown. This prevents tests from running indefinitely or causing harm.
Immediate Rollback Triggers:
- Guardrail degrades by more than X%
- SRM detected (p < 0.001 on traffic ratio)
- Crash rate spikes above threshold
- P99 latency exceeds Y ms
On-Call and Escalation:
Who gets paged for different alert types, escalation path, rollback procedure.
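The rollback triggers above should be evaluated by an automated check on a schedule, not by someone remembering to look at a dashboard. A minimal sketch, with hypothetical input fields and placeholder thresholds:

```python
def fired_rollback_triggers(snapshot: dict,
                            max_guardrail_drop: float = -0.02,
                            srm_alpha: float = 0.001,
                            max_crash_rate: float = 0.005,
                            max_p99_latency_ms: float = 1500) -> list:
    """Return the rollback triggers fired by a monitoring snapshot.

    snapshot keys (hypothetical): worst_guardrail_change, srm_p_value,
    crash_rate, p99_latency_ms.
    """
    fired = []
    if snapshot["worst_guardrail_change"] < max_guardrail_drop:
        fired.append("guardrail degradation")
    if snapshot["srm_p_value"] < srm_alpha:
        fired.append("sample ratio mismatch")
    if snapshot["crash_rate"] > max_crash_rate:
        fired.append("crash rate above threshold")
    if snapshot["p99_latency_ms"] > max_p99_latency_ms:
        fired.append("p99 latency above threshold")
    return fired
```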
Analysis Plan
Specify the statistical approach for each metric type.
Risks and Ethics
Document potential harms and mitigations, especially for sensitive features or personalization.
Filled Example: New Onboarding Screen
Here is a realistic example showing what a completed template looks like for a mobile app onboarding change.
1. Summary
Name: onboarding_screen_simplification_v2
Owner: Alex Chen (PM)
Reviewers: Data Science (Jamie), Eng (Taylor), Legal (Morgan)
Links: PRD [link], Dashboard [link], Code PR [link]
Launch: 2025-04-20, Decision by: 2025-05-10
2. Hypothesis and Decision
Hypothesis: If we reduce onboarding from 5 screens to 2 screens (remove demographics questions), then Day-7 retention will improve by at least 3 percentage points, because users abandon due to onboarding friction.
Ship if: Day-7 retention improves by ≥3 percentage points (MEI), p < 0.05, and no guardrail degrades by >1%.
Iterate if: Positive trend but CI lower bound <3 percentage points.
Don't ship if: Flat, negative, or guardrail harm.
3. Experiment Setup
Variants: Control (5 screens) vs. Treatment (2 screens)
Randomisation Unit: User ID
Assignment: Hash-based, Layer 3
Traffic Split: 50/50
Eligibility: New installs, iOS + Android
Exposure: On onboarding screen load
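"Hash-based" assignment means the variant is a deterministic function of the user ID and an experiment-specific salt, so a user always sees the same variant and assignments can be recomputed offline. A minimal sketch of the idea (the salt and split are illustrative; real platforms layer seed management and mutually exclusive layers on top):

```python
import hashlib

def assign_variant(user_id: str,
                   salt: str = "onboarding_screen_simplification_v2",
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a user to control or treatment."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

assert assign_variant("user_123") == assign_variant("user_123")  # stable across calls
```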
4. Metrics
OEC: Day-7 Retention
Definition: Users active on Day 7 / Users exposed
Direction: Increase, Window: 7 days from exposure
Owner: Data Science, Calculation: user-level binary indicator
Guardrails:
- Revenue per new user (7-day): must not degrade >1%
- Crash rate: must stay <0.5%
- Day-1 retention: monitor, flag if drops >2%
Diagnostics:
- Onboarding completion rate
- Time spent in onboarding
- Drop-off by screen
Data Quality:
- SRM check (traffic ratio 50:50 within 2%)
- Event logging volume (stable ±10%)
5. Power and Duration
Baseline: Day-7 retention = 42%
MDE: 3 percentage points (relative lift 7.1%)
Alpha: 0.05 (two-tailed), Power: 80%
Required Sample: ≈4,300 users per variant (see the sketch below)
Duration: 14 days (traffic: ~800 eligible new installs/day, ~400 per variant at full ramp)
Ramp: 10% Day 1-2, 50% Day 3-4, 100% Day 5-14
Variance Reduction: None (new installs have no pre-exposure data, so CUPED is not applicable)
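The required sample size follows from the stated inputs and can be reproduced with statsmodels (a sketch, assuming statsmodels is installed; the arcsine approximation it uses lands at roughly 4,300 users per variant):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.42   # Day-7 retention in control
mde_abs = 0.03    # 3 percentage points

# Cohen's h for the two proportions, then solve for n per group.
effect_size = proportion_effectsize(baseline + mde_abs, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(round(n_per_variant))  # ≈ 4,286 -> plan for ~4,300 users per variant
```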
6. Pre-Flight Checklist
Instrumentation: Tested on staging, events fire correctly
Randomisation: Hash-based bucketing validated
SRM Alert: Configured to fire on p < 0.001 traffic imbalance
Overlap: No conflicts with other onboarding tests
Bot Filtering: SDK bots excluded
7. Stop Rules
Rollback if:
- Crash rate >0.5% (immediate)
- Revenue per user drops >2% after 7 days
- SRM detected (p < 0.001)
On-Call: Alex (PM), escalate to Taylor (Eng) for rollback
8. Analysis Plan
Test: Two-proportion z-test (see the sketch below)
Corrections: None (single primary metric, so no multiple-testing adjustment)
Slicing: iOS vs. Android, country (US, GB, CA)
Null handling: If CI includes zero, report as inconclusive
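The analysis itself is then a few lines with statsmodels (a sketch with made-up final counts, assuming a recent statsmodels version for the confidence-interval helper):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

# Made-up final counts: retained users and exposed users, treatment then control.
retained = np.array([2007, 1881])
exposed = np.array([4480, 4478])

z_stat, p_value = proportions_ztest(retained, exposed)
ci_low, ci_high = confint_proportions_2indep(
    retained[0], exposed[0], retained[1], exposed[1], compare="diff"
)
lift = retained[0] / exposed[0] - retained[1] / exposed[1]
print(f"lift={lift:.3f}, p={p_value:.4f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

With these invented counts the lift is about 2.8 percentage points and statistically significant, but the confidence interval's lower bound sits well below the 3-point MEI, so the pre-registered rules call this "iterate", not "ship".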
9. Risks and Ethics
User Harm: None expected (simplification improves UX)
Compliance: No PII collected in new flow, GDPR compliant
Mitigations: Monitor for bias in retention by demographic (post-hoc analysis)
Copy-Ready Template Scaffold
Below is a blank template scaffold your team can copy into Notion, Confluence, or Google Docs. Replace placeholders with your experiment details.
# Experiment Design Template

## 1. Summary
- **Experiment Name:** [short_identifier]
- **Owner:** [Name, Role]
- **Reviewers:** [Stakeholder 1], [Stakeholder 2], [Stakeholder 3]
- **Links:**
  - PRD: [link]
  - Dashboard: [link]
  - Code PR: [link]
- **Launch Date:** [YYYY-MM-DD]
- **Decision Deadline:** [YYYY-MM-DD]

## 2. Hypothesis and Decision Rules
**Hypothesis:** If we [change], then [outcome with magnitude], because [reasoning].

**Decision Rules:**
- **Ship if:** [OEC criteria + guardrail safety]
- **Iterate if:** [positive trend but below MEI]
- **Don't ship if:** [flat, negative, or guardrail harm]

## 3. Experiment Setup
- **Variants:** Control vs. Treatment (or A/B/C)
- **Randomisation Unit:** [User ID, Session, Device]
- **Assignment:** [Hash-based, platform layer]
- **Traffic Split:** [50/50, 90/10, etc.]
- **Eligibility:** [Who is included]
- **Exposure Definition:** [When users are counted]

## 4. Metrics

### OEC / Primary Metric
- **Metric:** [Name]
- **Definition:** [Exact formula: numerator / denominator]
- **Direction of Good:** [Increase / Decrease]
- **Measurement Window:** [7-day, per-session, etc.]
- **Owner:** [Team/Person]
- **Calculation Method:** [User-level aggregation, filters]

### Guardrails (2-5 metrics)
1. **[Guardrail 1]:** [Definition, threshold]
2. **[Guardrail 2]:** [Definition, threshold]
3. **[Guardrail 3]:** [Definition, threshold]

### Diagnostics
- [Diagnostic 1]: [Purpose]
- [Diagnostic 2]: [Purpose]
- [Diagnostic 3]: [Purpose]

### Data Quality Metrics
- **SRM Check:** [Traffic ratio expectation, alert threshold]
- **Logging Volume:** [Expected daily volume, ±X% acceptable]
- **Join Rates:** [If applicable]

## 5. Power and Duration Plan
- **Baseline Rate:** [Historical value]
- **MDE:** [Minimum detectable effect]
- **Alpha:** [0.05 typical]
- **Power:** [80%, 90%]
- **Required Sample Size:** [N per variant]
- **Planned Duration:** [Days/weeks]
- **Ramp Plan:** [10% → 50% → 100%]
- **Variance Reduction:** [CUPED, stratification, or none]

## 6. Trustworthy Pre-Flight Checklist
- [ ] **Instrumentation Validation:** Events fire correctly on test traffic
- [ ] **Randomisation Check:** Bucketing logic is deterministic
- [ ] **SRM Alert Configuration:** Alert fires on traffic imbalance
- [ ] **Overlap Check:** No conflicts with other running tests
- [ ] **Bot Filtering:** Automated traffic excluded
- [ ] **Sanity-Check Slices:** No unexpected segment imbalance

## 7. Monitoring and Stop Rules

### Immediate Rollback Triggers
- Guardrail degrades by more than [X%]
- SRM detected (p < 0.001 on traffic ratio)
- Crash rate exceeds [threshold]
- Latency exceeds [Y ms]

### On-Call and Escalation
- **Primary:** [Name]
- **Escalation:** [Name]
- **Rollback Procedure:** [Link to runbook]

## 8. Analysis Plan
- **Statistical Test:** [Two-proportion z-test, t-test, etc.]
- **Multiple Comparisons:** [Correction method if applicable]
- **Slicing Plan:** [Segments to analyse]
- **Null/Ambiguous Handling:** [How to interpret edge cases]

## 9. Risks and Ethics
- **User Harm:** [Potential harms, especially to vulnerable segments]
- **Compliance:** [GDPR, COPPA, other privacy constraints]
- **Mitigations:** [Safeguards for sensitive data or personalization]

---

**Approvals:**
- [ ] PM: [Name, Date]
- [ ] Data Science: [Name, Date]
- [ ] Engineering: [Name, Date]
- [ ] Legal (if required): [Name, Date]
Operationalising the Template
A template is only valuable if it is used. Here is how to bake it into your experimentation workflow.
PR Checklist Integration
Require a link to a filled design doc in every experiment PR. Gate deployments on documented metrics, stop rules, and SRM alert configuration. No launch without a design doc.
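One lightweight way to enforce the gate is a CI step that fails when the PR description contains no design-doc link. The sketch below is hypothetical: it assumes the PR body is exposed to the job as an environment variable named PR_BODY and that design docs live in Notion, Confluence, or Google Docs.

```python
import os
import re
import sys

# Hypothetical CI gate: fail the build if the PR description lacks a design-doc link.
pr_body = os.environ.get("PR_BODY", "")
has_design_doc = re.search(
    r"https?://\S*(notion|confluence|docs\.google)\S*", pr_body, re.IGNORECASE
)

if not has_design_doc:
    print("Missing experiment design doc link in PR description.")
    sys.exit(1)
print("Design doc link found.")
```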
Experiment Creation UI
If you have an experimentation platform, embed template fields in the UI. Make OEC definition, guardrails, MDE, and SRM alert configuration required fields. Netflix's modular experiment setup does this well.
Required Alerts
Automate SRM detection, guardrail monitoring, and data quality checks. Alert if logging volume drops, if traffic ratio deviates from expected, or if any guardrail crosses threshold. Do not rely on manual checks.
Onboarding and Training
Use filled examples (like the onboarding screen test above) to train new PMs and data scientists. Abstract instructions fail. Concrete examples showing what "good" looks like are far more effective.
Key Research Underpinning This Template
Fabijan et al. (ICSE 2018, SEIP 2019)
"The Evolution of Continuous Experimentation in Software Product Development" and "The Benefits of Controlled Experimentation at Scale." Emphasises pre-flight checks and structured experimentation workflows at Microsoft.
Kohavi, Tang, Xu (2020)
"Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing." Details SRM as a must-detect failure, metric hierarchy (OEC vs. guardrails), and the "Dirty Dozen" pitfalls in metric definition.
Microsoft Experimentation Platform Guidance
Public documentation on metric taxonomy: OEC, guardrails, diagnostics, and data quality metrics as four required categories.
Netflix Experimentation Workflows
Public blog posts on modular experiment setup, automated pre-flight validation, and scalable decision frameworks.
Gawande (2009)
"The Checklist Manifesto." Demonstrates how checklists reduce catastrophic failures in surgery, aviation, and construction. Fabijan et al. applied the same principle to experimentation.