How to Create an A/B Test Design Template (and Why It Prevents Bad Decisions)
Turn tribal knowledge into institutional rigor.
The fastest way to reduce false wins, missed harms, and "can't trust the data" calls.
You launch an experiment, wait for statistical significance, and ship the "winning" variant. Six weeks later, revenue is flat. What went wrong?
Maybe the metric definition was ambiguous. Maybe you failed to detect a Sample Ratio Mismatch. Maybe the stopping rule was never documented, so the test ran until someone found a p-value they liked. These failures are not technical anomalies. They are systemic failures of process.
A structured test design template is the fastest way to prevent these mistakes. It forces you to define success criteria, document decision rules, configure trustworthy pre-flight checks, and specify exactly when to shut down a test before you launch. This article walks through what a production-ready template looks like and how to operationalise it.
Why You Need a Template
Research from Fabijan et al. (ICSE 2018, SEIP 2019) on large-scale experimentation platforms at Microsoft emphasises that most analysis work happens before the experiment starts. The decisions you make in the design phase determine whether the results will be trustworthy and actionable.
A design template enforces this discipline. It transforms implicit tribal knowledge into explicit institutional knowledge that survives team turnover and scales with the organisation.
Reduces False Wins
Explicit decision rules and MEI (minimum effect of interest) thresholds prevent you from shipping changes whose effects are statistically significant but practically irrelevant.
Catches Instrumentation Issues
Pre-flight checks catch SRM, broken logging, and randomisation errors before they corrupt weeks of data collection.
Prevents Ambiguous Outcomes
When you define "ship", "don't ship", and "iterate" criteria up front, there is no room for post-hoc rationalisation.
Checklists save lives
Atul Gawande's research on surgical checklists showed a 47% reduction in deaths and a 36% reduction in complications. Fabijan et al. applied the same principle to experimentation: explicit checklists prevent catastrophic failures even when experts are involved.
Core Principles
Three principles underpin a trustworthy test design template.
Metric Hierarchy: OEC, Guardrails, Diagnostics, Data Quality
Microsoft's experimentation platform guidance prescribes four types of metrics. Your OEC (Overall Evaluation Criterion) is the primary success metric. Guardrails are metrics you must not degrade (revenue, latency, crashes). Diagnostics help you understand mechanism (clickthrough rate, session length). Data quality metrics ensure trustworthiness: SRM checks, logging volume, join rates.
This hierarchy clarifies decision-making: you ship when the OEC improves beyond the MEI and guardrails are not harmed. You investigate diagnostics only to understand results, not to make decisions.
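One lightweight way to make the hierarchy explicit is to encode it alongside the design doc as a small, machine-readable spec that analysis and alerting code can read. The sketch below is illustrative only; the metric names and thresholds are placeholders, not recommendations.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricSpec:
    name: str
    role: str                # "oec", "guardrail", "diagnostic", or "data_quality"
    good_direction: str      # "increase" or "decrease"
    threshold: Optional[float] = None  # MEI for the OEC, max tolerated harm for guardrails

# Hypothetical metric hierarchy; names and thresholds are placeholders.
METRICS = [
    MetricSpec("day7_retention", "oec", "increase", threshold=0.02),
    MetricSpec("revenue_per_user", "guardrail", "increase", threshold=-0.01),
    MetricSpec("crash_rate", "guardrail", "decrease", threshold=0.01),
    MetricSpec("clickthrough_rate", "diagnostic", "increase"),
    MetricSpec("assignment_srm_p_value", "data_quality", "increase"),
]
```

Keeping the role and threshold next to each metric makes it obvious which numbers are allowed to drive the ship decision and which are merely explanatory.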
Pre-Flight Trust Checks
Fabijan et al. found that most experiment failures stem from instrumentation errors, not analysis mistakes. Pre-flight checks validate that your data pipeline is working before you collect production data.
Critical validations include: instrumentation tests (events fire with the correct schema), randomisation correctness (bucketing logic is deterministic), SRM alert configuration (a deviation from the configured traffic split is detected within 24 hours), overlap checks (no conflicts with other running tests), and bot filtering (automated traffic is excluded).
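The SRM check itself is simple: a chi-square goodness-of-fit test of observed assignment counts against the configured split. A minimal sketch using scipy, with made-up counts:

```python
from scipy.stats import chisquare

def srm_detected(control_users: int, treatment_users: int,
                 expected_split=(0.5, 0.5), alpha: float = 0.001) -> bool:
    """Return True if observed traffic deviates from the configured split."""
    total = control_users + treatment_users
    expected = [total * expected_split[0], total * expected_split[1]]
    stat, p_value = chisquare([control_users, treatment_users], f_exp=expected)
    return p_value < alpha  # p < 0.001 is a common SRM alert threshold

# Example: a 50/50 test that actually delivered 50,700 vs. 49,300 users.
print(srm_detected(50_700, 49_300))  # True -> investigate before trusting any results
```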
Explicit Decision Rules
Ambiguity is the enemy of trustworthiness. Your template must specify, before launch, the exact conditions under which you will ship, iterate, or kill the feature.
Example decision rule: "Ship if OEC improves by at least 2% (MEI) with p < 0.05 and no guardrails degrade by more than 1%. Iterate if OEC shows positive trend but CI lower bound is below MEI. Don't ship if OEC is flat or negative, or if any guardrail degrades significantly."
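Because the rule is fully specified, it can be applied mechanically at decision time. The sketch below mirrors the example rule above; the thresholds are the example's, and the inputs are whatever your analysis produces.

```python
def decide(oec_lift: float, oec_ci_lower: float, p_value: float,
           worst_guardrail_change: float,
           mei: float = 0.02, alpha: float = 0.05,
           guardrail_tolerance: float = -0.01) -> str:
    """Map results to a pre-registered decision.

    oec_lift / oec_ci_lower: relative OEC change and the lower bound of its CI.
    worst_guardrail_change: most negative relative change across guardrails
    (e.g. -0.015 means a 1.5% drop).
    """
    if worst_guardrail_change < guardrail_tolerance:
        return "don't ship (guardrail harm)"
    if oec_lift >= mei and p_value < alpha:
        return "ship"
    if oec_lift > 0 and oec_ci_lower < mei:
        return "iterate (positive trend, CI lower bound below MEI)"
    return "don't ship (flat or negative OEC)"
```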
The 9-Section Template Walkthrough
A production-ready test design template contains nine sections. Each section addresses a specific failure mode documented in experimentation research.
Summary
Basic metadata for searchability and accountability.
Hypothesis and Decision Rules
State the hypothesis in measurable terms and define exactly when you will ship, iterate, or kill the feature.
Hypothesis Template:
"If we [change], then [outcome with magnitude], because [reasoning]."
Decision Rules:
- Ship: OEC improves by MEI+, guardrails safe
- Iterate: Positive trend but CI below MEI
- Don't ship: Flat, negative, or guardrail harm
Experiment Setup
Specify how the experiment is configured.
Metrics
The most critical section. Ambiguous metric definitions are among the most common pitfalls catalogued in the "Dirty Dozen" paper; the sketch after this section shows what an unambiguous, numerator-over-denominator definition looks like in code.
OEC / Primary Metric
- Exact definition with numerator/denominator
- Direction of good (increase/decrease)
- Measurement window (7-day, per-session)
- Owner and calculation method
Guardrails (2-5 metrics)
Revenue, latency, crash rate, DAU retention. Metrics you must not harm.
Diagnostics
Click rate, page views, session length. For understanding mechanism.
Data Quality
SRM checks, logging volume, join rates. Ensures trustworthiness.
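To make "exact definition with numerator/denominator" concrete, here is a pandas sketch that computes a Day-7-retention-style OEC from an exposure table and an activity table. The column names and table shapes are assumptions for illustration.

```python
import pandas as pd

def day7_retention(exposures: pd.DataFrame, activity: pd.DataFrame) -> pd.Series:
    """OEC per variant: users active on day 7 / users exposed (user-level, deduplicated).

    exposures: one row per exposed user, columns [user_id, variant, exposure_date].
    activity:  one row per active user-day, columns [user_id, activity_date].
    Date columns are assumed to be datetime64.
    """
    exposed = exposures.drop_duplicates("user_id")
    merged = exposed.merge(activity, on="user_id", how="left")
    merged["days_since_exposure"] = (
        merged["activity_date"] - merged["exposure_date"]
    ).dt.days
    per_user = (
        merged.assign(active=lambda d: d["days_since_exposure"].eq(7))
        .groupby("user_id", as_index=False)
        .agg(variant=("variant", "first"), active=("active", "max"))
    )
    # Numerator: users with any activity exactly 7 days after exposure.
    # Denominator: all exposed users (left join keeps users with no activity).
    return per_user.groupby("variant")["active"].mean()
```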
Power and Duration Plan
Document the statistical inputs that determine test duration.
Variance Reduction: Document if using CUPED, stratification, or other techniques.
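If you plan to use CUPED, also document the pre-experiment covariate and the period it is computed over. The adjustment itself is only a few lines, as the numpy sketch below shows (it assumes every unit has a pre-experiment value of the covariate):

```python
import numpy as np

def cuped_adjust(metric: np.ndarray, pre_metric: np.ndarray) -> np.ndarray:
    """CUPED: remove the part of the metric explained by a pre-experiment covariate.

    theta is the regression coefficient of metric on pre_metric; the adjusted
    metric keeps the same mean but has lower variance, shrinking confidence intervals.
    """
    theta = np.cov(metric, pre_metric)[0, 1] / np.var(pre_metric, ddof=1)
    return metric - theta * (pre_metric - pre_metric.mean())
```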
Trustworthy Pre-Flight Checklist
Validate the data pipeline before production launch. Fabijan et al. emphasise this is where most failures can be caught.
Monitoring and Stop Rules
Define what triggers rollback or shutdown. This prevents tests from running indefinitely or causing harm.
Immediate Rollback Triggers:
- Guardrail degrades by more than X%
- SRM detected (p < 0.001 on traffic ratio)
- Crash rate spikes above threshold
- P99 latency exceeds Y ms
On-Call and Escalation:
Who gets paged for different alert types, escalation path, rollback procedure.
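The rollback triggers above should be evaluated by an automated check on a schedule, not by someone remembering to look at a dashboard. A minimal sketch, with hypothetical input fields and placeholder thresholds:

```python
def fired_rollback_triggers(snapshot: dict,
                            max_guardrail_drop: float = -0.02,
                            srm_alpha: float = 0.001,
                            max_crash_rate: float = 0.005,
                            max_p99_latency_ms: float = 1500) -> list:
    """Return the rollback triggers fired by a monitoring snapshot.

    snapshot keys (hypothetical): worst_guardrail_change, srm_p_value,
    crash_rate, p99_latency_ms.
    """
    fired = []
    if snapshot["worst_guardrail_change"] < max_guardrail_drop:
        fired.append("guardrail degradation")
    if snapshot["srm_p_value"] < srm_alpha:
        fired.append("sample ratio mismatch")
    if snapshot["crash_rate"] > max_crash_rate:
        fired.append("crash rate above threshold")
    if snapshot["p99_latency_ms"] > max_p99_latency_ms:
        fired.append("p99 latency above threshold")
    return fired
```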
Analysis Plan
Specify the statistical approach for each metric type.
Risks and Ethics
Document potential harms and mitigations, especially for sensitive features or personalization.
Filled Example: New Onboarding Screen
Here is a realistic example showing what a completed template looks like for a mobile app onboarding change.
1. Summary
Name: onboarding_screen_simplification_v2
Owner: Alex Chen (PM)
Reviewers: Data Science (Jamie), Eng (Taylor), Legal (Morgan)
Links: PRD [link], Dashboard [link], Code PR [link]
Launch: 2025-04-20, Decision by: 2025-05-10
2. Hypothesis and Decision
Hypothesis: If we reduce onboarding from 5 screens to 2 screens (remove demographics questions), then Day-7 retention will improve by at least 3 percentage points, because users abandon due to onboarding friction.
Ship if: Day-7 retention improves by ≥3 percentage points (MEI), p < 0.05, and no guardrail degrades by >1%.
Iterate if: Positive trend but CI lower bound <3 percentage points.
Don't ship if: Flat, negative, or guardrail harm.
3. Experiment Setup
Variants: Control (5 screens) vs. Treatment (2 screens)
Randomisation Unit: User ID
Assignment: Hash-based, Layer 3
Traffic Split: 50/50
Eligibility: New installs, iOS + Android
Exposure: On onboarding screen load
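"Hash-based" assignment means the variant is a deterministic function of the user ID and an experiment-specific salt, so a user always sees the same variant and assignments can be recomputed offline. A minimal sketch of the idea (the salt and split are illustrative; real platforms layer seed management and mutually exclusive layers on top):

```python
import hashlib

def assign_variant(user_id: str,
                   salt: str = "onboarding_screen_simplification_v2",
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a user to control or treatment."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

assert assign_variant("user_123") == assign_variant("user_123")  # stable across calls
```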
4. Metrics
OEC: Day-7 Retention
Definition: Users active on Day 7 / Users exposed
Direction: Increase, Window: 7 days from exposure
Owner: Data Science, Calculation: user-level binary indicator
Guardrails:
- Revenue per new user (7-day): must not degrade >1%
- Crash rate: must stay <0.5%
- Day-1 retention: monitor, flag if drops >2%
Diagnostics:
- Onboarding completion rate
- Time spent in onboarding
- Drop-off by screen
Data Quality:
- SRM check (traffic ratio 50:50 within 2%)
- Event logging volume (stable ±10%)
5. Power and Duration
Baseline: Day-7 retention = 42%
MDE: 3 percentage points (relative lift 7.1%)
Alpha: 0.05 (two-tailed), Power: 80%
Required Sample: ≈4,300 users per variant (see the sketch below)
Duration: 14 days (traffic: ~800 eligible new installs/day, ~400 per variant at full ramp)
Ramp: 10% Day 1-2, 50% Day 3-4, 100% Day 5-14
Variance Reduction: None (new installs have no pre-exposure data, so CUPED is not applicable)
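The required sample size follows from the stated inputs and can be reproduced with statsmodels (a sketch, assuming statsmodels is installed; the arcsine approximation it uses lands at roughly 4,300 users per variant):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.42   # Day-7 retention in control
mde_abs = 0.03    # 3 percentage points

# Cohen's h for the two proportions, then solve for n per group.
effect_size = proportion_effectsize(baseline + mde_abs, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(round(n_per_variant))  # ≈ 4,286 -> plan for ~4,300 users per variant
```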
6. Pre-Flight Checklist
Instrumentation: Tested on staging, events fire correctly
Randomisation: Hash-based bucketing validated
SRM Alert: Configured to fire on p < 0.001 traffic imbalance
Overlap: No conflicts with other onboarding tests
Bot Filtering: SDK bots excluded
7. Stop Rules
Rollback if:
- Crash rate >0.5% (immediate)
- Revenue per user drops >2% after 7 days
- SRM detected (p < 0.001)
On-Call: Alex (PM), escalate to Taylor (Eng) for rollback
8. Analysis Plan
Test: Two-proportion z-test (see the sketch below)
Corrections: None (single primary metric, so no multiple-testing adjustment)
Slicing: iOS vs. Android, country (US, GB, CA)
Null handling: If CI includes zero, report as inconclusive
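The analysis itself is then a few lines with statsmodels (a sketch with made-up final counts, assuming a recent statsmodels version for the confidence-interval helper):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

# Made-up final counts: retained users and exposed users, treatment then control.
retained = np.array([2007, 1881])
exposed = np.array([4480, 4478])

z_stat, p_value = proportions_ztest(retained, exposed)
ci_low, ci_high = confint_proportions_2indep(
    retained[0], exposed[0], retained[1], exposed[1], compare="diff"
)
lift = retained[0] / exposed[0] - retained[1] / exposed[1]
print(f"lift={lift:.3f}, p={p_value:.4f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

With these invented counts the lift is about 2.8 percentage points and statistically significant, but the confidence interval's lower bound sits well below the 3-point MEI, so the pre-registered rules call this "iterate", not "ship".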
9. Risks and Ethics
User Harm: None expected (simplification improves UX)
Compliance: No PII collected in new flow, GDPR compliant
Mitigations: Monitor for bias in retention by demographic (post-hoc analysis)
Copy-Ready Template Scaffold
Below is a blank template scaffold your team can copy into Notion, Confluence, or Google Docs. Replace placeholders with your experiment details.
# Experiment Design Template

## 1. Summary
- **Experiment Name:** [short_identifier]
- **Owner:** [Name, Role]
- **Reviewers:** [Stakeholder 1], [Stakeholder 2], [Stakeholder 3]
- **Links:**
  - PRD: [link]
  - Dashboard: [link]
  - Code PR: [link]
- **Launch Date:** [YYYY-MM-DD]
- **Decision Deadline:** [YYYY-MM-DD]

## 2. Hypothesis and Decision Rules
**Hypothesis:** If we [change], then [outcome with magnitude], because [reasoning].

**Decision Rules:**
- **Ship if:** [OEC criteria + guardrail safety]
- **Iterate if:** [positive trend but below MEI]
- **Don't ship if:** [flat, negative, or guardrail harm]

## 3. Experiment Setup
- **Variants:** Control vs. Treatment (or A/B/C)
- **Randomisation Unit:** [User ID, Session, Device]
- **Assignment:** [Hash-based, platform layer]
- **Traffic Split:** [50/50, 90/10, etc.]
- **Eligibility:** [Who is included]
- **Exposure Definition:** [When users are counted]

## 4. Metrics

### OEC / Primary Metric
- **Metric:** [Name]
- **Definition:** [Exact formula: numerator / denominator]
- **Direction of Good:** [Increase / Decrease]
- **Measurement Window:** [7-day, per-session, etc.]
- **Owner:** [Team/Person]
- **Calculation Method:** [User-level aggregation, filters]

### Guardrails (2-5 metrics)
1. **[Guardrail 1]:** [Definition, threshold]
2. **[Guardrail 2]:** [Definition, threshold]
3. **[Guardrail 3]:** [Definition, threshold]

### Diagnostics
- [Diagnostic 1]: [Purpose]
- [Diagnostic 2]: [Purpose]
- [Diagnostic 3]: [Purpose]

### Data Quality Metrics
- **SRM Check:** [Traffic ratio expectation, alert threshold]
- **Logging Volume:** [Expected daily volume, ±X% acceptable]
- **Join Rates:** [If applicable]

## 5. Power and Duration Plan
- **Baseline Rate:** [Historical value]
- **MDE:** [Minimum detectable effect]
- **Alpha:** [0.05 typical]
- **Power:** [80%, 90%]
- **Required Sample Size:** [N per variant]
- **Planned Duration:** [Days/weeks]
- **Ramp Plan:** [10% → 50% → 100%]
- **Variance Reduction:** [CUPED, stratification, or none]

## 6. Trustworthy Pre-Flight Checklist
- [ ] **Instrumentation Validation:** Events fire correctly on test traffic
- [ ] **Randomisation Check:** Bucketing logic is deterministic
- [ ] **SRM Alert Configuration:** Alert fires on traffic imbalance
- [ ] **Overlap Check:** No conflicts with other running tests
- [ ] **Bot Filtering:** Automated traffic excluded
- [ ] **Sanity-Check Slices:** No unexpected segment imbalance

## 7. Monitoring and Stop Rules

### Immediate Rollback Triggers
- Guardrail degrades by more than [X%]
- SRM detected (p < 0.001 on traffic ratio)
- Crash rate exceeds [threshold]
- Latency exceeds [Y ms]

### On-Call and Escalation
- **Primary:** [Name]
- **Escalation:** [Name]
- **Rollback Procedure:** [Link to runbook]

## 8. Analysis Plan
- **Statistical Test:** [Two-proportion z-test, t-test, etc.]
- **Multiple Comparisons:** [Correction method if applicable]
- **Slicing Plan:** [Segments to analyse]
- **Null/Ambiguous Handling:** [How to interpret edge cases]

## 9. Risks and Ethics
- **User Harm:** [Potential harms, especially to vulnerable segments]
- **Compliance:** [GDPR, COPPA, other privacy constraints]
- **Mitigations:** [Safeguards for sensitive data or personalization]

---

**Approvals:**
- [ ] PM: [Name, Date]
- [ ] Data Science: [Name, Date]
- [ ] Engineering: [Name, Date]
- [ ] Legal (if required): [Name, Date]
Operationalising the Template
A template is only valuable if it is used. Here is how to bake it into your experimentation workflow.
PR Checklist Integration
Require a link to a filled design doc in every experiment PR. Gate deployments on documented metrics, stop rules, and SRM alert configuration. No launch without a design doc.
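One lightweight way to enforce the gate is a CI step that fails when the PR description contains no design-doc link. The sketch below is hypothetical: it assumes the PR body is exposed to the job as an environment variable named PR_BODY and that design docs live in Notion, Confluence, or Google Docs.

```python
import os
import re
import sys

# Hypothetical CI gate: fail the build if the PR description lacks a design-doc link.
pr_body = os.environ.get("PR_BODY", "")
has_design_doc = re.search(
    r"https?://\S*(notion|confluence|docs\.google)\S*", pr_body, re.IGNORECASE
)

if not has_design_doc:
    print("Missing experiment design doc link in PR description.")
    sys.exit(1)
print("Design doc link found.")
```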
Experiment Creation UI
If you have an experimentation platform, embed template fields in the UI. Make OEC definition, guardrails, MDE, and SRM alert configuration required fields. Netflix's modular experiment setup does this well.
Required Alerts
Automate SRM detection, guardrail monitoring, and data quality checks. Alert if logging volume drops, if traffic ratio deviates from expected, or if any guardrail crosses threshold. Do not rely on manual checks.
Onboarding and Training
Use filled examples (like the onboarding screen test above) to train new PMs and data scientists. Abstract instructions fail. Concrete examples showing what "good" looks like are far more effective.
Key Research Underpinning This Template
Fabijan et al. (ICSE 2018, SEIP 2019)
"The Evolution of Continuous Experimentation in Software Product Development" and "The Benefits of Controlled Experimentation at Scale." Emphasises pre-flight checks and structured experimentation workflows at Microsoft.
Kohavi, Tang, Xu (2020)
"Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing." Details SRM as a must-detect failure, metric hierarchy (OEC vs. guardrails), and the "Dirty Dozen" pitfalls in metric definition.
Microsoft Experimentation Platform Guidance
Public documentation on metric taxonomy: OEC, guardrails, diagnostics, and data quality metrics as four required categories.
Netflix Experimentation Workflows
Public blog posts on modular experiment setup, automated pre-flight validation, and scalable decision frameworks.
Gawande (2009)
"The Checklist Manifesto." Demonstrates how checklists reduce catastrophic failures in surgery, aviation, and construction. Fabijan et al. applied the same principle to experimentation.