Guardrail Metrics in A/B Testing
Your success metric improved. Great. But did anything else break? Guardrail metrics are the safety nets that catch what your primary metrics miss.
What Are Guardrail Metrics?
Guardrail metrics are critical business indicators you monitor during an experiment to detect unintended harm. They are not the metrics you expect to improve. They are the metrics you refuse to let deteriorate beyond an acceptable threshold.
Think of them as the smoke detectors of experimentation. Your success metric tells you whether the kitchen renovation looks good. Your guardrail metrics tell you whether the house is on fire.
The term “counter-metrics” is sometimes used interchangeably. Regardless of naming, the purpose is identical: catch negative side effects that your primary metric cannot see.
Trust Guardrails
Validate experiment integrity. The most important is Sample Ratio Mismatch (SRM), which checks whether users were split correctly between control and variant. If the ratio is off, no metric result can be trusted. Every experiment should include this guardrail.
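To make this concrete, here is a minimal sketch of an SRM check as a chi-square goodness-of-fit test in Python; the assignment counts, the 50/50 expected split, and the 0.001 alert threshold are illustrative assumptions, not a prescribed implementation.

```python
# Sample Ratio Mismatch (SRM) check: a chi-square goodness-of-fit test comparing
# the observed assignment counts against the configured traffic split.
# The counts, the 50/50 split, and the 0.001 threshold are illustrative.
from scipy.stats import chisquare

observed = [50_412, 49_288]      # users actually bucketed into control / variant
expected_share = [0.5, 0.5]      # the split the experiment was configured for

total = sum(observed)
expected = [share * total for share in expected_share]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# A very small p-value means the traffic split itself is broken, so no downstream
# metric comparison from this experiment can be trusted.
if p_value < 0.001:
    print(f"SRM detected (p = {p_value:.2e}): investigate assignment before reading metrics")
else:
    print(f"No SRM detected (p = {p_value:.3f})")
```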
Organisational Guardrails
Protect core business metrics that the experiment is not directly targeting. Revenue per user, retention rate, page load time, and customer satisfaction are common choices. These prevent optimising one metric at the expense of everything else.
Why Guardrail Metrics Matter
Every experimentation program has cautionary tales where a single-metric focus caused real damage. These three scenarios, drawn from industry experience, illustrate the pattern.
The Clickbait Algorithm
A recommendation algorithm increased clicks by 40% but tanked customer satisfaction because it pushed low-quality content. The team tracked clicks religiously. Nobody tracked content quality.
The Speed Trap
A checkout flow optimised for speed made it easier for fraudsters to exploit the system. Conversion improved. Fraud losses wiped out the gains and then some.
The Cannibalization
A team celebrated 25% higher feature adoption, only to discover it was cannibalising their premium tier. Free usage went up. Revenue went down.
Airbnb Case Study
At Airbnb, a team ran a test that hid house rules at checkout. Bookings increased, but review ratings (a guardrail metric) declined. Of the thousands of experiments running each month, roughly 25 are flagged by guardrails for review. About 80% of those proceed after stakeholder discussion, and approximately 5 experiments per month are paused, preventing potentially significant damage to critical metrics.
Source: Airbnb Engineering, “Designing Experimentation Guardrails”
Choosing the Right Guardrail Metrics
Not every metric deserves guardrail status. Effective guardrails share five characteristics, as outlined by Optimizely and adopted widely across the industry.
- Relevant: directly tied to a critical business function, not a vanity metric.
- Sensitive: detects small deviations before they become large-scale problems.
- Specific: points toward what went wrong, not just that something changed.
- Timely: moves fast enough to enable corrective action within the test window.
- Actionable: the team can take concrete steps when the guardrail triggers.
Common Guardrail Metrics by Business Model
How to Test Guardrails: Inferiority vs Non-Inferiority
Guardrail metrics require a fundamentally different statistical test than success metrics. You are not trying to prove the variant is better. You are trying to prove it is not meaningfully worse.
Two approaches exist, and they differ in a subtle but critical way.
Inferiority Testing
Ship if there is no evidence of harm.
- Pro: simpler to implement, no margin needed
- Pro: a good starting point for new teams
- Con: absence of evidence is not evidence of absence
- Con: underpowered tests may miss real harm
Non-Inferiority Testing
Ship if there is evidence that harm stays within tolerance.
- Pro: statistically rigorous, active proof of safety
- Pro: forces teams to define the acceptable cost of shipping
- Con: requires defining a non-inferiority margin (NIM)
- Con: needs a larger sample size for adequate power
Practical Example
A team launches a redesigned “Recommended For You” section to increase engagement. Their guardrail metric is “Best Sellers” engagement, which they expect to remain stable. With non-inferiority testing and a 1% NIM, they can ship confidently if Best Sellers engagement drops by no more than 1%. This forces an explicit trade-off conversation: “Is a 0.8% drop in Best Sellers acceptable for a 5% lift in personalised recommendations?”
Source: Spotify Confidence, “Better Product Decisions with Guardrail Metrics”
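As a rough illustration of the example above, here is a minimal Python sketch of such a guardrail check using a normal-approximation confidence bound; the engagement counts, the 1% relative margin, and the one-sided alpha of 0.05 are assumptions for the example, not Spotify's implementation.

```python
# Non-inferiority check for a guardrail metric such as "Best Sellers" engagement.
# The guardrail passes if the lower bound of a one-sided confidence interval on the
# difference (variant - control) stays above the non-inferiority margin.
# Engagement counts, the 1% relative margin, and alpha are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def guardrail_non_inferiority(x_ctrl, n_ctrl, x_var, n_var, rel_margin=0.01, alpha=0.05):
    p_ctrl, p_var = x_ctrl / n_ctrl, x_var / n_var
    diff = p_var - p_ctrl
    se = np.sqrt(p_ctrl * (1 - p_ctrl) / n_ctrl + p_var * (1 - p_var) / n_var)

    # Lower bound of the one-sided (1 - alpha) confidence interval for the difference.
    ci_lower = diff - norm.ppf(1 - alpha) * se

    # Tolerate a drop of at most rel_margin relative to the control rate.
    margin = -rel_margin * p_ctrl
    return {"diff": diff, "ci_lower": ci_lower, "margin": margin, "pass": ci_lower > margin}

# Observed: Best Sellers engagement dips from 12.00% to 11.97% under the variant.
print(guardrail_non_inferiority(x_ctrl=120_000, n_ctrl=1_000_000,
                                x_var=119_700, n_var=1_000_000))
```

In this sketch the guardrail clears the 1% margin only because the sample is large enough to bound the small observed drop tightly; rerun the same observed rates with a tenth of the traffic and it fails, which is exactly why non-inferiority testing needs more sample size than simple inferiority testing.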
For a deeper dive into non-inferiority testing mechanics, confidence intervals, and visual interpretation, read our Non-Inferiority Testing Guide.
The Decision Framework
Spotify formalised a four-metric taxonomy that has become a reference model for the industry. Every metric in an experiment is classified into one of four types, each tested differently.
Success Metrics
Superiority test. The metrics you expect to improve. At least one must show statistically significant improvement.
Guardrail Metrics
Non-inferiority test. Metrics you do not expect to improve but refuse to let deteriorate past a defined threshold.
Deterioration Metrics
Inferiority test. Metrics where you ship unless you find evidence of harm. A lighter alternative to non-inferiority.
Quality Metrics
Various tests. Validate experiment integrity. Includes SRM checks, pre-exposure bias, and data quality tests.
Ship Decision Rule
A variant ships only when all three conditions are met. If any condition fails, the experiment does not ship.
- Success metric: at least one primary metric shows a statistically significant improvement.
- All guardrails pass: every guardrail metric passes its non-inferiority test (not meaningfully worse).
- Quality checks pass: no sample ratio mismatch, no pre-exposure bias, and data integrity confirmed.
Adapted from Schultzberg, Ankargren & Franberg (2024), “Risk-Aware Product Decisions in A/B Tests with Multiple Metrics”, Spotify Engineering
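To show how the rule composes, here is a hedged Python sketch of the decision logic; the data structure and metric names are invented for the example and are not Spotify's implementation.

```python
# Ship decision rule: ship only if (1) at least one success metric shows a
# statistically significant improvement, (2) every guardrail passes its
# non-inferiority test, and (3) the quality checks (SRM, pre-exposure bias,
# data integrity) pass. Metric names and results below are illustrative.
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    significant_improvement: bool = False  # superiority test outcome (success metrics)
    non_inferior: bool = False             # non-inferiority test outcome (guardrails)

def ship_decision(success_metrics, guardrail_metrics, quality_checks_passed):
    any_success = any(m.significant_improvement for m in success_metrics)
    all_guardrails_pass = all(m.non_inferior for m in guardrail_metrics)
    return any_success and all_guardrails_pass and quality_checks_passed

success_metrics = [MetricResult("recommended_for_you_engagement", significant_improvement=True)]
guardrail_metrics = [
    MetricResult("best_sellers_engagement", non_inferior=True),
    MetricResult("revenue_per_user", non_inferior=True),
    MetricResult("page_load_time", non_inferior=False),  # one breached guardrail blocks the ship
]

print(ship_decision(success_metrics, guardrail_metrics, quality_checks_passed=True))  # False
```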
Statistical Power and Multiple Guardrails
Adding guardrail metrics creates a statistical challenge that many teams overlook. While false positive rates (Type I errors) do not need correction for guardrails, false negative rates (Type II errors) do.
The reason is structural. For a ship decision, all guardrails must pass simultaneously. Requiring every guardrail to pass means the chance that a truly harmful variant slips past all of them (a false positive on the ship decision) does not grow as guardrails are added. But the chance that at least one guardrail incorrectly fails on a genuinely safe variant (a false negative) compounds with each additional metric.
The Power Collapse Problem
If each guardrail is independently powered to 80%, combined power drops rapidly: with three guardrails, the probability that all of them correctly pass is roughly 0.8^3 ≈ 51%, and with five it falls to about 33%.
Spotify's Correction Formula
To maintain adequate combined power, adjust the per-metric Type II error rate:
beta* = beta / (G + 1)
Where G is the number of guardrail metrics
Our MDE Calculator and Sample Size Calculator already include this correction. Set the guardrail metrics count and the adjustment is applied automatically.
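For intuition, here is a small Python sketch of both effects, assuming independent guardrail metrics and an overall 80% power target; the guardrail counts are arbitrary.

```python
# Two effects of adding guardrails, assuming independent metrics:
#   1. Naive combined power: each guardrail powered to 80% on its own, so the
#      probability that all of them correctly pass shrinks as (1 - beta) ** G.
#   2. Spotify's correction: split the Type II error budget as beta / (G + 1)
#      across the success metric and the G guardrails, so by a union bound the
#      combined Type II error stays at or below beta.
beta = 0.20  # total Type II error budget, i.e. an 80% power target for the decision

for g in (1, 3, 5, 10):
    naive_combined_power = (1 - beta) ** g
    corrected_beta = beta / (g + 1)
    print(f"G={g:2d}  naive combined power={naive_combined_power:.2f}  "
          f"corrected per-metric beta={corrected_beta:.3f} (power {1 - corrected_beta:.1%})")
```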
[Chart: Combined Power vs Number of Guardrail Metrics]
Maturity Progression: Start Simple
Spotify's experience across 300+ teams reveals a consistent maturity curve. Teams that try to adopt non-inferiority testing on day one typically stall. Teams that start simple and graduate over time sustain adoption.
Start with Inferiority Tests
Ship if there is no evidence of harm. No margin needed, no extra sample size. The goal is to get guardrails into the workflow at all.
Define Non-Inferiority Margins
Start defining acceptable deterioration thresholds for your most critical guardrails. Teams build intuition for what a "1% drop in revenue" actually means.
Full Non-Inferiority Testing
Almost all teams use non-inferiority tests. The statistical language becomes natural. Guardrail discussions are part of every experiment design review.
Key insight from Spotify: “It is a mistake to let perfect be the enemy of good. It is obviously better for product decisions to use guardrail metrics with inferiority tests than to not include guardrail metrics at all.” This aligns with what we see across experimentation team maturity models: progressive adoption beats big-bang transformation.
Common Mistakes
Too many guardrails
With 10 guardrail metrics at alpha 0.05, the false alarm rate is roughly 40%. With 25, it exceeds 72%. Every additional guardrail makes false alarms more likely, blocking safe features from shipping and slowing down the experimentation program.
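The arithmetic behind those figures, assuming independent guardrails each tested at alpha 0.05, is just the complement of every guardrail staying quiet:

```python
# Chance that at least one of G independent guardrails fires a false alarm at
# alpha = 0.05, i.e. blocks a variant that is actually safe.
alpha = 0.05
for g in (1, 5, 10, 25):
    false_alarm_rate = 1 - (1 - alpha) ** g
    print(f"G={g:2d}  false alarm rate ≈ {false_alarm_rate:.0%}")
```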
Not powering for guardrails
Teams calculate sample size for their success metric and assume guardrails will be fine. Without the power correction, combined power collapses with just a handful of guardrails. Use the Spotify beta/(G+1) formula.
Treating guardrails as success metrics
Guardrails should not use superiority tests. If you test a guardrail for superiority and find nothing, that is not evidence of safety. Use non-inferiority or inferiority tests to get the right statistical question.
No pre-defined thresholds
Without a non-inferiority margin defined before the test, teams debate endlessly about whether a 0.3% revenue drop is acceptable. Pre-commitment removes ambiguity. Define what "acceptable harm" means for each guardrail in advance.
Applying multiple testing correction to guardrails
Traditional Bonferroni or Sidak corrections for Type I errors are unnecessary for guardrails. All guardrails must pass simultaneously, so the false positive rate is already controlled. Apply corrections only to success metrics.
Ignoring the Sample Ratio Mismatch
SRM is the most important trust guardrail and should be included in every single experiment. If users were not split correctly, no other metric result is reliable. Check our glossary entry on SRM or run an SRM check with our calculator.
[Chart: False Alarm Rate vs Number of Guardrails]
References
- Schultzberg, Ankargren & Franberg. “Risk-Aware Product Decisions in A/B Tests with Multiple Metrics”. Spotify Engineering, 2024
- Spotify Confidence Team. “Better Product Decisions with Guardrail Metrics”. Spotify Confidence Blog, 2024
- Tatiana Xifara. “Designing Experimentation Guardrails”. Airbnb Engineering Blog
- Optimizely. “Understanding and Implementing Guardrail Metrics”. Optimizely Insights
- Eppo. “What Are Guardrail Metrics? With Examples”. Eppo Blog
- Mixpanel. “Guardrail Metrics: The Complete Guide to Balanced Product Growth”. Mixpanel Blog, 2025
Related Resources
- Analyze your experiment results.
- Plan experiments with proper power analysis.
- Determine your minimum detectable effect.
- Test whether a variant is not worse.
- Learn when a result actually matters.
- How to structure your experimentation org.
Put Your Guardrails Into Practice
Our calculators already support guardrail metric corrections. Set your guardrail count and see how it affects your required sample size and minimum detectable effect.