Guardrail Metrics in A/B Testing
Your success metric improved. Great. But did anything else break? Guardrail metrics are the safety nets that catch what your primary metrics miss.
What Are Guardrail Metrics?
Guardrail metrics are critical business indicators you monitor during an experiment to detect unintended harm. They are not the metrics you expect to improve. They are the metrics you refuse to let deteriorate beyond an acceptable threshold.
Think of them as the smoke detectors of experimentation. Your success metric tells you whether the kitchen renovation looks good. Your guardrail metrics tell you whether the house is on fire.
The term “counter-metrics” is sometimes used interchangeably. Regardless of naming, the purpose is identical: catch negative side effects that your primary metric cannot see.
Trust Guardrails
Validate experiment integrity. The most important is Sample Ratio Mismatch (SRM), which checks whether users were split correctly between control and variant. If the ratio is off, no metric result can be trusted. Every experiment should include this guardrail.
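To make this concrete, here is a minimal sketch of an SRM check as a chi-square goodness-of-fit test in Python; the assignment counts, the 50/50 expected split, and the 0.001 alert threshold are illustrative assumptions, not a prescribed implementation.

```python
# Sample Ratio Mismatch (SRM) check: a chi-square goodness-of-fit test comparing
# the observed assignment counts against the configured traffic split.
# The counts, the 50/50 split, and the 0.001 threshold are illustrative.
from scipy.stats import chisquare

observed = [50_412, 49_288]      # users actually bucketed into control / variant
expected_share = [0.5, 0.5]      # the split the experiment was configured for

total = sum(observed)
expected = [share * total for share in expected_share]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# A very small p-value means the traffic split itself is broken, so no downstream
# metric comparison from this experiment can be trusted.
if p_value < 0.001:
    print(f"SRM detected (p = {p_value:.2e}): investigate assignment before reading metrics")
else:
    print(f"No SRM detected (p = {p_value:.3f})")
```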
Organisational Guardrails
Protect core business metrics that the experiment is not directly targeting. Revenue per user, retention rate, page load time, and customer satisfaction are common choices. These prevent optimising one metric at the expense of everything else.
Why Guardrail Metrics Matter
Every experimentation program has cautionary tales where a single-metric focus caused real damage. These three scenarios, drawn from industry experience, illustrate the pattern.
The Clickbait Algorithm
A recommendation algorithm increased clicks by 40% but tanked customer satisfaction because it pushed low-quality content. The team tracked clicks religiously. Nobody tracked content quality.
The Speed Trap
A checkout flow optimised for speed made it easier for fraudsters to exploit the system. Conversion improved. Fraud losses wiped out the gains and then some.
The Cannibalization
A team celebrated 25% higher feature adoption, only to discover it was cannibalising their premium tier. Free usage went up. Revenue went down.
Airbnb Case Study
At Airbnb, a team ran a test that hid house rules at checkout. Bookings increased, but review ratings (a guardrail metric) declined. Of the thousands of experiments running each month, roughly 25 are flagged by guardrails for review. About 80% of those proceed after stakeholder discussion, and approximately 5 experiments per month are paused, preventing potentially significant damage to critical metrics.
Source: Airbnb Engineering, “Designing Experimentation Guardrails”
Choosing the Right Guardrail Metrics
Not every metric deserves guardrail status. Effective guardrails share five characteristics, as outlined by Optimizely and adopted widely across the industry.
- Relevant: directly tied to a critical business function, not a vanity metric.
- Sensitive: detects small deviations before they become large-scale problems.
- Specific: points toward what went wrong, not just that something changed.
- Timely: moves fast enough to enable corrective action within the test window.
- Actionable: the team can take concrete steps when the guardrail triggers.
Common Guardrail Metrics by Business Model
How to Test Guardrails: Inferiority vs Non-Inferiority
Guardrail metrics require a fundamentally different statistical test than success metrics. You are not trying to prove the variant is better. You are trying to prove it is not meaningfully worse.
Two approaches exist, and they differ in a subtle but critical way.
Inferiority Testing
Ship if there is no evidence of harm.
- Pro: simpler to implement, no margin needed
- Pro: a good starting point for new teams
- Con: absence of evidence is not evidence of absence
- Con: underpowered tests may miss real harm
Non-Inferiority Testing
Ship if there is evidence that harm stays within tolerance.
- Pro: statistically rigorous, active proof of safety
- Pro: forces teams to define the acceptable cost of shipping
- Con: requires defining a non-inferiority margin (NIM)
- Con: needs a larger sample size for adequate power
Practical Example
A team launches a redesigned “Recommended For You” section to increase engagement. Their guardrail metric is “Best Sellers” engagement, which they expect to remain stable. With non-inferiority testing and a 1% NIM, they can ship confidently if Best Sellers engagement drops by no more than 1%. This forces an explicit trade-off conversation: “Is a 0.8% drop in Best Sellers acceptable for a 5% lift in personalised recommendations?”
Source: Spotify Confidence, “Better Product Decisions with Guardrail Metrics”
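As a rough illustration of the example above, here is a minimal Python sketch of such a guardrail check using a normal-approximation confidence bound; the engagement counts, the 1% relative margin, and the one-sided alpha of 0.05 are assumptions for the example, not Spotify's implementation.

```python
# Non-inferiority check for a guardrail metric such as "Best Sellers" engagement.
# The guardrail passes if the lower bound of a one-sided confidence interval on the
# difference (variant - control) stays above the non-inferiority margin.
# Engagement counts, the 1% relative margin, and alpha are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def guardrail_non_inferiority(x_ctrl, n_ctrl, x_var, n_var, rel_margin=0.01, alpha=0.05):
    p_ctrl, p_var = x_ctrl / n_ctrl, x_var / n_var
    diff = p_var - p_ctrl
    se = np.sqrt(p_ctrl * (1 - p_ctrl) / n_ctrl + p_var * (1 - p_var) / n_var)

    # Lower bound of the one-sided (1 - alpha) confidence interval for the difference.
    ci_lower = diff - norm.ppf(1 - alpha) * se

    # Tolerate a drop of at most rel_margin relative to the control rate.
    margin = -rel_margin * p_ctrl
    return {"diff": diff, "ci_lower": ci_lower, "margin": margin, "pass": ci_lower > margin}

# Observed: Best Sellers engagement dips from 12.00% to 11.97% under the variant.
print(guardrail_non_inferiority(x_ctrl=120_000, n_ctrl=1_000_000,
                                x_var=119_700, n_var=1_000_000))
```

In this sketch the guardrail clears the 1% margin only because the sample is large enough to bound the small observed drop tightly; rerun the same observed rates with a tenth of the traffic and it fails, which is exactly why non-inferiority testing needs more sample size than simple inferiority testing.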
For a deeper dive into non-inferiority testing mechanics, confidence intervals, and visual interpretation, read our Non-Inferiority Testing Guide.
The Decision Framework
Spotify formalised a four-metric taxonomy that has become a reference model for the industry. Every metric in an experiment is classified into one of four types, each tested differently.
Success Metrics
Superiority test. The metrics you expect to improve. At least one must show statistically significant improvement.
Guardrail Metrics
Non-inferiority test. Metrics you do not expect to improve but refuse to let deteriorate past a defined threshold.
Deterioration Metrics
Inferiority test. Metrics where you ship unless you find evidence of harm. A lighter alternative to non-inferiority.
Quality Metrics
Various tests. Validate experiment integrity. Includes SRM checks, pre-exposure bias, and data quality tests.
Ship Decision Rule
A variant ships only when all three conditions are met. If any condition fails, the experiment does not ship.
- Success metric: at least one primary metric shows a statistically significant improvement.
- All guardrails pass: every guardrail metric passes its non-inferiority test (not meaningfully worse).
- Quality checks pass: no sample ratio mismatch, no pre-exposure bias, and data integrity confirmed.
Adapted from Schultzberg, Ankargren & Franberg (2024), “Risk-Aware Product Decisions in A/B Tests with Multiple Metrics”, Spotify Engineering
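To show how the rule composes, here is a hedged Python sketch of the decision logic; the data structure and metric names are invented for the example and are not Spotify's implementation.

```python
# Ship decision rule: ship only if (1) at least one success metric shows a
# statistically significant improvement, (2) every guardrail passes its
# non-inferiority test, and (3) the quality checks (SRM, pre-exposure bias,
# data integrity) pass. Metric names and results below are illustrative.
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    significant_improvement: bool = False  # superiority test outcome (success metrics)
    non_inferior: bool = False             # non-inferiority test outcome (guardrails)

def ship_decision(success_metrics, guardrail_metrics, quality_checks_passed):
    any_success = any(m.significant_improvement for m in success_metrics)
    all_guardrails_pass = all(m.non_inferior for m in guardrail_metrics)
    return any_success and all_guardrails_pass and quality_checks_passed

success_metrics = [MetricResult("recommended_for_you_engagement", significant_improvement=True)]
guardrail_metrics = [
    MetricResult("best_sellers_engagement", non_inferior=True),
    MetricResult("revenue_per_user", non_inferior=True),
    MetricResult("page_load_time", non_inferior=False),  # one breached guardrail blocks the ship
]

print(ship_decision(success_metrics, guardrail_metrics, quality_checks_passed=True))  # False
```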
Statistical Power and Multiple Guardrails
Adding guardrail metrics creates a statistical challenge that many teams overlook. While false positive rates (Type I errors) do not need correction for guardrails, false negative rates (Type II errors) do.
The reason is structural. For a ship decision, all guardrails must pass simultaneously. Requiring every guardrail to pass means the chance that a truly harmful variant slips past all of them (a false positive on the ship decision) does not grow as guardrails are added. But the chance that at least one guardrail incorrectly fails on a genuinely safe variant (a false negative) compounds with each additional metric.
The Power Collapse Problem
If each guardrail is independently powered to 80%, combined power drops rapidly: with three guardrails, the probability that all of them correctly pass is roughly 0.8^3 ≈ 51%, and with five it falls to about 33%.
Spotify's Correction Formula
To maintain adequate combined power, adjust the per-metric Type II error rate:
beta* = beta / (G + 1)
Where G is the number of guardrail metrics
Our MDE Calculator and Sample Size Calculator already include this correction. Set the guardrail metrics count and the adjustment is applied automatically.
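For intuition, here is a small Python sketch of both effects, assuming independent guardrail metrics and an overall 80% power target; the guardrail counts are arbitrary.

```python
# Two effects of adding guardrails, assuming independent metrics:
#   1. Naive combined power: each guardrail powered to 80% on its own, so the
#      probability that all of them correctly pass shrinks as (1 - beta) ** G.
#   2. Spotify's correction: split the Type II error budget as beta / (G + 1)
#      across the success metric and the G guardrails, so by a union bound the
#      combined Type II error stays at or below beta.
beta = 0.20  # total Type II error budget, i.e. an 80% power target for the decision

for g in (1, 3, 5, 10):
    naive_combined_power = (1 - beta) ** g
    corrected_beta = beta / (g + 1)
    print(f"G={g:2d}  naive combined power={naive_combined_power:.2f}  "
          f"corrected per-metric beta={corrected_beta:.3f} (power {1 - corrected_beta:.1%})")
```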
[Chart: Combined Power vs Number of Guardrail Metrics]
Maturity Progression: Start Simple
Spotify's experience across 300+ teams reveals a consistent maturity curve. Teams that try to adopt non-inferiority testing on day one typically stall. Teams that start simple and graduate over time sustain adoption.
Start with Inferiority Tests
Ship if there is no evidence of harm. No margin needed, no extra sample size. The goal is to get guardrails into the workflow at all.
Define Non-Inferiority Margins
Start defining acceptable deterioration thresholds for your most critical guardrails. Teams build intuition for what a "1% drop in revenue" actually means.
Full Non-Inferiority Testing
Almost all teams use non-inferiority tests. The statistical language becomes natural. Guardrail discussions are part of every experiment design review.
Key insight from Spotify: “It is a mistake to let perfect be the enemy of good. It is obviously better for product decisions to use guardrail metrics with inferiority tests than to not include guardrail metrics at all.” This aligns with what we see across experimentation team maturity models: progressive adoption beats big-bang transformation.
Common Mistakes
Too many guardrails
With 10 guardrail metrics at alpha 0.05, the false alarm rate is roughly 40%. With 25, it exceeds 72%. Every additional guardrail makes false alarms more likely, blocking safe features from shipping and slowing down the experimentation program.
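The arithmetic behind those figures, assuming independent guardrails each tested at alpha 0.05, is just the complement of every guardrail staying quiet:

```python
# Chance that at least one of G independent guardrails fires a false alarm at
# alpha = 0.05, i.e. blocks a variant that is actually safe.
alpha = 0.05
for g in (1, 5, 10, 25):
    false_alarm_rate = 1 - (1 - alpha) ** g
    print(f"G={g:2d}  false alarm rate ≈ {false_alarm_rate:.0%}")
```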
Not powering for guardrails
Teams calculate sample size for their success metric and assume guardrails will be fine. Without the power correction, combined power collapses with just a handful of guardrails. Use the Spotify beta/(G+1) formula.
Treating guardrails as success metrics
Guardrails should not use superiority tests. If you test a guardrail for superiority and find nothing, that is not evidence of safety. Use non-inferiority or inferiority tests to get the right statistical question.
No pre-defined thresholds
Without a non-inferiority margin defined before the test, teams debate endlessly about whether a 0.3% revenue drop is acceptable. Pre-commitment removes ambiguity. Define what "acceptable harm" means for each guardrail in advance.
Applying multiple testing correction to guardrails
Traditional Bonferroni or Sidak corrections for Type I errors are unnecessary for guardrails. All guardrails must pass simultaneously, so the false positive rate is already controlled. Apply corrections only to success metrics.
Ignoring the Sample Ratio Mismatch
SRM is the most important trust guardrail and should be included in every single experiment. If users were not split correctly, no other metric result is reliable. Check our glossary entry on SRM or run an SRM check with our calculator.
[Chart: False Alarm Rate vs Number of Guardrails]
References
- Schultzberg, Ankargren & Franberg. “Risk-Aware Product Decisions in A/B Tests with Multiple Metrics”. Spotify Engineering, 2024
- Spotify Confidence Team. “Better Product Decisions with Guardrail Metrics”. Spotify Confidence Blog, 2024
- Tatiana Xifara. “Designing Experimentation Guardrails”. Airbnb Engineering Blog
- Optimizely. “Understanding and Implementing Guardrail Metrics”. Optimizely Insights
- Eppo. “What Are Guardrail Metrics? With Examples”. Eppo Blog
- Mixpanel. “Guardrail Metrics: The Complete Guide to Balanced Product Growth”. Mixpanel Blog, 2025
Related Resources
- Analyze your experiment results.
- Plan experiments with proper power analysis.
- Determine your minimum detectable effect.
- Test whether a variant is not worse.
- Learn when a result actually matters.
- How to structure your experimentation org.
Put Your Guardrails Into Practice
Our calculators already support guardrail metric corrections. Set your guardrail count and see how it affects your required sample size and minimum detectable effect.