How to A/B Test Revenue, Duration & AOV
Most A/B testing tools assume every metric is a conversion rate. Revenue per user, session duration, and order value follow different rules. Here is how to handle them.
Your A/B testing tool asks for a conversion rate and a sample size. You enter the numbers, run the test, and get a p-value. Simple enough when you are measuring click-through rates or purchase rates, where each user either converts or does not.
But what happens when you want to test revenue per user, average order value, or session duration? These are not binary outcomes. Each user generates a number, not a yes or no. Most users might generate zero (they did not buy anything), while a handful generate very large values. The statistical approach changes in ways that catch teams off guard.
The biggest difference: you cannot derive variance from a single rate. You need the standard deviation of your metric, computed from real per-user data. Without it, your sample size estimate is meaningless and your significance test is unreliable.
This article covers three things: how significance testing differs for these metrics, how to size your experiment correctly, and most importantly, how to build the per-user data array you need so you can calculate the standard deviation for your specific metric.
Why Conversion Rate Rules Don't Apply Here
Binary metrics (conversion rate, click-through rate) record a 0 or 1 per user. The variance is fully determined by the rate itself: p(1 - p). A z-test handles the comparison. No additional data is needed beyond the counts of successes and totals.
Continuous metrics (revenue, order value, session duration, pages per visit, active days) record a number per user. Most of the distribution clusters near zero while a few outliers sit far to the right. The variance of this distribution depends entirely on how spread out the values are, and you cannot know that without looking at the actual data. The appropriate significance test is a two-sample t-test.
| Aspect | Binary (e.g., Conversion Rate) | Continuous (e.g., Revenue per User) |
|---|---|---|
| Data per user | 0 or 1 | Any number (often £0, sometimes £200+) |
| What you need upfront | Historical conversion rate p | Mean x̄ + standard deviation s |
| Where variance comes from | p(1 - p) | Computed from per-user data |
| Significance test | Z-test / chi-squared | Two-sample t-test |
| Typical traffic needed | Baseline amount | 3 to 10 times more (due to higher variance) |
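To make the contrast concrete, here is a minimal sketch in Python. The binary variance falls straight out of the rate, while the continuous variance has to be computed from a per-user array (the revenue values here are hypothetical):

```python
import statistics

# Binary metric: the variance is fully determined by the rate itself.
p = 0.05                       # historical conversion rate
binary_variance = p * (1 - p)  # 0.0475, no user-level data needed

# Continuous metric: the variance must be computed from per-user
# values, zeros included.
revenue = [0, 0, 19.99, 0, 4.5, 0, 0, 120.0]  # hypothetical per-user data
print(binary_variance)
print(statistics.stdev(revenue))  # sample standard deviation
```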
What Revenue Data Actually Looks Like
*Figure: per-user revenue across 100 users, showing the right-skew typical of spending data.*
How Significance Testing Changes
For binary metrics, you compare two proportions with a z-test. For continuous metrics, you compare two group means with a two-sample t-test. The test measures how far apart the control and variant averages are, relative to the noise in those averages.
The t-test in plain terms
The standard error depends on the standard deviation of your data and the sample size per group. Larger standard deviation means more noise. Larger sample means less noise. The t-value tells you whether the observed difference is large enough to be unlikely due to chance.
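In symbols, one common form is Welch's t-statistic, which does not assume the two groups share a variance:

```latex
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}
```

Here x̄ and s are the mean and sample standard deviation of each group, n is the number of users per group, and the denominator is the standard error described above.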
The critical difference from binary testing: the noise depends on the standard deviation of your data, which you must compute from actual user-level values. There is no shortcut like p(1 - p) for proportions. This is why the standard deviation calculation (covered in detail below) is so important.
In practice, extract per-user values (revenue, session duration, or whatever your metric is) as a column of individual numbers. Compute the standard deviation of that column. Feed it into your test. If you are using our significance calculator, select the t-test option and enter the mean, standard deviation, and sample size for each group.
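If you prefer to run the test in code rather than a calculator, a minimal sketch with SciPy looks like this. The arrays are hypothetical per-user revenue; a real test would have thousands of users per group, not 20:

```python
from scipy import stats

# Hypothetical per-user revenue; the zeros are real data points.
control = [0, 0, 0, 24.99, 0, 0, 0, 0, 12.5, 0,
           0, 0, 0, 0, 89.99, 0, 0, 0, 0, 0]
variant = [0, 0, 34.99, 0, 0, 19.99, 0, 0, 0, 0,
           0, 54.5, 0, 0, 0, 0, 29.99, 0, 0, 0]

# Welch's two-sample t-test; equal_var=False drops the assumption
# that both groups share one variance.
t_stat, p_value = stats.ttest_ind(variant, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```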
Planning Your Sample Size
Getting sample size right is the difference between a test that produces a clear answer and one that wastes weeks of traffic. For continuous metrics, the sample size formula replaces the p(1 - p) variance term with the standard deviation you computed from historical data. If your estimate of the standard deviation is wrong, the sample size will be wrong too.
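A standard per-group formula for a two-sided test looks like this:

```latex
n_{\text{per group}} = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\, s^2}{\Delta^2}
```

Here s is the standard deviation from your historical per-user data, Δ is the absolute minimum detectable effect, and the z terms encode your significance level and power. The binary version of this formula simply substitutes p(1 − p) for s².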
Choose your minimum detectable effect
Decide the smallest change worth detecting. For revenue metrics, a 5% relative lift is a common starting point. Use the MDE calculator to explore trade-offs.
Estimate standard deviation
Pull 1 to 3 months of per-user data from recent control traffic. Compute the sample standard deviation. The section below shows you exactly how to do this.
Set power and significance level
Use 80% power and 5% significance level as defaults. For business-critical revenue tests, consider 90% power, which increases the required sample by roughly 30%.
Calculate and plan
Plug your values into the sample size calculator. Divide the result by your daily traffic to get the test duration.
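As a sketch of the whole calculation in Python (the numbers are taken from the ARPU example later in this article; the function is illustrative, not a substitute for the calculator):

```python
from scipy.stats import norm

def users_per_group(sd, mde_abs, alpha=0.05, power=0.80):
    """Users needed per group to detect an absolute lift of mde_abs
    in a continuous metric whose standard deviation is sd."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a 5% two-sided test
    z_power = norm.ppf(power)           # 0.84 for 80% power
    return 2 * (z_alpha + z_power) ** 2 * sd ** 2 / mde_abs ** 2

# Numbers from the ARPU example below: mean £6.37, SD £20.60.
mde = 0.05 * 6.37                       # 5% relative lift, about £0.32
print(f"{users_per_group(20.60, mde):,.0f}")              # ~66,000
print(f"{users_per_group(20.60, mde, power=0.90):,.0f}")  # ~88,000
```

Note how 90% power raises the requirement by roughly a third, consistent with the rule of thumb above.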
Why Continuous Metrics Need More Traffic
*Figure: illustrative sample size per group at 80% power and 5% significance level.*
Calculating the Standard Deviation You Need
The standard deviation is the single most important input for continuous metric tests. Every sample size estimate and every t-test result depends on it. Here is exactly how to compute it for three common metrics.
The process is the same for every continuous metric:

1. Export one row per user from your analytics tool or data warehouse.
2. Each row contains that user's value for the metric, including zeros (see the sketch after this list).
3. Paste the array of values into the Standard Deviation Calculator.
4. Use the resulting mean and standard deviation in your sample size calculation.
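Steps 1 and 2 are where most mistakes happen, because users with no activity do not appear in event or order tables. A minimal pandas sketch (the table and column names are hypothetical) that keeps those users as zeros:

```python
import pandas as pd

# Hypothetical tables: `users` has one row per user in the experiment,
# `orders` has one row per order.
users = pd.DataFrame({"user_id": [1, 2, 3, 4, 5]})
orders = pd.DataFrame({"user_id": [2, 2, 4],
                       "revenue": [24.99, 12.50, 89.99]})

# Sum revenue per buyer, then reindex against the full user list so
# non-buyers appear as explicit zeros rather than disappearing.
per_user = (orders.groupby("user_id")["revenue"].sum()
                  .reindex(users["user_id"], fill_value=0))

print(per_user.tolist())  # [0.0, 37.49, 0.0, 89.99, 0.0], ready to paste
```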
Example 1: Revenue Per User (ARPU)
E-commerce site, 20 users over one week. Only 3 users made a purchase.
Per-user values (paste into calculator):

0, 0, 0, 24.99, 0, 0, 0, 0, 12.5, 0, 0, 0, 0, 0, 89.99, 0, 0, 0, 0, 0

**Mean:** £6.37
**Standard deviation:** £20.60
The standard deviation is more than 3 times the mean. This is typical for revenue data where most users spend nothing and a few spend a lot.
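You can reproduce these two numbers in a couple of lines of Python if you want to sanity-check the calculator:

```python
import statistics

arpu = [0, 0, 0, 24.99, 0, 0, 0, 0, 12.5, 0,
        0, 0, 0, 0, 89.99, 0, 0, 0, 0, 0]
print(round(statistics.mean(arpu), 2))   # 6.37
print(round(statistics.stdev(arpu), 2))  # 20.6
```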
Example 2: Average Session Duration (minutes)
SaaS product, 20 users. Each value is that user's average session length over 2 weeks.
Per-user values (paste into calculator):

1.2, 4.8, 0.3, 8.1, 2.5, 0.7, 5.2, 15.4, 2, 0.4, 3.8, 7.2, 1.1, 3.3, 6, 1.6, 0.5, 9.7, 3.4, 1.9

**Mean:** 4.0 min
**Standard deviation:** 3.8 min
Session duration is less skewed than revenue but still has meaningful spread. A few power users pull the mean up.
Example 3: Active Days (out of 28)
Mobile app, 20 users. Each value is total days the user was active in a 4-week window.
Per-user values (paste into calculator):

3, 22, 1, 14, 5, 19, 8, 2, 12, 6, 25, 3, 10, 1, 16, 4, 18, 3, 7, 11

**Mean:** 9.5 days
**Standard deviation:** 7.4 days
Active days have a more uniform spread. The SD is still nearly 80% of the mean, meaning you need more traffic than you might expect to detect a shift in engagement.
Use recent control traffic
Pull 1 to 3 months of data from users who were not in any experiment. Seasonality matters: holiday periods inflate variance, so match the time window to when your test will actually run.
One value per user, not per day
Computing standard deviation from daily averages (e.g., daily ARPU) dramatically underestimates the real user-level variance. Always use one row per user with their total or average value for the period.
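A quick simulation shows how big the gap can be. The numbers here are synthetic, but the pattern holds for any purchase-style metric:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_days = 2_000, 14

# Synthetic daily revenue: each user buys on roughly 5% of days.
daily = np.where(rng.random((n_users, n_days)) < 0.05,
                 rng.exponential(30, (n_users, n_days)), 0.0)

per_user = daily.sum(axis=1)     # one value per user: the right input
daily_arpu = daily.mean(axis=0)  # one value per day: the wrong input

print(f"SD of per-user totals: {per_user.std(ddof=1):.2f}")    # large
print(f"SD of daily ARPU:      {daily_arpu.std(ddof=1):.2f}")  # tiny
```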
Always include zeros
Users who did not purchase, did not engage, or had zero activity are real data points. Excluding them is the single most common mistake and will give you a falsely low standard deviation.
Handle extreme outliers carefully
If a few users dominate variance (e.g., one user spending 100x the average), consider winsorising at the 99th percentile. This caps extreme values without removing users and gives a more stable variance estimate.
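A minimal sketch of winsorising with NumPy, on synthetic revenue data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic per-user revenue: ~5% buyers with a long right tail.
revenue = np.where(rng.random(10_000) < 0.05,
                   rng.lognormal(3.0, 1.5, 10_000), 0.0)

# Cap at the 99th percentile instead of dropping the heavy spenders.
cap = np.percentile(revenue, 99)
winsorised = np.minimum(revenue, cap)

print(f"SD before: {revenue.std(ddof=1):.2f}")
print(f"SD after:  {winsorised.std(ddof=1):.2f}")
```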
Checklist Before You Run the Test
Compute standard deviation from per-user data
Pull user-level values, include zeros, and compute the sample standard deviation. Do not guess or use a proxy from a different time period.
Use the right sample size formula
A binary sample size calculator will give the wrong answer for revenue or duration metrics. Use a tool that accepts standard deviation as an input, like our sample size calculator.
Match the measurement window
If your test runs for 2 weeks, compute standard deviation from 2-week per-user totals. A 1-week window will underestimate variance for metrics that accumulate over time (like active days or total revenue).
Segment if variance differs materially
Mobile and desktop users often have different spending patterns. If the standard deviation differs by more than 30% between segments, compute separate estimates and plan accordingly.
Analyse at the user level, not the session level
Per-session analysis introduces correlation issues because one user can have many sessions. Per-user analysis gives a cleaner variance estimate and avoids Simpson's paradox.
Accept that you may need more traffic
High-variance continuous metrics sometimes require 5 to 10 times more traffic than conversion rate tests. If that exceeds your capacity, consider testing a proxy binary metric instead (e.g., purchase rate rather than revenue per user), or explore variance reduction techniques like CUPED.
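If CUPED is unfamiliar, the core idea fits in a few lines. This is a minimal sketch, assuming you can pull a pre-experiment value for each user; all data here is synthetic:

```python
import numpy as np

def cuped_adjust(metric, covariate):
    """CUPED: remove the part of the metric's variance explained by a
    pre-experiment covariate, e.g. each user's revenue in the month
    before the test. The mean is unchanged; the variance shrinks."""
    theta = (np.cov(covariate, metric, ddof=1)[0, 1]
             / np.var(covariate, ddof=1))
    return metric - theta * (covariate - covariate.mean())

# Synthetic data: post-experiment revenue correlated with pre-period revenue.
rng = np.random.default_rng(7)
pre = rng.exponential(10, 5_000)
post = 0.8 * pre + rng.exponential(5, 5_000)

print(f"SD before CUPED: {post.std(ddof=1):.2f}")
print(f"SD after CUPED:  {cuped_adjust(post, pre).std(ddof=1):.2f}")
```

Because the adjusted metric keeps the same mean, the t-test and sample size formulas above apply unchanged, just with a smaller standard deviation.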
Ready to Test Your Continuous Metrics?
Start with the standard deviation calculator, then plan your sample size and run the test.