How to A/B Test Revenue, Duration & AOV
Most A/B testing tools assume every metric is a conversion rate. Revenue per user, session duration, and order value follow different rules. Here is how to handle them.
Your A/B testing tool asks for a conversion rate and a sample size. You enter the numbers, run the test, and get a p-value. Simple enough when you are measuring click-through rates or purchase rates, where each user either converts or does not.
But what happens when you want to test revenue per user, average order value, or session duration? These are not binary outcomes. Each user generates a number, not a yes or no. Most users might generate zero (they did not buy anything), while a handful generate very large values. The statistical approach changes in ways that catch teams off guard.
The biggest difference: you cannot derive variance from a single rate. You need the standard deviation of your metric, computed from real per-user data. Without it, your sample size estimate is meaningless and your significance test is unreliable.
This article covers three things: how significance testing differs for these metrics, how to size your experiment correctly, and most importantly, how to build the per-user data array you need so you can calculate the standard deviation for your specific metric.
Why Conversion Rate Rules Don't Apply Here
Binary metrics (conversion rate, click-through rate) record a 0 or 1 per user. The variance is fully determined by the rate itself: p(1 - p). A z-test handles the comparison. No additional data is needed beyond the counts of successes and totals.
Continuous metrics (revenue, order value, session duration, pages per visit, active days) record a number per user. Most of the distribution clusters near zero while a few outliers sit far to the right. The variance of this distribution depends entirely on how spread out the values are, and you cannot know that without looking at the actual data. The appropriate significance test is a two-sample t-test.
| Aspect | Binary (e.g., Conversion Rate) | Continuous (e.g., Revenue per User) |
|---|---|---|
| Data per user | 0 or 1 | Any number (often £0, sometimes £200+) |
| What you need upfront | Historical conversion rate p | Mean x̄ + standard deviation s |
| Where variance comes from | p(1 - p) | Computed from per-user data |
| Significance test | Z-test / chi-squared | Two-sample t-test |
| Typical traffic needed | Baseline amount | 3 to 10 times more (due to higher variance) |
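To make the contrast concrete, here is a minimal sketch in Python. The binary variance falls straight out of the rate, while the continuous variance has to be computed from a per-user array (the revenue values here are hypothetical):

```python
import statistics

# Binary metric: the variance is fully determined by the rate itself.
p = 0.05                       # historical conversion rate
binary_variance = p * (1 - p)  # 0.0475, no user-level data needed

# Continuous metric: the variance must be computed from per-user
# values, zeros included.
revenue = [0, 0, 19.99, 0, 4.5, 0, 0, 120.0]  # hypothetical per-user data
print(binary_variance)
print(statistics.stdev(revenue))  # sample standard deviation
```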
What Revenue Data Actually Looks Like
*Figure: per-user revenue across 100 users, showing the right-skew typical of spending data.*
How Significance Testing Changes
For binary metrics, you compare two proportions with a z-test. For continuous metrics, you compare two group means with a two-sample t-test. The test measures how far apart the control and variant averages are, relative to the noise in those averages.
The t-test in plain terms
The standard error depends on the standard deviation of your data and the sample size per group. Larger standard deviation means more noise. Larger sample means less noise. The t-value tells you whether the observed difference is large enough to be unlikely due to chance.
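In symbols, one common form is Welch's t-statistic, which does not assume the two groups share a variance:

```latex
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}
```

Here x̄ and s are the mean and sample standard deviation of each group, n is the number of users per group, and the denominator is the standard error described above.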
The critical difference from binary testing: the noise depends on the standard deviation of your data, which you must compute from actual user-level values. There is no shortcut like p(1 - p) for proportions. This is why the standard deviation calculation (covered in detail below) is so important.
In practice, extract per-user values (revenue, session duration, or whatever your metric is) as a column of individual numbers. Compute the standard deviation of that column. Feed it into your test. If you are using our significance calculator, select the t-test option and enter the mean, standard deviation, and sample size for each group.
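If you prefer to run the test in code rather than a calculator, a minimal sketch with SciPy looks like this. The arrays are hypothetical per-user revenue; a real test would have thousands of users per group, not 20:

```python
from scipy import stats

# Hypothetical per-user revenue; the zeros are real data points.
control = [0, 0, 0, 24.99, 0, 0, 0, 0, 12.5, 0,
           0, 0, 0, 0, 89.99, 0, 0, 0, 0, 0]
variant = [0, 0, 34.99, 0, 0, 19.99, 0, 0, 0, 0,
           0, 54.5, 0, 0, 0, 0, 29.99, 0, 0, 0]

# Welch's two-sample t-test; equal_var=False drops the assumption
# that both groups share one variance.
t_stat, p_value = stats.ttest_ind(variant, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```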
Planning Your Sample Size
Getting sample size right is the difference between a test that produces a clear answer and one that wastes weeks of traffic. For continuous metrics, the sample size formula replaces the p(1 - p) variance term with the standard deviation you computed from historical data. If your estimate of the standard deviation is wrong, the sample size will be wrong too.
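A standard per-group formula for a two-sided test looks like this:

```latex
n_{\text{per group}} = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\, s^2}{\Delta^2}
```

Here s is the standard deviation from your historical per-user data, Δ is the absolute minimum detectable effect, and the z terms encode your significance level and power. The binary version of this formula simply substitutes p(1 − p) for s².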
Choose your minimum detectable effect
Decide the smallest change worth detecting. For revenue metrics, a 5% relative lift is a common starting point. Use the MDE calculator to explore trade-offs.
Estimate standard deviation
Pull 1 to 3 months of per-user data from recent control traffic. Compute the sample standard deviation. The section below shows you exactly how to do this.
Set power and significance level
Use 80% power and 5% significance level as defaults. For business-critical revenue tests, consider 90% power, which increases the required sample by roughly 30%.
Calculate and plan
Plug your values into the sample size calculator. Divide the result by your daily traffic to get the test duration.
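As a sketch of the whole calculation in Python (the numbers are taken from the ARPU example later in this article; the function is illustrative, not a substitute for the calculator):

```python
from scipy.stats import norm

def users_per_group(sd, mde_abs, alpha=0.05, power=0.80):
    """Users needed per group to detect an absolute lift of mde_abs
    in a continuous metric whose standard deviation is sd."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a 5% two-sided test
    z_power = norm.ppf(power)           # 0.84 for 80% power
    return 2 * (z_alpha + z_power) ** 2 * sd ** 2 / mde_abs ** 2

# Numbers from the ARPU example below: mean £6.37, SD £20.60.
mde = 0.05 * 6.37                       # 5% relative lift, about £0.32
print(f"{users_per_group(20.60, mde):,.0f}")              # ~66,000
print(f"{users_per_group(20.60, mde, power=0.90):,.0f}")  # ~88,000
```

Note how 90% power raises the requirement by roughly a third, consistent with the rule of thumb above.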
Why Continuous Metrics Need More Traffic
*Figure: illustrative sample size per group at 80% power and 5% significance level.*
Calculating the Standard Deviation You Need
The standard deviation is the single most important input for continuous metric tests. Every sample size estimate and every t-test result depends on it. Here is exactly how to compute it for three common metrics.
The process is the same for every continuous metric:

1. Export one row per user from your analytics tool or data warehouse.
2. Each row contains that user's value for the metric, including zeros (see the sketch after this list).
3. Paste the array of values into the Standard Deviation Calculator.
4. Use the resulting mean and standard deviation in your sample size calculation.
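Steps 1 and 2 are where most mistakes happen, because users with no activity do not appear in event or order tables. A minimal pandas sketch (the table and column names are hypothetical) that keeps those users as zeros:

```python
import pandas as pd

# Hypothetical tables: `users` has one row per user in the experiment,
# `orders` has one row per order.
users = pd.DataFrame({"user_id": [1, 2, 3, 4, 5]})
orders = pd.DataFrame({"user_id": [2, 2, 4],
                       "revenue": [24.99, 12.50, 89.99]})

# Sum revenue per buyer, then reindex against the full user list so
# non-buyers appear as explicit zeros rather than disappearing.
per_user = (orders.groupby("user_id")["revenue"].sum()
                  .reindex(users["user_id"], fill_value=0))

print(per_user.tolist())  # [0.0, 37.49, 0.0, 89.99, 0.0], ready to paste
```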
Example 1: Revenue Per User (ARPU)
E-commerce site, 20 users over one week. Only 3 users made a purchase.
Per-user values (paste into calculator):

0, 0, 0, 24.99, 0, 0, 0, 0, 12.5, 0, 0, 0, 0, 0, 89.99, 0, 0, 0, 0, 0

**Mean:** £6.37
**Standard deviation:** £20.60
The standard deviation is more than 3 times the mean. This is typical for revenue data where most users spend nothing and a few spend a lot.
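You can reproduce these two numbers in a couple of lines of Python if you want to sanity-check the calculator:

```python
import statistics

arpu = [0, 0, 0, 24.99, 0, 0, 0, 0, 12.5, 0,
        0, 0, 0, 0, 89.99, 0, 0, 0, 0, 0]
print(round(statistics.mean(arpu), 2))   # 6.37
print(round(statistics.stdev(arpu), 2))  # 20.6
```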
Example 2: Average Session Duration (minutes)
SaaS product, 20 users. Each value is that user's average session length over 2 weeks.
Per-user values (paste into calculator):

1.2, 4.8, 0.3, 8.1, 2.5, 0.7, 5.2, 15.4, 2, 0.4, 3.8, 7.2, 1.1, 3.3, 6, 1.6, 0.5, 9.7, 3.4, 1.9

**Mean:** 4.0 min
**Standard deviation:** 3.8 min
Session duration is less skewed than revenue but still has meaningful spread. A few power users pull the mean up.
Example 3: Active Days (out of 28)
Mobile app, 20 users. Each value is total days the user was active in a 4-week window.
Per-user values (paste into calculator):

3, 22, 1, 14, 5, 19, 8, 2, 12, 6, 25, 3, 10, 1, 16, 4, 18, 3, 7, 11

**Mean:** 9.5 days
**Standard deviation:** 7.4 days
Active days have a more uniform spread. The SD is still nearly 80% of the mean, meaning you need more traffic than you might expect to detect a shift in engagement.
Use recent control traffic
Pull 1 to 3 months of data from users who were not in any experiment. Seasonality matters: holiday periods inflate variance, so match the time window to when your test will actually run.
One value per user, not per day
Computing standard deviation from daily averages (e.g., daily ARPU) dramatically underestimates the real user-level variance. Always use one row per user with their total or average value for the period.
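A quick simulation shows how big the gap can be. The numbers here are synthetic, but the pattern holds for any purchase-style metric:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_days = 2_000, 14

# Synthetic daily revenue: each user buys on roughly 5% of days.
daily = np.where(rng.random((n_users, n_days)) < 0.05,
                 rng.exponential(30, (n_users, n_days)), 0.0)

per_user = daily.sum(axis=1)     # one value per user: the right input
daily_arpu = daily.mean(axis=0)  # one value per day: the wrong input

print(f"SD of per-user totals: {per_user.std(ddof=1):.2f}")    # large
print(f"SD of daily ARPU:      {daily_arpu.std(ddof=1):.2f}")  # tiny
```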
Always include zeros
Users who did not purchase, did not engage, or had zero activity are real data points. Excluding them is the single most common mistake and will give you a falsely low standard deviation.
Handle extreme outliers carefully
If a few users dominate variance (e.g., one user spending 100x the average), consider winsorising at the 99th percentile. This caps extreme values without removing users and gives a more stable variance estimate.
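A minimal sketch of winsorising with NumPy, on synthetic revenue data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic per-user revenue: ~5% buyers with a long right tail.
revenue = np.where(rng.random(10_000) < 0.05,
                   rng.lognormal(3.0, 1.5, 10_000), 0.0)

# Cap at the 99th percentile instead of dropping the heavy spenders.
cap = np.percentile(revenue, 99)
winsorised = np.minimum(revenue, cap)

print(f"SD before: {revenue.std(ddof=1):.2f}")
print(f"SD after:  {winsorised.std(ddof=1):.2f}")
```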
Checklist Before You Run the Test
Compute standard deviation from per-user data
Pull user-level values, include zeros, and compute the sample standard deviation. Do not guess or use a proxy from a different time period.
Use the right sample size formula
A binary sample size calculator will give the wrong answer for revenue or duration metrics. Use a tool that accepts standard deviation as an input, like our sample size calculator.
Match the measurement window
If your test runs for 2 weeks, compute standard deviation from 2-week per-user totals. A 1-week window will underestimate variance for metrics that accumulate over time (like active days or total revenue).
Segment if variance differs materially
Mobile and desktop users often have different spending patterns. If the standard deviation differs by more than 30% between segments, compute separate estimates and plan accordingly.
Analyse at the user level, not the session level
Per-session analysis introduces correlation issues because one user can have many sessions. Per-user analysis gives a cleaner variance estimate and avoids Simpson's paradox.
Accept that you may need more traffic
High-variance continuous metrics sometimes require 5 to 10 times more traffic than conversion rate tests. If that exceeds your capacity, consider testing a proxy binary metric instead (e.g., purchase rate rather than revenue per user), or explore variance reduction techniques like CUPED.
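If CUPED is unfamiliar, the core idea fits in a few lines. This is a minimal sketch, assuming you can pull a pre-experiment value for each user; all data here is synthetic:

```python
import numpy as np

def cuped_adjust(metric, covariate):
    """CUPED: remove the part of the metric's variance explained by a
    pre-experiment covariate, e.g. each user's revenue in the month
    before the test. The mean is unchanged; the variance shrinks."""
    theta = (np.cov(covariate, metric, ddof=1)[0, 1]
             / np.var(covariate, ddof=1))
    return metric - theta * (covariate - covariate.mean())

# Synthetic data: post-experiment revenue correlated with pre-period revenue.
rng = np.random.default_rng(7)
pre = rng.exponential(10, 5_000)
post = 0.8 * pre + rng.exponential(5, 5_000)

print(f"SD before CUPED: {post.std(ddof=1):.2f}")
print(f"SD after CUPED:  {cuped_adjust(post, pre).std(ddof=1):.2f}")
```

Because the adjusted metric keeps the same mean, the t-test and sample size formulas above apply unchanged, just with a smaller standard deviation.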
Ready to Test Your Continuous Metrics?
Start with the standard deviation calculator, then plan your sample size and run the test.