Welcome to our comprehensive A/B testing glossary, your go-to resource for mastering the language of experimentation and optimisation. Whether you're a seasoned expert or just starting your journey into the world of data-driven decision-making, this glossary serves as a valuable reference guide. Explore clear and concise definitions for a wide range of terms, from foundational concepts to advanced statistical techniques, all tailored specifically for the realm of A/B testing and conversion rate optimisation. Empower yourself with the knowledge to navigate this field confidently and make informed choices that drive meaningful results for your business.
An A/A test is a type of experiment in which the control and treatment groups receive the same experience. It is used to validate the testing framework and ensure that there are no systemic issues or biases.
An A/B test is a controlled experiment that compares two variants (A and B) to determine which one performs better for a specific goal or metric.
A/B testing, also known as split testing or bucket testing, is a method of comparing two versions of a web page, app, or other experience to determine which one performs better.
An A/B/n test is a variation of A/B testing where multiple variants (n) are tested against a control (A) to determine the best-performing variation.
The systematic computational analysis of data or statistics. It is used for the discovery, interpretation, and communication of meaningful patterns in data.
Average Revenue Per User (ARPU) is a metric that calculates the average revenue generated per unique user over a given period.
The absolute difference is the magnitude of the difference between two values, ignoring the sign.
The alternative hypothesis (H1) is the hypothesis that the researcher hopes to support with evidence from the experiment. It represents the possibility of a difference or effect between the control and treatment groups.
The average, also known as the mean, is a measure of central tendency that represents the sum of all values divided by the total number of values.
Average Order Value (AOV) is a metric that calculates the average amount spent per order or transaction.
Alpha (α) is the significance level or the probability of a Type I error, which is the probability of rejecting the null hypothesis when it is true.
Bayesian inference is a statistical method that combines prior knowledge or beliefs with new data or evidence, using Bayes' theorem, to update the probability assigned to a hypothesis or parameter value.
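As a rough illustration of this updating, the sketch below applies a Beta prior to a conversion rate and updates it with observed data. The Beta-Binomial model, the uniform Beta(1, 1) prior, the function names, and the visitor counts are all assumptions chosen for the example rather than part of the definition.

```python
def beta_posterior(prior_alpha, prior_beta, conversions, visitors):
    """Update a Beta(prior_alpha, prior_beta) prior with binomial data.

    Conversions are added to alpha and non-conversions to beta.
    """
    return prior_alpha + conversions, prior_beta + (visitors - conversions)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: 30 conversions out of 200 visitors, uniform prior.
alpha_post, beta_post = beta_posterior(1, 1, conversions=30, visitors=200)
posterior_mean = beta_mean(alpha_post, beta_post)
print(alpha_post, beta_post, round(posterior_mean, 4))
```

With little data the posterior mean stays close to the prior; as visitors accumulate it moves towards the observed conversion rate.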
A type of experiment that compares the performance of a metric before and after a change or treatment is introduced.
A binomial metric is a metric that has only two possible outcomes, such as success or failure, conversion or non-conversion.
The Bonferroni correction is a statistical adjustment used to reduce the chances of obtaining false-positive results (Type I errors) when conducting multiple hypothesis tests.
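A minimal sketch of the adjustment, assuming a list of p-values from independent tests and an overall significance level of 0.05: each p-value is compared against alpha divided by the number of tests. The function name and the example p-values are illustrative only.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which tests remain significant after a Bonferroni correction.

    Each p-value is compared against alpha / m, where m is the number of tests.
    """
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values from three metrics tested in the same experiment.
flags = bonferroni_significant([0.01, 0.02, 0.04], alpha=0.05)
print(flags)
```

Here the per-test threshold is 0.05 / 3, so only the first p-value stays below it.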
Bounce rate is a metric that measures the percentage of visitors who leave a website after viewing only a single page.
Beta (β) is the probability of a Type II error, which is the probability of failing to reject the null hypothesis when the alternative hypothesis is true.
The process of randomly assigning users or traffic to different groups or variations in an A/B test.
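One common way to implement this assignment is deterministic hashing, so that the same user always lands in the same group. The sketch below is an assumption about how this might be done, not a description of any particular platform: the function name, the use of SHA-256, and the equal split between variants are all illustrative choices.

```python
import hashlib


def assign_bucket(user_id, experiment_id, variants=("control", "treatment")):
    """Deterministically map a user to a variant with (roughly) equal probability.

    Hashing the experiment id together with the user id keeps assignments
    stable for a user within one experiment and independent across experiments.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    index = int(digest, 16) % len(variants)
    return variants[index]


# Hypothetical identifiers; the same inputs always give the same variant.
print(assign_bucket("user-123", "checkout-button-test"))
```

Because the assignment depends only on the two identifiers, no lookup table is needed to keep a returning user in the same group.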
Causal inference is the process of drawing conclusions about causal connections from data, using statistical methods and reasoning to determine whether a cause-and-effect relationship exists between variables.
The chi-square test is a statistical test used to determine if there is a significant difference between the observed and expected frequencies of a categorical variable.
The process of clicking through an online advertisement to the advertiser's destination.
Click-Through Rate (CTR) is a metric that measures the ratio of clicks on a specific link or advertisement to the number of impressions or views.
A subset of behavioral analytics that, rather than treating all users in a dataset as one unit, breaks them into related groups (cohorts) for analysis.
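A confidence interval is a range of values, computed from sample data, that is expected to contain the true population parameter at a stated confidence level. For example, a 95% confidence level means that intervals built this way would contain the true value in about 95% of repeated experiments.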
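To make this concrete, the sketch below computes a normal-approximation (Wald) interval for a conversion rate. The choice of the Wald method, the function name, and the example counts are assumptions; other interval methods exist and may behave better for small samples or rates near 0 or 1.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = conversions / visitors
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / visitors)
    return p_hat - z * standard_error, p_hat + z * standard_error


# Hypothetical data: 120 conversions out of 1,000 visitors.
low, high = proportion_confidence_interval(120, 1000)
print(f"{low:.4f} to {high:.4f}")
```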
Conversion funnel optimisation is the process of improving the flow of users or customers through a multi-step process, such as a website's checkout or registration flow, by identifying and addressing bottlenecks or points of high drop-off.
Conversion rate is a metric that measures the percentage of visitors who complete a desired action, such as making a purchase or filling out a form.
The control group is the baseline or unchanged version of an experience against which the treatment group is compared.
Correlation is a statistical measure that indicates the strength and direction of the relationship between two variables.
Call To Action (CTA) is a prompt on a website that tells the user to take some specified action, such as "Sign Up", "Buy Now", or "Click Here".
Data distribution refers to the pattern or shape of the data points in a dataset, which can be visualised using histograms, box plots, or other graphical representations.
The process of making organisational decisions based on actual data rather than intuition or observation alone.
Decision trees are a type of machine learning model that uses a tree-like structure to represent decisions and their potential consequences. They are used for tasks such as classification, regression, and making predictions based on a set of input features.
The difference or change in a metric between the control group and a treatment group in an A/B test.
A dependent variable is the variable being measured or observed in an experiment, and its value depends on the independent variable.
Descriptive statistics are summary measures that describe the characteristics of a dataset, such as mean, median, mode, standard deviation, and quartiles.
Difference-in-differences is a statistical technique used to estimate the effect of a specific intervention or treatment by comparing the changes in outcomes over time between a treated group and a control group.
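The core arithmetic can be sketched as follows: the change in the control group is subtracted from the change in the treated group. The function name and the example values are hypothetical, and the sketch leaves out the regression framing and the parallel-trends assumption that a real analysis would need to address.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Estimate the treatment effect as the difference of the two changes."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical weekly orders per 100 users, before and after a change.
effect = difference_in_differences(
    treated_before=10, treated_after=15,
    control_before=10, control_after=12,
)
print(effect)
```

In this example the treated group rose by 5 and the control group by 2, so the estimated effect of the intervention is 3.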
Effect size is a quantitative measure of the magnitude or strength of a relationship or difference between two groups or variables.
A metric that measures the level of engagement that a piece of created content is receiving from an audience.
The process of determining what changes the user is supposed to see, based on the targeting rules of the experiments.
An experiment is a controlled study or test conducted to investigate the effect of one or more independent variables on a dependent variable.
Experiment duration is the length of time an A/B test or experiment is run to collect data and observe the effects of the variations.
An exposed user is a user who has been shown or exposed to the treatment or variation being tested in an A/B experiment.
The exposure ratio is the ratio of users exposed to the treatment group compared to the control group in an A/B test.
The false discovery rate (FDR) is the expected proportion of false positives among all results declared statistically significant. FDR-controlling procedures, such as Benjamini-Hochberg, are used to account for multiple comparisons in hypothesis testing.
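One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure. The sketch below is a plain-Python illustration of that procedure; the function name, the target rate q, and the example p-values are assumptions made for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Sort the p-values, find the largest rank k with p_(k) <= (k / m) * q,
    and reject every hypothesis whose rank is at most k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            max_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[index] = True
    return rejected


# Hypothetical p-values from four metrics.
print(benjamini_hochberg([0.02, 0.03, 0.035, 0.9], q=0.05))
```

Note the step-up behaviour: a p-value that misses its own threshold can still be rejected if a larger p-value further down the sorted list meets its threshold.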
A false negative is an incorrect result in which a test fails to identify a condition or effect that is actually present.
A false positive is an incorrect result in which a test identifies a condition or effect that is not actually present.
A technique used in software development to enable or disable features or functionalities for specific users or groups.
Fisher's exact test is a statistical test used to analyse contingency tables, particularly when sample sizes are small or when the data violates the assumptions of the chi-square test.
A philosophical approach to statistics that treats probability as a long-run frequency and relies on sample data to make inferences about populations.
Funnel analysis is a technique used to analyse the conversion rate at each step of a multi-step process, such as a checkout or registration flow.
Gamification is the application of game design elements and principles, such as points, badges, leaderboards, and challenges, in non-game contexts to encourage engagement, motivation, and desired behaviors.
The geometric mean is a type of average that is calculated by taking the nth root of the product of n numbers.
A goodness of fit test is a statistical test used to determine whether a sample of data follows a hypothesized distribution or not.
Granular metrics are metrics that are broken down into smaller, more specific sub-metrics or segments, providing more detailed insights into user behavior or performance.
A method of conducting an experiment where the data is analysed periodically, and the experiment is stopped or continued based on predefined stopping rules or criteria.
Growth hacking is a process of rapid experimentation and data-driven strategies to identify and implement the most effective and efficient ways to grow a business or user base.
A secondary metric or KPI that is monitored during an A/B test to ensure that the variations do not have unintended negative consequences.
A graphical representation of data where the individual values contained in a matrix are represented as colors.
The hero is a large banner image prominently placed on a web page, generally in the front and center. It often includes a CTA and is used to drive visitors' attention to a primary goal.
A hypothesis is a proposed explanation or assumption about a phenomenon that is tested through experimentation or observation.
Hypothesis testing is a statistical method used to evaluate the evidence against a null hypothesis and determine whether it should be rejected or not.
A holdout group is a subset of users who are not exposed to any variation in an A/B test, serving as a control group for future experiments.
The number of times a post, ad, or webpage is viewed.
The result of an A/B test where there is not enough evidence to declare a winner or loser, often due to insufficient statistical power.
An independent variable is a factor or condition that is manipulated or varied in an experiment to observe its effect on the dependent variable.
A statistical test used to determine whether a new treatment or intervention is worse than an existing experience.
Innovation is the process of introducing new ideas, methods, or products that create value and drive progress. It involves identifying opportunities, generating creative solutions, and implementing them successfully.
Intent-to-treat analysis is a principle in A/B testing where all participants are analysed in the group they were originally assigned to, regardless of whether they actually received the treatment or completed the experiment.
An interaction effect occurs when the effect of one independent variable on the dependent variable varies depending on the level of another independent variable.
The Kruskal-Wallis test is a non-parametric, rank-based statistical test used to assess whether three or more independent groups come from the same distribution. It is often interpreted as a comparison of medians.
The Kolmogorov-Smirnov test is a non-parametric statistical test used to determine if two samples are drawn from the same continuous distribution.
A standalone web page, created specifically for a marketing or advertising campaign.
Lift is a metric that measures the relative increase or decrease in a target metric (e.g., conversion rate) between the control and treatment groups in an A/B test.
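As a small worked example, lift can be computed as the difference between treatment and control divided by the control value. The function name and the rates below are illustrative assumptions.

```python
def lift(control_value, treatment_value):
    """Relative change of the treatment metric versus the control metric."""
    if control_value == 0:
        raise ValueError("lift is undefined when the control value is zero")
    return (treatment_value - control_value) / control_value


# Hypothetical conversion rates: 10% in control, 12% in treatment.
print(f"{lift(0.10, 0.12):.1%}")
```

A two-percentage-point absolute increase on a 10% baseline corresponds to a lift of about 20%, which is why it helps to state whether a reported change is absolute or relative.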
The likelihood ratio test is a statistical test used to compare the fit of two nested models, one of which is a special case of the other.
Logistic regression is a statistical model used for binary classification problems, where the goal is to predict the probability of an event occurring or not, based on one or more independent variables.
The variation or treatment group that performs worse than the control group in an A/B test, based on the defined success metric.
The Mann-Whitney U test is a non-parametric statistical test used to compare the distributions of two independent groups or samples.
A measure of the uncertainty or potential inaccuracy in a statistical estimate or result, often expressed as a range of values around the calculated value.
Maximum likelihood estimation is a method of estimating the parameters of a statistical model by finding the parameter values that maximize the likelihood of observing the data.
The median is the middle value in a sorted dataset, dividing the data into two equal halves.
A metric is a quantifiable measure used to track and assess the performance of a product, process, or system.
The minimum detectable effect (MDE) is the smallest effect size or difference between the control and treatment groups that an experiment can detect as statistically significant with the chosen statistical power and significance level.
The smallest effect size or difference between the control and treatment groups that would be considered practically or commercially significant, regardless of whether it is statistically significant.
The mode is the value or values that occur most frequently in a dataset.
In statistics, the Greek letter Mu (μ) is used to denote the population mean, which is the average of all the values in a population. It is a parameter of the population and is unknown in most cases, but it can be estimated from a sample.
A multi-armed bandit is a problem in which a decision-maker must choose between multiple options or "arms" to maximize a reward or minimize a cost, while simultaneously learning from the outcomes of previous choices.
The issue that arises when conducting multiple statistical tests simultaneously, increasing the probability of obtaining false-positive results (Type I errors).
A statistical technique used to analyse data that arises from more than one variable.
Multivariate Testing (MVT) is a technique that tests multiple variations of multiple components simultaneously to determine the best combination.
The negative binomial distribution is a probability distribution that models the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified number of failures occur.
A non-inferiority test is a statistical test used to determine whether a new treatment or intervention is not worse than an existing standard by more than a pre-specified margin or non-inferiority margin. If you want to know more about it, we have a dedicated article that goes in-depth about non-inferiority tests.
The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetric and bell-shaped, widely used in statistics and probability theory.
The null hypothesis (H0) is the default or baseline hypothesis in statistical testing, which assumes that there is no significant difference or effect between the groups or variables being studied.
An observational study is a non-experimental study in which researchers observe and measure variables without intervening or manipulating the system.
A comprehensive metric or set of criteria used to evaluate the overall success or failure of an A/B test, often combining multiple metrics or Key Performance Indicators (KPIs) into a single measure.
A statistical test that considers only one direction or tail of a distribution, used when the alternative hypothesis specifies the direction of the effect. It can test for superiority or inferiority, depending on the direction of interest.
The odds ratio is a statistical measure used to quantify the association between an exposure and an outcome, often used in logistic regression and case-control studies.
An overpowered experiment is one with an excessively large sample size, resulting in the ability to detect even trivial or unimportant effects as statistically significant.
A statistical test used to compare the means of two related or paired samples, often used when the same subjects or items are measured under different conditions.
The p-value is a measure of the strength of evidence against the null hypothesis, representing the probability of observing results as extreme as or more extreme than the observed results, assuming the null hypothesis is true.
The Pareto principle, also known as the 80/20 rule, states that roughly 80% of the effects come from 20% of the causes.
The practice of prematurely inspecting the results of an ongoing A/B test, which can lead to biased decisions and increased risk of false-positive findings.
A permutation test is a non-parametric statistical test that involves rearranging or permuting the observed data to estimate the probability of obtaining a particular result under the null hypothesis.
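The idea can be sketched for a difference in means between two groups: pool the observations, reshuffle them many times, and count how often a shuffled split produces a difference at least as large as the observed one. The function name, the number of permutations, the fixed seed, and the sample values are all assumptions for illustration.

```python
import random


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an approximate p-value based on random reshuffles of the pooled data.
    """
    rng = random.Random(seed)
    size_a = len(group_a)
    observed = abs(sum(group_a) / size_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:size_a], pooled[size_a:]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical order values for two small groups.
p_value = permutation_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(p_value)
```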
Personalisation is the process of tailoring products, services, or experiences to individual users or customers based on their preferences, behaviors, and characteristics, with the aim of providing a more relevant and engaging experience.
The placebo effect is a phenomenon in which a person experiences a perceived benefit or improvement in their condition after receiving an inert or sham treatment, due to psychological factors rather than the treatment itself.
The Poisson distribution is a discrete probability distribution that models the number of events occurring in a fixed interval of time or space, given a known average rate and independently of the time since the last event.
The entire group or set of individuals, objects, or observations that a sample is intended to represent or make inferences about.
The probability of correctly rejecting the null hypothesis when the alternative hypothesis is true, or the ability of a test to detect an effect if it exists.
Power analysis is a statistical technique used to determine the minimum sample size required to detect a specified effect size with a desired level of statistical power.
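For a conversion-rate test, a common normal-approximation formula gives the sample size per group from the two rates, the significance level, and the desired power. The sketch below assumes a two-sided test with equal group sizes; the function name and the example rates are hypothetical, and dedicated calculators may use slightly different formulas.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per group to compare two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    n = (z_alpha + z_beta) ** 2 * variance_sum / effect ** 2
    return math.ceil(n)


# Hypothetical scenario: 10% baseline, hoping to detect an increase to 12%.
print(sample_size_per_group(0.10, 0.12))
```

Halving the effect to be detected roughly quadruples the required sample size, since the effect enters the formula squared.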
Practical significance refers to the real-world or practical implications of a statistically significant result, considering factors such as effect size, cost, and relevance.
Propensity analysis is a statistical technique used to estimate the likelihood or propensity of an individual or group to exhibit a particular behavior or characteristic, based on various factors or covariates.
Propensity score matching is a statistical technique used in causal inference to account for confounding factors and estimate the effect of a treatment or intervention by matching treated and control units based on their estimated propensity scores.
A Q-Q (quantile-quantile) plot is a graphical method used to compare the distributions of two datasets or to assess whether a dataset follows a specified theoretical distribution.
A quantile is a value that divides a distribution into equal groups or intervals, such as quartiles, deciles, or percentiles.
A quasi-experiment is a type of study design that lacks random assignment of participants to treatment and control groups, but still aims to establish a cause-and-effect relationship between variables.
Randomization is the process of randomly assigning participants or subjects to different treatment groups in an experiment, minimizing the potential for systematic bias.
A random sample is a subset of a population that is chosen in such a way that each member of the population has an equal chance of being selected.
Regression analysis is a statistical technique used to model the relationship between a dependent variable and one or more independent variables.
The difference or change in a metric between the control group and a treatment group in an A/B test, expressed as a percentage of the control group value.
The relative difference is a measure of the difference between two values, expressed as a proportion or percentage of the larger or reference value.
Resampling is a statistical technique that involves repeatedly drawing samples from a dataset and analysing them to estimate the properties of a population or to test hypotheses.
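The bootstrap is one familiar resampling technique. The sketch below draws repeated samples with replacement and uses the percentiles of the resampled means as an approximate confidence interval. The percentile method, the function name, the seed, and the data are assumptions for illustration.

```python
import random


def bootstrap_mean_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical session durations in minutes.
durations = [12, 15, 9, 20, 14, 18, 11, 16, 13, 17]
low, high = bootstrap_mean_ci(durations)
print(low, high)
```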
The response rate is the percentage of users who complete a desired action or conversion in an experiment or campaign.
The risk ratio is a measure of the relative risk or probability of an event occurring in one group compared to another group, often used in epidemiological studies.
A performance measure used to evaluate the efficiency or profitability of an investment or compare the efficiency of a number of different investments.
A situation in an A/B test where the ratio of users or traffic allocated to the control and treatment groups deviates from the intended or planned ratio.
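A common check for this situation is a chi-square goodness-of-fit test on the observed group sizes. The sketch below handles the two-group case, where the p-value for one degree of freedom can be computed with the standard library. The function name, the 50/50 default split, and the user counts are assumptions for the example.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-group traffic split."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for an intended 50/50 split.
print(srm_p_value(5000, 5400))
```

A very small p-value suggests the imbalance is unlikely to be random noise, so the assignment or logging pipeline is worth investigating before trusting the results.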
Sample size is the number of observations or data points included in a study or experiment.
Sampling bias occurs when the sample selected for a study or experiment is not representative of the target population, leading to inaccurate or biased results.
A scorecard is a visual tool used to measure and compare the performance of a project, campaign, or strategy against predefined metrics or goals.
The process of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers (known as segments).
A sensitivity analysis is a technique used to evaluate the impact of changes in input variables or assumptions on the output or results of a model or analysis.
Sequential testing is a method of conducting an experiment in which data is analysed periodically, and the experiment is stopped or continued based on predefined stopping rules or criteria.
A series of interactions one user takes within a given time frame on your website.
The Shapiro-Wilk test is a statistical test used to determine if a sample of data follows a normal distribution.
A statistical adjustment method used to control for multiple comparisons and reduce the probability of false-positive results (Type I errors) when conducting multiple hypothesis tests.
The signal-to-noise ratio (SNR) is a measure of the strength of a desired signal relative to the level of background noise or interference.
Simpson's paradox is a phenomenon in which a trend or pattern observed in different groups is reversed or contradicted when the groups are combined.
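The numbers below are hypothetical counts chosen only to illustrate the reversal: variant A converts better than B inside each segment, yet B converts better overall because the two variants have very different segment mixes. The segment names and helper functions are assumptions for the example.

```python
# Hypothetical (conversions, visitors) counts per variant and segment.
results = {
    "A": {"segment_1": (81, 87), "segment_2": (192, 263)},
    "B": {"segment_1": (234, 270), "segment_2": (55, 80)},
}


def rate(conversions, visitors):
    """Conversion rate for one segment."""
    return conversions / visitors


def overall_rate(segments):
    """Conversion rate after pooling all segments of one variant."""
    total_conversions = sum(c for c, _ in segments.values())
    total_visitors = sum(n for _, n in segments.values())
    return total_conversions / total_visitors


for segment in ("segment_1", "segment_2"):
    a = rate(*results["A"][segment])
    b = rate(*results["B"][segment])
    print(f"{segment}: A={a:.3f}, B={b:.3f}")

print(f"overall: A={overall_rate(results['A']):.3f}, "
      f"B={overall_rate(results['B']):.3f}")
```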
The process of dividing traffic or users into different groups for an A/B test.
The standard deviation is a measure of the dispersion or spread of a dataset around its mean, indicating the typical distance between data points and the mean.
The standard error is a measure of the precision of a statistic or estimate. It is the standard deviation of the statistic's sampling distribution, indicating how much the estimate would typically vary from sample to sample.
Statistical power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true, or the ability of a test to detect an effect if it exists.
Statistical significance indicates that an observed result or effect would be unlikely to occur by chance alone if the null hypothesis were true. A result is typically declared statistically significant when its p-value falls below the chosen significance level (alpha).
The Student's t-test is a statistical test used to determine if the means of two groups are significantly different from each other.
The primary metric or Key Performance Indicator (KPI) used to evaluate the success of a variation in an A/B test.
A statistical test used to determine whether a new treatment or intervention is better than an existing standard or control.
Survivorship bias is a logical error that occurs when analysis or conclusions are based only on the remaining or surviving entities, ignoring those that did not survive or were eliminated.
The specific metric or Key Performance Indicator (KPI) that is being measured and evaluated in an A/B test or experiment.
A detailed document outlining the objectives, hypotheses, metrics, sample sizes, duration, and other aspects of an A/B test or experiment.
The proportion of traffic or users allocated to each variation in an A/B test.
The group of participants or subjects who receive the experimental or new condition, treatment, or variation in an experiment.
A correct result in which a test correctly identifies the absence of a condition or effect.
A correct result in which a test correctly identifies the presence of a condition or effect.
A statistical test used to determine if the difference between two proportions or percentages is statistically significant.
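A minimal sketch of such a test, assuming a two-sided z-test with a pooled standard error, is shown below. The function name and the conversion counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for the difference between two proportions.

    Returns the z statistic and the p-value.
    """
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)
    )
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 100/1000 conversions in control, 130/1000 in treatment.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(round(z, 3), round(p_value, 4))
```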
A statistical test that considers both tails or directions of a distribution, used when the alternative hypothesis does not specify the direction of the effect.
An experiment that does not have enough statistical power to detect a meaningful effect, often due to an insufficient sample size.
A visitor to a website who is counted only once during a specified time period, regardless of how many times they visit the site.
A person's emotions and attitudes about using a particular product, system or service.
A measure of the spread or dispersion of a set of data points around the mean value.
A different version or treatment that is tested against a control group in an A/B test or experiment.
A person or consumer who visits a website.
A variation of the Student's t-test that is used when the two samples have unequal variances or unequal sample sizes.
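The sketch below computes the test statistic and the approximate degrees of freedom (the Welch-Satterthwaite formula) for two samples. It stops short of a p-value, which would need the t-distribution from a statistics library. The function name and the sample values are illustrative assumptions.

```python
import math
from statistics import mean, variance


def welch_t_statistic(sample_a, sample_b):
    """Return the t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Each term is a sample variance divided by its sample size.
    term_a = variance(sample_a) / n_a
    term_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(term_a + term_b)
    degrees_of_freedom = (term_a + term_b) ** 2 / (
        term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1)
    )
    return t, degrees_of_freedom


# Hypothetical samples with clearly different spreads.
t, df = welch_t_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 4), round(df, 3))
```

When both samples have the same size and variance, the degrees of freedom reduce to the familiar n1 + n2 - 2 of the standard t-test.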
A non-parametric statistical test used to compare two related or paired samples and determine if there is a significant difference between their median values.
The variation or treatment group that performs better than the control group in an A/B test, based on the defined success metric.
A method of evaluating performance by comparing data from one period to the same period in the previous year.
A measure of how many standard deviations a data point is from the mean of a distribution.
A statistical test used to determine if the means of two populations are significantly different, based on the assumption of a normal distribution and typically applied when the population variance is known or the sample size is large.