A/B Test Calculator

Enter the visitor and conversion counts for your control (A) and variation (B) to instantly get the p-value, z-score, relative uplift, and observed statistical power. Choose your confidence level and hypothesis type, then check the "show your work" panel to see every step of the calculation. The planning section tells you how many visitors you need before you start.

Your details

Switch between analysing a finished test and calculating required sample size.
Total unique visitors (or sessions) shown the control variant.
Number of visitors who converted (clicked, signed up, purchased, etc.) on the control.
Total unique visitors (or sessions) shown the variation.
Number of visitors who converted on the variation.
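The confidence level is 1 minus the significance threshold (alpha): at 95% confidence, a test of two truly identical variants would falsely reach significance about 5% of the time.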
Two-sided detects any difference (positive or negative). One-sided tests only for improvement; it reaches significance faster but cannot catch regressions.
Relative uplift: 20.0% (significant improvement)
Percentage change in conversion rate from A to B

Conversion rate A: 4.00%
Conversion rate B: 4.80%
Z-score: 1.95
P-value: 0.0511
Statistically significant: Yes
Observed power: 99%
Standard error A: 0.00277
Standard error B: 0.00302
Sample size per variation: -
Total visitors needed: -
Target conversion rate B: -

Chart: conversion rate (%) and standard error for Control (A) and Variation (B).

Result is statistically significant at 95% confidence.

  • Control conversion rate: 4.00%. Variation conversion rate: 4.80%.
  • Observed power: 99%.
  • The variation shows a 20.0% relative improvement over the control.

Next step: Consider rolling out the variation and monitoring long-term performance to confirm the lift holds outside the test window.

What is an A/B test and why does statistical significance matter?

An A/B test (also called a split test or randomised controlled experiment) randomly divides your audience into two groups. Group A sees the original (control) and group B sees the changed version (variation). By measuring the conversion rate of each group, you can assess whether any difference in performance is real or just random variation. Statistical significance answers the question: how likely is it that a result this large, or more extreme, would appear purely by chance if the two versions were actually identical? A p-value below the significance threshold (typically p < 0.05 for a 95% confidence level) means the observed difference is considered unlikely to be due to chance.

How the calculator works: the two-proportion z-test

This calculator uses the two-proportion z-test, the standard frequentist method for comparing conversion rates. The pooled conversion rate (total conversions divided by total visitors) is used to estimate the standard error of the difference under the null hypothesis that the two rates are equal. The z-score is the distance between the two observed rates, measured in those standard error units. Converting the z-score through the standard normal distribution gives the p-value: the probability of seeing a z-score at least this extreme under the null (in either direction for a two-sided test, in the direction of improvement for a one-sided test). The result is statistically significant when the p-value is smaller than 1 minus your chosen confidence level (the significance threshold, alpha). The calculator supports both two-sided (detecting any difference) and one-sided (detecting only improvement) hypothesis types.
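
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function name is hypothetical, and the input counts (5,000 visitors per variant with 200 and 240 conversions) are assumptions chosen to give 4.00% and 4.80% conversion rates.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(visitors_a, conversions_a,
                          visitors_b, conversions_b, two_sided=True):
    """Pooled two-proportion z-test; returns (z, p_value)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b

    # Pooled rate: total conversions divided by total visitors.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)

    # Standard error of the difference under the null (equal rates).
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

    # Distance between the observed rates in standard-error units.
    z = (rate_b - rate_a) / se

    normal = NormalDist()
    if two_sided:
        p_value = 2 * (1 - normal.cdf(abs(z)))  # both tails
    else:
        p_value = 1 - normal.cdf(z)  # upper tail only (B better than A)
    return z, p_value


# Assumed example inputs, not taken from the calculator itself.
z, p = two_proportion_z_test(5000, 200, 5000, 240)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

With these assumed counts the sketch gives a z-score of about 1.95 and a two-sided p-value of about 0.051; passing two_sided=False halves the p-value because the effect is in the tested direction.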

Statistical power and why underpowered tests mislead you

Statistical power is the probability that your test will detect a real effect if one exists. A power of 80% means that if variation B truly converts 20% better than control, there is an 80% chance your test will reach significance. Low power (below 70-80%) makes false negatives common: you will frequently conclude "no difference" even when the variation really is better. Observed power in the evaluate tab is the power the test would have had if the true effect were equal to the observed one, given the sample sizes. If observed power is low and the result is not significant, the test is likely underpowered, and the result is not conclusive evidence that there is no effect. The plan tab lets you calculate the required sample size before you start, using the Fleiss formula, so that your test has enough visitors to detect your target effect reliably.
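
For a sense of how such a sample size calculation works, here is a minimal Python sketch of one commonly cited form of the Fleiss formula for two proportions, without continuity correction. The calculator's exact variant is not stated on this page, so treat the formula choice, the function name, and the parameter names as assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_mde,
                              confidence=0.95, power=0.80, two_sided=True):
    """Visitors needed per variation to detect a relative lift over baseline."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # target conversion rate for B
    alpha = 1 - confidence

    normal = NormalDist()
    # Critical value for the chosen confidence level and hypothesis type.
    z_alpha = normal.inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    # Quantile corresponding to the desired power.
    z_beta = normal.inv_cdf(power)

    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Assumed example: 4% baseline, 20% relative lift, 95% confidence, 80% power.
print(sample_size_per_variation(0.04, 0.20))
```

Under these assumptions the sketch asks for roughly 10,300 visitors per variation. Halving the minimum detectable effect roughly quadruples the requirement, which is why small lifts need long tests.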

Avoiding common A/B testing mistakes

Peeking at results before the required sample size is reached inflates the false positive rate. Each intermediate check is a new opportunity for a random fluctuation to cross the significance threshold, so running until the result looks significant will eventually produce a false positive even when there is no real effect. Stop the test only after collecting the planned number of visitors. Be careful about one-sided hypotheses: they reach significance with less data, but they cannot detect regressions, so a harmful variation can slip through. Use two-sided tests unless you have a very strong prior reason to expect only improvement. Finally, watch for sample ratio mismatch (SRM): if visitors are not split as expected (for example, 5,300 in one group and 4,700 in the other when you specified 50/50), something is wrong with your randomisation or tracking, and the result should not be trusted regardless of significance.
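
A sample ratio mismatch can be checked with a chi-square goodness-of-fit test on the visitor counts. The Python sketch below is one way to do it and is not part of the calculator; the function name, the example counts, and the 0.001 alert threshold (a common convention for SRM checks) are assumptions.

```python
from math import erfc, sqrt


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a

    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)

    # Survival function of chi-square with 1 df: P(Z^2 > chi2) = erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts for a planned 50/50 split.
for a, b in [(5000, 5000), (5050, 4950), (5300, 4700)]:
    p = srm_p_value(a, b)
    flag = "possible SRM" if p < 0.001 else "ok"  # assumed threshold
    print(f"{a} vs {b}: p = {p:.3g} ({flag})")
```

A perfectly even split gives a p-value of 1, a small imbalance such as 5,050 vs 4,950 is well within chance, and a split like 5,300 vs 4,700 is flagged.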

Confidence level and significance threshold reference

Confidence level   Alpha (two-sided)   Critical z (two-sided)   Critical z (one-sided)
90%                0.10                1.645                    1.282
95%                0.05                1.960                    1.645
99%                0.01                2.576                    2.326

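Standard frequentist thresholds. The p-value must fall below alpha for significance. A one-sided test places all of alpha in one tail, so its critical z is smaller, and its p-value is half the two-sided p-value when the effect is in the tested direction.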
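
The critical values in this table can be reproduced from the standard normal distribution. A short Python sketch using only the standard library follows; the function name is hypothetical.

```python
from statistics import NormalDist


def critical_z(confidence, two_sided=True):
    """Critical z for a confidence level such as 0.95."""
    alpha = 1 - confidence
    # Two-sided splits alpha across both tails; one-sided keeps it in one tail.
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: two-sided {critical_z(level):.3f}, "
          f"one-sided {critical_z(level, two_sided=False):.3f}")
```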

Frequently asked questions

What p-value is needed for statistical significance?

At the standard 95% confidence level, you need a p-value below 0.05 (5%). At 90% confidence the threshold is 0.10; at 99% confidence it is 0.01. The p-value is the probability of observing a difference this large or larger by chance alone if the two variants are truly identical. Smaller is stronger evidence against the null hypothesis.

How many visitors do I need for an A/B test?

It depends on three things: your current conversion rate (baseline), the smallest effect you want to reliably detect (minimum detectable effect, or MDE), and the power and confidence level you choose. Use the "Plan" mode in this calculator to get the required sample size per variation. Typically, low-traffic sites need to run tests for weeks to collect enough data, while high-traffic sites can finish in days.

What is the difference between one-sided and two-sided tests?

A two-sided test checks whether variation B differs from control A in either direction. A one-sided test checks only whether B is better than A. One-sided tests are more sensitive (they reach significance with less data) but they cannot detect cases where the variation performs worse. Use two-sided tests in most situations unless you have a documented, pre-registered reason to expect only an improvement.

What is statistical power and what level should I target?

Power is the probability of detecting a real effect when one exists. The conventional minimum is 80%, which means there is a 20% chance of a false negative (missing a real improvement). Aiming for 80-90% power is recommended for most conversion optimisation tests. If your test comes back not significant but the observed power is below 70%, treat the result as inconclusive rather than evidence that the variations are equivalent.

Can I look at the results before the test is finished?

Peeking at a running test and stopping early when you see significance is one of the most common mistakes in A/B testing. Every intermediate look is an additional chance to see a spurious result, inflating your true false-positive rate well above the stated confidence level. Decide your required sample size before the test, run until you reach it, then read the result once.
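
The effect of peeking can be seen in a small simulation of A/A tests, where both variants share the same true conversion rate and every significant result is a false positive. The Python sketch below is illustrative only: the function names, the 5% conversion rate, the ten looks, the batch size, and the random seed are all assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(n, conv_a, conv_b, alpha=0.05):
    """Two-sided pooled z-test with n visitors in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return False  # no variation yet, z is undefined
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate(runs=300, looks=10, batch=200, rate=0.05, seed=1):
    """Return (false-positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        crossed = False
        for look in range(1, looks + 1):
            # Both variants draw from the same true rate (an A/A test).
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            significant = is_significant(look * batch, conv_a, conv_b)
            crossed = crossed or significant  # would have stopped early here
        peeking_hits += crossed
        final_hits += significant  # result of the last look only
    return peeking_hits / runs, final_hits / runs


peeking, single_look = simulate()
print(f"stop at first significant look: {peeking:.1%}")
print(f"single look at planned sample:  {single_look:.1%}")
```

Because any run that is significant at the final look also counts as a hit under peeking, the peeking rate can never be lower than the single-look rate, and in practice it is usually several times higher.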

What does relative uplift mean?

Relative uplift is the percentage change in conversion rate from control to variation, calculated as (rate_B - rate_A) / rate_A. A 4% control rate rising to 4.8% is a 20% relative uplift even though the absolute increase is only 0.8 percentage points. Relative uplift is used in sample size planning, but the sample size it implies depends on the baseline: a 20% relative lift on a 1% base rate is an absolute change of only 0.2 percentage points and is much harder to detect than the same relative lift on a 10% base rate (2 percentage points).
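
As a quick illustration of the difference between relative and absolute uplift, the Python sketch below applies the same 20% relative lift to several base rates. The function name and the base rates are assumptions chosen for the example.

```python
def uplift(rate_a, rate_b):
    """Return (absolute, relative) uplift from control rate A to variation rate B."""
    absolute = rate_b - rate_a
    relative = absolute / rate_a
    return absolute, relative


# The same 20% relative lift applied to different base rates.
for base in (0.01, 0.04, 0.10):
    absolute, relative = uplift(base, base * 1.20)
    print(f"base {base:.0%}: relative {relative:.0%}, "
          f"absolute {absolute * 100:.1f} percentage points")
```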

Written by Dr. Hannah Brandt, PhD Statistician · Munich, Germany

Applied statistician translating rigorous probability theory into clear, accurate tools for researchers and practitioners.
