A/B Test Calculator
Use this A/B test calculator to analyze a live experiment or plan a future one. In Analyze mode, paste your visitor and conversion counts to get the z-score, p-value, relative uplift, and a clear significance verdict. In Plan mode, enter your baseline conversion rate and the minimum detectable effect to find out how many visitors you need per variation. Choose one-sided or two-sided testing and set your confidence level.
What is an A/B test and why does statistical significance matter?
An A/B test (also called a split test) compares two versions of a webpage, email, or product feature to see which one performs better on a measurable goal such as sign-up rate, purchase rate, or click-through rate. Half your traffic sees the original (control, version A) and the other half sees the change (variation, version B). At the end of the test you compare conversion rates. The problem is that random variation in human behavior means B could look better than A by pure chance even if the change does nothing. Statistical significance testing quantifies that risk. A p-value of 0.05 means there is only a 5% chance of seeing a difference this large (or larger) if A and B are truly identical. It is not the probability that the change has no effect. When p < alpha (your chosen significance threshold) you reject the null hypothesis and conclude that the difference is unlikely to be noise alone.
How the calculator works
The calculator uses a two-proportion z-test, the most widely used frequentist test for conversion-rate experiments. Given the visitor and conversion counts for each variation, it computes a pooled standard error, derives a z-score, and converts that to a p-value using the standard normal distribution. For sample size planning, the calculator solves the standard power formula: n = (z_alpha * sqrt(2*p*(1-p)) + z_beta * sqrt(p*(1-p) + p_B*(1-p_B)))^2 / (p_B - p)^2, where n is the number of visitors per variation, p is the baseline rate, p_B is the target rate, z_alpha is the critical z for your chosen significance level (taken at alpha/2 for a two-sided test and at alpha for a one-sided test), and z_beta comes from your required statistical power. You can switch between one-sided and two-sided hypotheses. Two-sided tests check whether B differs from A in either direction (more conservative). One-sided tests check whether B is specifically better, which gives more power but cannot detect harm.
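The z-test steps described above can be sketched in Python. This is a minimal illustration rather than the calculator's actual source: the function name `two_proportion_z_test`, its parameters, and the example counts are assumptions made for this sketch, and it performs no input validation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(visitors_a, conversions_a,
                          visitors_b, conversions_b, two_sided=True):
    """Return (z, p_value, relative_uplift) for variation B versus control A."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b

    # Pooled conversion rate under the null hypothesis that A and B are equal.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = sqrt(pooled * (1 - pooled)
                          * (1 / visitors_a + 1 / visitors_b))

    z = (rate_b - rate_a) / standard_error

    normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if two_sided:
        p_value = 2 * (1 - normal.cdf(abs(z)))  # B differs in either direction
    else:
        p_value = 1 - normal.cdf(z)             # B is specifically better

    relative_uplift = (rate_b - rate_a) / rate_a
    return z, p_value, relative_uplift


# Hypothetical counts: 500 of 5,000 convert on A, 560 of 5,000 convert on B.
z, p_value, uplift = two_proportion_z_test(5000, 500, 5000, 560)
print(f"z = {z:.2f}, p = {p_value:.3f}, uplift = {uplift:.1%}")
```

With these example counts the two-sided p-value lands just above 0.05 while the one-sided p-value falls below it, which shows how the choice of hypothesis can change the verdict on the same data.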
Sample ratio mismatch (SRM) - the silent killer of A/B tests
A sample ratio mismatch occurs when the observed traffic split between A and B differs substantially from the intended split (usually 50/50). If you intended 5,000 visitors in each group but got 5,000 in A and 4,700 in B, the imbalance suggests a bug in the assignment mechanism, a logging failure, or a bot-filtering problem that affected one group more than the other. Results from a test with SRM are unreliable even if they look statistically significant. This calculator flags a mismatch when the larger group divided by the smaller group exceeds 1.05 (more than 5% relative imbalance). If you see that warning, investigate your traffic assignment before drawing conclusions. Common causes include: redirect-based test setups that lose users before the page loads, differences in how bots are filtered across groups, and caching layers that bypass the experiment for one segment.
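The 1.05 ratio rule can be written as a short check. This is a sketch that assumes an intended 50/50 split; the function name `has_sample_ratio_mismatch` and the `max_ratio` parameter are hypothetical names chosen for illustration.

```python
def has_sample_ratio_mismatch(visitors_a, visitors_b, max_ratio=1.05):
    """Flag an imbalance when larger group / smaller group exceeds max_ratio.

    Assumes the intended split was 50/50.
    """
    larger = max(visitors_a, visitors_b)
    smaller = min(visitors_a, visitors_b)
    if smaller == 0:
        # One group received no traffic at all: treat as a mismatch.
        return True
    return larger / smaller > max_ratio


# The example from the text: 5,000 in A and 4,700 in B gives a ratio near 1.064.
print(has_sample_ratio_mismatch(5000, 4700))
# A milder imbalance, 5,000 versus 4,900, gives a ratio near 1.020.
print(has_sample_ratio_mismatch(5000, 4900))
```

A fixed ratio threshold is simple to explain, but it does not account for sample size, so treat the flag as a prompt to investigate rather than as proof of a bug.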
Minimum detectable effect and test duration planning
Before launching a test, you need to decide what size of improvement is worth detecting. The minimum detectable effect (MDE) is the smallest relative change in conversion rate that matters to your business. A 1% baseline that you want to improve to 1.2% is a 20% relative MDE. Smaller MDEs require larger samples. The sample size formula involves three parameters besides the baseline rate: the significance level (alpha, typically 0.05), statistical power (typically 80-90%), and whether the test is one-sided or two-sided. Lower alpha and higher power both increase the required sample. Once you have the per-variation sample size, divide it by the number of daily unique visitors each variation receives (your daily visitors divided by two for a 50/50 split) to estimate how many days the test must run. Plan before you start; do not peek at results and stop early just because p < 0.05, because that practice inflates the actual false-positive rate above the stated alpha.
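The planning steps can be sketched in Python using the power formula given in the "How the calculator works" section. This is an illustrative sketch, not the calculator's actual code: the function names `visitors_per_variation` and `days_to_run` and their parameters are assumptions, and the sketch takes z_alpha at alpha/2 for a two-sided test.

```python
from math import ceil, sqrt
from statistics import NormalDist


def visitors_per_variation(baseline, relative_mde,
                           alpha=0.05, power=0.80, two_sided=True):
    """Visitors needed in each group to detect the given relative lift."""
    target = baseline * (1 + relative_mde)  # p_B in the formula

    normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = normal.inv_cdf(1 - tail)
    z_beta = normal.inv_cdf(power)

    numerator = (
        z_alpha * sqrt(2 * baseline * (1 - baseline))
        + z_beta * sqrt(baseline * (1 - baseline) + target * (1 - target))
    ) ** 2
    return ceil(numerator / (target - baseline) ** 2)


def days_to_run(per_variation, daily_visitors):
    """Days needed when daily traffic is split 50/50 between A and B."""
    return ceil(per_variation / (daily_visitors / 2))


# The example from the text: 1% baseline, 20% relative MDE (target 1.2%).
needed = visitors_per_variation(baseline=0.01, relative_mde=0.20)
# Hypothetical traffic of 5,000 unique visitors per day.
print(needed, "visitors per variation,", days_to_run(needed, 5000), "days")
```

For the 1% to 1.2% example at alpha = 0.05 (two-sided) and 80% power, the formula asks for roughly 40,000 visitors per variation, which illustrates why small lifts on low baselines need long tests.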
Common significance thresholds and their meaning
| Confidence level | Alpha (type I error) | Typical use case |
|---|---|---|
| 90% | 0.10 | Early-stage or low-stakes tests |
| 95% | 0.05 | Standard for most A/B tests |
| 99% | 0.01 | High-stakes revenue changes |
Alpha is the probability of a false positive (calling a winner when there is none). Power (1-beta) is the probability of detecting a true effect.
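Each confidence level in the table corresponds to a critical z-score that the test statistic must exceed. The sketch below derives those values with Python's standard library; the helper name `critical_z` is an assumption made for illustration.

```python
from statistics import NormalDist


def critical_z(confidence, two_sided=True):
    """Critical z-score for a confidence level such as 0.95."""
    alpha = 1 - confidence
    # A two-sided test splits alpha across both tails of the distribution.
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)


for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%}: two-sided z = {critical_z(confidence):.3f}, "
          f"one-sided z = {critical_z(confidence, two_sided=False):.3f}")
```

The two-sided values come out at roughly 1.645, 1.960, and 2.576 for 90%, 95%, and 99% confidence.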
Frequently asked questions
What p-value means an A/B test is statistically significant?
A p-value below your chosen significance threshold (alpha) is conventionally considered significant. The most common threshold is 0.05, meaning you accept a 5% chance of a false positive when the two versions truly perform the same. If your p-value is 0.03, for example, there is only a 3% chance of observing a difference this large or larger if there is no real difference. Many teams use 95% confidence (alpha = 0.05) for routine tests and 99% (alpha = 0.01) for changes that directly affect revenue or safety.
What is the difference between one-sided and two-sided tests?
A two-sided test checks whether B is different from A in either direction, better or worse. A one-sided test only checks whether B is better. One-sided tests have more statistical power (you can detect smaller effects with the same sample) but they cannot catch harm. If there is any realistic chance that the change could hurt your metric, use a two-sided test. One-sided tests are appropriate only when you are certain the direction of any effect must be positive.
How long should I run an A/B test?
Run your test until you reach the pre-calculated sample size, not until you see a significant result. Stopping the moment significance is reached (peeking) inflates the false-positive rate well above the stated alpha. As a rule of thumb, run tests for at least one full business week to capture weekly seasonality, and never stop after fewer than 100 conversions in the smaller group. Use Plan mode to calculate the exact target sample size before you start.
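The effect of peeking can be illustrated with a small simulation of A/A tests, where both groups share the same true conversion rate, so every "significant" result is a false positive. This is a sketch under stated assumptions: the function names, the number of looks, the traffic per look, the 10% conversion rate, and the seed are all arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(n_a, conv_a, n_b, conv_b, alpha=0.05):
    """Two-sided, pooled two-proportion z-test verdict."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0, 1):
        return False  # no variation in outcomes, nothing to test
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate_aa_tests(runs=400, looks=5, visitors_per_look=400,
                      rate=0.10, seed=0):
    """Return (false-positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    flagged_at_any_look = 0
    flagged_at_end = 0
    for _ in range(runs):
        n = conv_a = conv_b = 0
        flagged = False
        for _ in range(looks):
            n += visitors_per_look
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            result = is_significant(n, conv_a, n, conv_b)
            flagged = flagged or result  # peeking stops at the first "win"
        flagged_at_any_look += flagged
        flagged_at_end += result         # verdict from the final look only
    return flagged_at_any_look / runs, flagged_at_end / runs


peeking_rate, final_rate = simulate_aa_tests()
print(f"peeking: {peeking_rate:.1%}, single final look: {final_rate:.1%}")
```

Because the final look is one of the looks a peeker checks, the peeking rate can never be lower than the single-look rate, and with several looks it typically ends up well above the nominal 5%.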
What is statistical power and why does it matter?
Statistical power (1-beta) is the probability that your test will detect a true improvement of the specified size. At 80% power you will miss 20% of real effects of that magnitude. Increasing power to 90% reduces missed opportunities but requires roughly a third more visitors (at alpha = 0.05, two-sided). Low power is a common reason teams conclude a test had "no effect" when the effect was real but the experiment was not large enough to see it. Run the sample size calculator before launching to ensure your test is adequately powered.
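The extra traffic needed for higher power can be estimated with a common approximation in which the required sample size scales with (z_alpha + z_beta)^2. This is a rough sketch rather than the calculator's exact formula, and the function name `relative_sample_size` is an assumption made for illustration.

```python
from statistics import NormalDist


def relative_sample_size(power_from, power_to, alpha=0.05, two_sided=True):
    """Approximate factor by which the sample grows when power is raised.

    Assumes sample size is proportional to (z_alpha + z_beta) squared.
    """
    normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = normal.inv_cdf(1 - tail)
    z_from = normal.inv_cdf(power_from)
    z_to = normal.inv_cdf(power_to)
    return ((z_alpha + z_to) / (z_alpha + z_from)) ** 2


ratio = relative_sample_size(0.80, 0.90)
print(f"Going from 80% to 90% power needs about {ratio - 1:.0%} more visitors")
```

At alpha = 0.05 with a two-sided test the factor comes out near 1.34; a one-sided test at the same alpha gives a somewhat larger factor.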
What is a minimum detectable effect (MDE) and how do I choose one?
The MDE is the smallest change in conversion rate worth detecting. It is usually expressed as a relative percentage of the baseline (e.g., 10% relative means 4.0% baseline to 4.4%). A smaller MDE requires a much larger sample because tiny differences are harder to distinguish from noise. Choose your MDE based on what improvement would be worth shipping: if a change producing less than 5% relative lift would not change any business decision, set MDE = 5% and save yourself weeks of test time.
Can I test more than two variants at once?
Testing A vs B vs C (often called an A/B/n test) is possible, but this calculator covers the two-variant case. When running more than two groups, each additional comparison increases the chance of a false positive (the multiple comparisons problem). You must apply a correction such as Bonferroni or Holm-Bonferroni, or use a dedicated multi-arm testing framework. Splitting traffic across three groups also increases the required total sample size.
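The two corrections named above can be sketched as follows. The function names and the example p-values are hypothetical; each p-value is assumed to come from comparing one variant against the control.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm procedure; never rejects fewer than Bonferroni."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return reject


# Hypothetical p-values for "B vs A" and "C vs A".
p_values = [0.030, 0.020]
print("Bonferroni:", bonferroni(p_values))
print("Holm-Bonferroni:", holm_bonferroni(p_values))
```

With these example values Bonferroni accepts only the second comparison, while Holm-Bonferroni accepts both, because after the smallest p-value passes the strictest threshold the remaining one is tested against a looser threshold.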
What is sample ratio mismatch and what should I do if I see it?
A sample ratio mismatch (SRM) means the actual split of visitors between A and B is different from the intended split. It usually signals a technical bug in the experiment setup: a redirect that some users do not follow, a caching layer skipping the experiment, or bots being filtered differently across groups. Any SRM makes your results unreliable. Stop the test, fix the root cause, and restart the experiment from scratch.