Linear Regression Calculator
Paste paired x and y values to get the full ordinary least-squares result: slope, intercept, the equation y = mx + b, r and r², standard errors, t-statistics, p-values, the overall F-test, and an optional predicted y with confidence and prediction intervals. A step-by-step panel and residuals table show every calculation.
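Formula
m = Sxy / Sxx, b = y-bar - m * x-bar, and r = Sxy / sqrt(Sxx * Syy), where Sxy = sum((x - x-bar)(y - y-bar)), Sxx = sum((x - x-bar)^2) and Syy = sum((y - y-bar)^2).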
Worked example
For x = 1, 2, 3, 4, 5 and y = 2, 4, 5, 4, 6: x-bar = 3, y-bar = 4.2. Sxy = 8, Sxx = 10, so m = 8 / 10 = 0.8 and b = 4.2 - 0.8 * 3 = 1.8. The line is y = 0.8x + 1.8. The residuals are -0.6, 0.6, 0.8, -1.0 and 0.2, so SSE = 2.4 and Se = sqrt(2.4/3) = 0.894. SE(m) = 0.894 / sqrt(10) = 0.283. t = 0.8 / 0.283 = 2.83 (two-tailed p = 0.066 for n = 5, df = 3).
What least-squares regression finds
Linear regression fits a straight line through paired data so the line summarizes how the response y changes with the predictor x. The "least squares" line is the unique line that minimizes the sum of the squared vertical distances between each observed point and the line. Squaring the residuals keeps positive and negative gaps from cancelling and penalizes large misses more heavily. The result is two numbers, slope and intercept, that describe the overall trend and let you predict y for any new x within your measured range.
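The "minimizes the sum of squared distances" idea can be checked numerically. The sketch below is illustrative only: the helper name `sse` is an assumption, and the data, slope 0.8 and intercept 1.8 are taken from the worked example. It compares the least-squares line with two nearby lines.

```python
def sse(xs, ys, m, b):
    """Sum of squared vertical distances between the points and y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

best = sse(xs, ys, 0.8, 1.8)     # least-squares line for these points
steeper = sse(xs, ys, 0.9, 1.8)  # same intercept, slightly steeper slope
shifted = sse(xs, ys, 0.8, 2.0)  # same slope, line shifted upward

print(round(best, 2), round(steeper, 2), round(shifted, 2))
```

Both perturbed lines give a larger sum of squares than the fitted line, which is what the least-squares criterion predicts for any other choice of slope and intercept.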
Slope, intercept, r and r-squared
The slope m equals Sxy divided by Sxx, where Sxy is the sum of the cross-products of the x and y deviations from their means and Sxx is the sum of the squared x deviations. This ratio is the covariance of x and y divided by the variance of x. The intercept b = y-bar - m * x-bar guarantees the line passes through the point of averages (x-bar, y-bar). The Pearson correlation r is Sxy divided by the square root of the product Sxx * Syy, where Syy is the sum of the squared y deviations, and r-squared is its square: the fraction of the variation in y that the line accounts for.
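As a minimal sketch of these definitions, assuming plain Python lists as input (the function name `fit_line` is hypothetical and is not the calculator's own code):

```python
import math

def fit_line(xs, ys):
    """Return slope, intercept, r and r-squared from the deviation sums."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)                       # squared x deviations
    syy = sum((y - y_bar) ** 2 for y in ys)                       # squared y deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # cross-products
    m = sxy / sxx
    b = y_bar - m * x_bar            # forces the line through (x_bar, y_bar)
    r = sxy / math.sqrt(sxx * syy)
    return m, b, r, r * r

m, b, r, r2 = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
print(round(m, 3), round(b, 3), round(r, 3), round(r2, 3))
```

For the worked-example data this gives a slope of 0.8 and an intercept of 1.8, with r of about 0.853 and r-squared of about 0.727.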
Standard error, t-statistic and p-value
The standard error of the estimate Se = sqrt(SSE / (n-2)) measures, in y units, how far the observed y values scatter around the fitted line. From Se you can compute the standard error of the slope: SE(m) = Se / sqrt(Sxx). Dividing the slope by its standard error gives the t-statistic, which follows a t-distribution with n-2 degrees of freedom under the null hypothesis that the true slope is zero. The two-tailed p-value from that t-distribution tells you how likely a slope at least this far from zero, in either direction, would be if there were truly no linear relationship. The F-statistic for simple linear regression equals t-squared and tests the same hypothesis from an ANOVA perspective.
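The chain from SSE to the p-value can be sketched with the standard library alone. Everything named here (`t_pdf`, `two_tailed_p`, `slope_inference`) is an assumption made for illustration. The p-value is approximated by integrating the t density with Simpson's rule, which is a simple stand-in for a proper statistical library and not necessarily how the calculator computes it.

```python
import math

def t_pdf(u, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """P(|T| >= |t|): integrate the density from 0 to |t| (Simpson's rule, even steps)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

def slope_inference(xs, ys):
    """Return SSE, Se, SE(m), t, F and the two-tailed p-value for the slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))   # standard error of the estimate
    se_m = se / math.sqrt(sxx)      # standard error of the slope
    t = m / se_m
    return {"SSE": sse, "Se": se, "SE(m)": se_m, "t": t, "F": t * t,
            "p": two_tailed_p(t, n - 2)}

result = slope_inference([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
for name, value in result.items():
    print(name, round(value, 3))
```

With the five worked-example points, F comes out equal to t-squared by construction. The numerical integration is adequate for small and moderate samples; `math.gamma` overflows for very large degrees of freedom, so a dedicated statistics library is the safer choice in practice.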
Confidence intervals and prediction intervals
The confidence interval for the mean response at a given x is a range that is likely to contain the true average y at that x. The prediction interval is wider: it covers where a single new observation is likely to fall, so it adds the scatter of individual points to the uncertainty in the fitted line. Both intervals are narrowest at x-bar and widen as x moves away from it, which is one reason extrapolation is risky. The significance level you choose (default 5%) determines the multiplier: a 95% interval uses the two-tailed t critical value for alpha = 0.05 and df = n-2.
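A sketch of both intervals, using the standard textbook half-widths t * Se * sqrt(1/n + (x0 - x-bar)^2 / Sxx) for the mean response and t * Se * sqrt(1 + 1/n + (x0 - x-bar)^2 / Sxx) for a single new observation. These formulas are not spelled out on this page, so treat them as an assumption about what the calculator does. The function name `intervals` is hypothetical, and the critical value 3.182 (95%, df = 3) is read from a t-table and passed in by hand.

```python
import math

def intervals(xs, ys, x0, t_crit):
    """Return predicted y at x0, the confidence interval and the prediction interval."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))
    y_hat = m * x0 + b
    # This term grows with the squared distance of x0 from x_bar.
    dist = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = t_crit * se * math.sqrt(dist)      # mean response
    pi_half = t_crit * se * math.sqrt(1 + dist)  # single new observation
    return y_hat, (y_hat - ci_half, y_hat + ci_half), (y_hat - pi_half, y_hat + pi_half)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
T_CRIT = 3.182  # two-tailed 95% t critical value for df = 3, from a t-table

for x0 in (3, 5):
    y_hat, ci, pi = intervals(xs, ys, x0, T_CRIT)
    print(x0, round(y_hat, 2), [round(v, 2) for v in ci], [round(v, 2) for v in pi])
```

At x0 = 3, which is x-bar for this data, both intervals are at their narrowest; at x0 = 5 both are wider, and at either point the prediction interval is wider than the confidence interval.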
Reading the results and their limits
A high r-squared means the line explains much of the variance in y, but it does not confirm that a line is the right model. Curved data, influential outliers, or a restricted x range can inflate or distort r-squared. Always inspect the residuals table: if the residuals show a systematic curve, a straight line is the wrong model, and if they fan out as x increases, the scatter is not constant and the standard errors and intervals become unreliable. Regression describes association, not causation. The intercept is only physically meaningful when x = 0 is a sensible value, and predictions far outside your data range are unreliable no matter how good the fit looks.
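To see how a high r-squared can coexist with the wrong model, here is a small made-up example: the y values are exactly x squared, yet a straight line still scores well. The data and the helper name `fit_and_residuals` are assumptions chosen purely for illustration.

```python
def fit_and_residuals(xs, ys):
    """Fit the least-squares line and return r-squared and the residuals."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    return sxy ** 2 / (sxx * syy), residuals

xs = [1, 2, 3, 4, 5]
ys = [x * x for x in xs]  # deliberately curved: 1, 4, 9, 16, 25

r2, residuals = fit_and_residuals(xs, ys)
print(round(r2, 3), [round(e, 2) for e in residuals])
```

The r-squared is about 0.963, but the residuals run positive, negative, negative, negative, positive. That U-shaped pattern is the systematic curve the residuals table is meant to reveal.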
Interpreting r-squared (coefficient of determination)
| r-squared range | Variation explained | Fit quality |
|---|---|---|
| 0.90 to 1.00 | 90% to 100% | Very strong |
| 0.70 to < 0.90 | 70% to < 90% | Strong |
| 0.50 to < 0.70 | 50% to < 70% | Moderate |
| 0.30 to < 0.50 | 30% to < 50% | Weak |
| 0.00 to < 0.30 | 0% to < 30% | Poor |
A rough guide only. Context and field conventions vary considerably.
Frequently asked questions
What is the difference between the slope and r-squared?
The slope tells you the direction and steepness of the relationship: how many units y changes for each one-unit change in x. The r-squared tells you how tightly the points cluster around that line. You can have a steep slope with a poor fit (lots of scatter) or a gentle slope with a near-perfect fit, so the two numbers answer different questions and should be read together.
What is the difference between a confidence interval and a prediction interval?
A confidence interval for the mean response at a chosen x gives a range that is likely to contain the true average y at that x. A prediction interval is wider because it also accounts for the scatter of individual observations around that mean. In practice the prediction interval is what you want when predicting a single future measurement, while the confidence interval is appropriate when estimating the expected (average) response.
What does the p-value for the slope mean?
The p-value for the slope answers: "If the true slope were zero, how likely is it that random sampling alone would produce a slope at least as far from zero as the one I observed?" A small p-value (say, below 0.05) is evidence against the null hypothesis of no linear relationship. It does not prove causation or guarantee the model is correct; it says only that a slope this extreme would be unusual if the true slope were zero.
How many data points do I need?
You need at least three data points before this calculator returns meaningful inference statistics (standard errors, t and p), because two points define the line perfectly and leave zero degrees of freedom for error. For reliable, stable estimates you generally want 10 or more observations, and the confidence intervals will automatically reflect the extra uncertainty when n is small.
When should I force the intercept to zero?
Force b = 0 only when theory or physics demands that y equal zero when x equals zero; for example, distance traveled must be zero when elapsed time is zero. In most practical situations, forcing the intercept to zero is wrong: it biases the slope estimate and inflates r-squared in ways that make the fit look better than it actually is. Leave the intercept free unless you have a strong theoretical justification.
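The effect can be sketched on the worked-example data. With the intercept forced to zero, the least-squares slope becomes sum(x * y) / sum(x^2). The r-squared shown for that fit uses the uncentered convention 1 - SSE / sum(y^2), which is one common convention for no-intercept models and is an assumption here, not a statement about how this calculator reports it. Both function names are hypothetical.

```python
def fit_free(xs, ys):
    """Ordinary fit with a free intercept: returns slope, intercept, r-squared."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)  # variation around the mean of y
    return m, b, 1 - sse / sst

def fit_through_origin(xs, ys):
    """Fit y = m*x with b forced to 0: returns slope and uncentered r-squared."""
    m = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    sse = sum((y - m * x) ** 2 for x, y in zip(xs, ys))
    sst0 = sum(y * y for y in ys)            # variation around zero, not the mean
    return m, 1 - sse / sst0

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

m_free, b_free, r2_free = fit_free(xs, ys)
m_zero, r2_zero = fit_through_origin(xs, ys)
print(round(m_free, 3), round(b_free, 3), round(r2_free, 3))
print(round(m_zero, 3), round(r2_zero, 3))
```

The forced fit moves the slope from 0.8 to about 1.291 and reports an r-squared of about 0.945 against 0.727 for the free fit, even though its sum of squared residuals is more than twice as large. The two r-squared values are measured against different baselines, so they are not comparable.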
Why does my r-squared look high even though the line looks wrong?
A high r-squared only means the straight line explains much of the variance in y; it does not confirm that a line is the right model. Curved data, a few extreme outliers, or a restricted x range can all inflate or distort r-squared. Always plot your data and inspect the residuals table. If the residuals bend systematically away from zero, a linear model is the wrong choice no matter how large r-squared appears.