Linear Regression Calculator

Paste paired x and y values to get the full ordinary least-squares result: slope, intercept, the equation y = mx + b, r and r², standard errors, t-statistics, p-values, the overall F-test, and an optional predicted y with confidence and prediction intervals. A step-by-step panel and residuals table show every calculation.

Your details

  • x values: the independent (predictor) values, separated by commas.
  • y values: the dependent (response) values, in the same order as the x values.
  • Significance level: used to compute confidence and prediction intervals and to flag statistical significance.
  • Predict y at x (optional): enter an x value to get the predicted y, the confidence interval for the mean, and the prediction interval for a single new observation.
  • Force intercept through zero: when enabled, the line is constrained to pass through the origin, y = mx. Use only when theory demands it.
  • Slope (m): 0.8
  • Intercept (b): 1.8
  • r² (coefficient of determination): 0.7273 (strong fit)
  • Correlation (r): 0.8528
  • Std error of estimate (Se): 0.8944
  • t-statistic (slope): 2.8284
  • p-value (slope): 0.0663
  • F-statistic (overall): 8
  • p-value (F-test): 0.0663
  • Data points (n): 5
  • Predicted y: 4.2
  • Confidence interval (low): 2.927
  • Confidence interval (high): 5.473
  • Prediction interval (low): 1.0819
  • Prediction interval (high): 7.3181
[Figure: r^2 gauge reading 0.7273 (strong fit), and a scatter plot of the observed data with the fitted line; horizontal axis x.]

Best-fit line: y = 0.8x + 1.8 (r^2 = 0.7273, n = 5).

  • The slope of 0.8 means y rises by about 0.8 for every one-unit increase in x.
  • r^2 = 0.7273 - the line explains roughly 72.7% of the variation in y; the rest is unexplained scatter.
  • The standard error of the estimate is 0.8944, the typical distance between an observed y and the fitted line.
  • The overall F-test is not statistically significant at the 5% level (F = 8, p = 0.066 with df = 3), so these five points alone cannot rule out a true slope of zero.
  • With only 5 points the estimates are imprecise; more data will narrow the intervals.

Next step: at your chosen x, y-hat = 4.2. The 95% confidence interval for the mean response is [2.927, 5.473] and the prediction interval for a new single observation is [1.0819, 7.3181].

Residuals table

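| x | y (observed) | y-hat (fitted) | Residual | Residual^2 |
|---|---|---|---|---|
| 1 | 2 | 2.6 | -0.6 | 0.36 |
| 2 | 4 | 3.4 | 0.6 | 0.36 |
| 3 | 5 | 4.2 | 0.8 | 0.64 |
| 4 | 4 | 5 | -1 | 1 |
| 5 | 6 | 5.8 | 0.2 | 0.04 |
| SSE | | | | 2.4 |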

SSE is the residual sum of squares; Se = sqrt(SSE / (n-2)).

Formula

$$
m = \dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2},\quad b = \bar{y} - m\bar{x},\quad r^2 = \dfrac{S_{xy}^2}{S_{xx}S_{yy}},\quad S_e = \sqrt{\dfrac{\mathrm{SSE}}{n-2}},\quad t = \dfrac{m}{S_e/\sqrt{S_{xx}}}
$$

Worked example

For x = 1, 2, 3, 4, 5 and y = 2, 4, 5, 4, 6: x-bar = 3, y-bar = 4.2. Sxy = 8, Sxx = 10, so m = 8 / 10 = 0.8 and b = 4.2 - 0.8 * 3 = 1.8. The line is y = 0.8x + 1.8. SSE = 2.4, Se = sqrt(2.4/3) = 0.8944. SE(m) = 0.8944 / sqrt(10) = 0.2828. t = 0.8 / 0.2828 = 2.8284 (p = 0.066 for n = 5, df = 3).
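
The steps of this worked example can be sketched in a few lines of Python. This is a minimal illustration under the formulas given above, not the calculator's own code; the function name `fit_line` and the dictionary keys are hypothetical. For this data the SSE it reports agrees with the residuals table.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b with basic slope inference."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products of deviations from the means.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    # Residual sum of squares and standard error of the estimate.
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))
    se_m = se / math.sqrt(sxx)  # standard error of the slope
    return {"m": m, "b": b, "sse": sse, "se": se, "se_m": se_m, "t": m / se_m}

result = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
for name, value in result.items():
    print(f"{name} = {value:.4f}")
```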

What least-squares regression finds

Linear regression fits a straight line through paired data so the line summarizes how the response y changes with the predictor x. The "least squares" line is the unique line that minimizes the sum of the squared vertical distances between each observed point and the line. Squaring the residuals keeps positive and negative gaps from cancelling and penalizes large misses more heavily. The result is two numbers, slope and intercept, that describe the overall trend and let you predict y for any new x within your measured range.
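
As a small sanity check on the "least squares" idea, the sketch below compares the sum of squared residuals of the fitted line from the worked example (slope 0.8, intercept 1.8) with two nearby lines. The helper name `sse` and the two alternative lines are chosen only for illustration.

```python
def sse(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

best = sse(xs, ys, 0.8, 1.8)     # the least-squares line for this data
steeper = sse(xs, ys, 0.9, 1.8)  # same intercept, slightly steeper slope
higher = sse(xs, ys, 0.8, 2.0)   # same slope, line shifted upward

print(round(best, 4), round(steeper, 4), round(higher, 4))
```

Both alternative lines give a larger sum of squares than the fitted line, which is what minimizing the squared residuals implies.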

Slope, intercept, r and r-squared

The slope m equals Sxy divided by Sxx, where Sxy is the sum of the cross-products of the deviations from the mean and Sxx is the sum of the squared x deviations. This ratio is the covariance of x and y scaled by the variance of x. The intercept b = y-bar - m * x-bar guarantees the line passes through the point of averages. The Pearson correlation r is Sxy divided by the square root of the product Sxx * Syy, and r-squared is its square: the fraction of the variation in y that the line accounts for.
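
The correlation formulas above can be sketched as follows, using the data from the worked example. The function name `correlation` is hypothetical and the code is only an illustration of the definitions, not the calculator's implementation.

```python
import math

def correlation(xs, ys):
    """Return Pearson r and r-squared from the deviation sums Sxx, Syy, Sxy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    return r, r * r

r, r_squared = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```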

Standard error, t-statistic and p-value

The standard error of the estimate Se = sqrt(SSE / (n-2)) measures how far the observed y values scatter around the fitted line in y units. From Se you can compute the standard error of the slope: SE(m) = Se / sqrt(Sxx). Dividing the slope by its standard error gives the t-statistic, which follows a t-distribution with n-2 degrees of freedom under the null hypothesis that the true slope is zero. The two-tailed p-value from that t-distribution tells you how likely a slope at least this far from zero would be if there were truly no linear relationship. The F-statistic for simple linear regression equals t-squared and tests the same hypothesis from an ANOVA perspective.
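
One way to turn a t-statistic into a two-tailed p-value without a statistics library is to integrate the t density numerically. The sketch below does this with Simpson's rule, which is an assumption of this example rather than the calculator's documented method; the function names `t_pdf` and `two_tailed_p` are hypothetical. It uses Se = sqrt(2.4 / 3), Sxx = 10 and m = 0.8 from the worked example.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value: 1 - 2 * integral of the density from 0 to |t|.

    Uses Simpson's rule (steps must be even). Suitable for small df only,
    because math.gamma overflows for very large arguments.
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0

se = math.sqrt(2.4 / 3)             # standard error of the estimate
t_stat = 0.8 / (se / math.sqrt(10)) # slope divided by its standard error
p_value = two_tailed_p(t_stat, df=3)
f_stat = t_stat ** 2                # F equals t-squared in simple regression

print(f"t = {t_stat:.4f}, F = {f_stat:.4f}, p = {p_value:.4f}")
```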

Confidence intervals and prediction intervals

The confidence interval for the mean response at a given x is a range that, at the chosen confidence level, is expected to contain the true average y at that x. The prediction interval is wider: it covers where a single new observation is likely to fall. Both intervals are narrowest at x = x-bar and widen as x moves further from it, reflecting the greater uncertainty of extrapolation. The significance level you choose (default 5%) determines the multiplier: a 95% interval uses the two-tailed t critical value for alpha = 0.05 and df = n-2.
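
A minimal sketch of both intervals, using the standard textbook formulas for simple linear regression: half-width t* x Se x sqrt(1/n + (x0 - x-bar)^2 / Sxx) for the mean response, with an extra 1 under the square root for a single new observation. The function name `intervals` is hypothetical, and the critical value 3.182 is the tabulated two-tailed 95% t value for df = 3, supplied by hand rather than computed.

```python
import math

def intervals(xs, ys, x0, t_crit):
    """Return (y_hat, confidence interval, prediction interval) at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))
    y_hat = m * x0 + b
    # This term grows as x0 moves away from x_bar, widening both intervals.
    spread = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = t_crit * se * math.sqrt(spread)      # mean response
    pi_half = t_crit * se * math.sqrt(1 + spread)  # single new observation
    return y_hat, (y_hat - ci_half, y_hat + ci_half), (y_hat - pi_half, y_hat + pi_half)

T_CRIT = 3.182  # tabulated value for a 95% interval with df = 3 (assumption: alpha = 0.05)
y_hat, ci, pi = intervals([1, 2, 3, 4, 5], [2, 4, 5, 4, 6], x0=3, t_crit=T_CRIT)
print(f"y-hat = {y_hat:.4f}")
print(f"confidence interval = [{ci[0]:.3f}, {ci[1]:.3f}]")
print(f"prediction interval = [{pi[0]:.3f}, {pi[1]:.3f}]")
```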

Reading the results and their limits

A high r-squared means the line explains much of the variance in y, but it does not confirm that a line is the right model. Curved data, influential outliers, or a restricted x range can inflate or distort r-squared. Always inspect the residuals table: if the residuals show a systematic curve or fan out as x increases, a linear model is the wrong choice. Regression describes association, not causation. The intercept is only physically meaningful when x = 0 is a sensible value, and predictions far outside your data range are unreliable no matter how good the fit looks.

Interpreting r-squared (coefficient of determination)

r-squared rangeVariation explainedFit quality
0.90 to 1.0090% to 100% Very strong
0.70 to 0.9070% to 90% Strong
0.50 to 0.7050% to 70% Moderate
0.30 to 0.5030% to 50% Weak
0.00 to 0.300% to 30% Poor

A rough guide only. Context and field conventions vary considerably.

Frequently asked questions

What is the difference between the slope and r-squared?

The slope tells you the direction and steepness of the relationship: how many units y changes for each one-unit change in x. The r-squared tells you how tightly the points cluster around that line. You can have a steep slope with a poor fit (lots of scatter) or a gentle slope with a near-perfect fit, so both numbers answer different questions and should be read together.

What is the difference between a confidence interval and a prediction interval?

A confidence interval for the mean response at a chosen x estimates the range where the true average y lies across many samples at that x. A prediction interval is wider because it also accounts for the scatter of individual observations around the mean. In practice the prediction interval is what you want when predicting a single future measurement, while the confidence interval is appropriate when estimating the expected (average) response.

What does the p-value for the slope mean?

The p-value for the slope answers: "If the true slope were zero, how likely is it that random sampling alone would produce a slope at least as far from zero as the one I observed?" A small p-value (say, below 0.05) is evidence against the null hypothesis of no linear relationship. It does not prove causation or guarantee the model is correct; it only says that a slope this far from zero would be unusual if there were truly no linear relationship.

How many data points do I need?

You need at least three data points before this calculator returns meaningful inference statistics (standard error, t and p), because two points define the line perfectly and leave zero degrees of freedom for error. For reliable, stable estimates you generally want 10 or more observations; when n is small, the wider confidence intervals reflect the extra uncertainty.

When should I force the intercept to zero?

Force b = 0 only when theory or physics demands that y must equal zero when x equals zero, for example, distance traveled when time is zero must be zero. In most practical situations, forcing the intercept to zero is wrong: it biases the slope estimate and inflates r-squared in ways that make the fit look better than it actually is. Leave the intercept free unless you have a strong theoretical justification.
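
For comparison, a line forced through the origin has the least-squares slope m = sum(x * y) / sum(x^2). The sketch below applies that standard formula to the worked-example data; the function name `slope_through_origin` is hypothetical, and the data is used only to show how the constrained and free fits can differ.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope for the constrained model y = m*x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

m_origin = slope_through_origin(xs, ys)
sse_origin = sum((y - m_origin * x) ** 2 for x, y in zip(xs, ys))
# Free fit from the worked example: y = 0.8x + 1.8.
sse_free = sum((y - (0.8 * x + 1.8)) ** 2 for x, y in zip(xs, ys))

print(f"slope through origin = {m_origin:.4f}")
print(f"SSE through origin = {sse_origin:.4f}, SSE with free intercept = {sse_free:.4f}")
```

On this data the constrained slope is noticeably steeper than 0.8 and the residual sum of squares is larger, which illustrates why the constraint should be reserved for cases where theory requires it.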

Why does my r-squared look high even though the line looks wrong?

A high r-squared only means the straight line explains much of the variance in y; it does not confirm that a line is the right model. Curved data, a few extreme outliers, or a restricted x range can all inflate or distort r-squared. Always plot your data and inspect the residuals table. If the residuals bend systematically away from zero, a linear model is the wrong choice no matter how large r-squared appears.

Written by Dr. Hannah Brandt, PhD Statistician · Munich, Germany

Applied statistician translating rigorous probability theory into clear, accurate tools for researchers and practitioners.
