Scatter Plot Calculator with Linear Regression

Enter your (x, y) data pairs below and this calculator instantly plots the regression line and returns the slope, y-intercept, regression equation, Pearson correlation coefficient (r), coefficient of determination (R-squared), root mean square error (RMSE), and basic descriptive statistics for both variables. Results update as you type.

By Dr. Rajiv Menon, PhD · Updated June 7, 2026

Regression equation · Very strong positive

y = 1.9429x - 0.1333

Best-fit line in the form y = mx + b

Slope (m): 1.9429
Y-intercept (b): -0.1333
Pearson r: 0.9905
R-squared: 0.981
RMSE: 0.4612
Data points (n): 6
x mean: 3.5
y mean: 6.6667
x std dev: 1.7078
y std dev: 3.35

[Scatter plot: data points with the fitted regression line]

Very strong positive correlation (r = 0.9905)

The correlation coefficient is r = 0.9905, indicating a very strong positive linear relationship.
R-squared = 0.9810: the regression line explains 98.1% of the variation in y.
RMSE = 0.4612, the typical distance of a data point from the regression line in y-units.

Next step: Check the scatter plot visually to confirm the relationship is truly linear before relying on the regression equation.

1. Count the data points: n = 6
   Result: 6 pairs
2. Calculate x mean and y mean: x-bar = sum(x) / n, y-bar = sum(y) / n
   Result: x-bar = 3.5000, y-bar = 6.6667
3. Calculate SS_xx, SS_xy, SS_yy (sums of squared deviations and cross-products): SS_xx = sum((xi - x-bar)^2), SS_xy = sum((xi - x-bar)(yi - y-bar)), SS_yy = sum((yi - y-bar)^2)
   Result: SS_xx = 17.5000, SS_xy = 34.0000, SS_yy = 67.3333
4. Calculate slope m = SS_xy / SS_xx: m = 34.0000 / 17.5000
   Result: m = 1.9429
5. Calculate y-intercept b = y-bar - m * x-bar: b = 6.6667 - 1.9429 * 3.5000
   Result: b = -0.1333
6. Write the regression equation: y = m*x + b
   Result: y = 1.9429x - 0.1333
7. Calculate Pearson r = SS_xy / sqrt(SS_xx * SS_yy): r = 34.0000 / sqrt(17.5000 * 67.3333)
   Result: r = 0.9905
8. Calculate R-squared = r^2: R^2 = 0.9905^2
   Result: R^2 = 0.9810 (98.1% of variance explained)
9. Calculate RMSE = sqrt(mean of squared residuals): RMSE = sqrt(sum((yi - y-hat_i)^2) / n)
   Result: RMSE = 0.4612
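
The steps above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the calculator's actual code; the variable names are assumptions chosen to mirror the symbols on this page.

```python
import math

# Sample data from this page's worked example.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 8, 9, 12]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squared deviations and cross-products.
ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_yy = sum((y - y_bar) ** 2 for y in ys)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

m = ss_xy / ss_xx                      # slope
b = y_bar - m * x_bar                  # y-intercept
r = ss_xy / math.sqrt(ss_xx * ss_yy)   # Pearson r
r_squared = r ** 2

# RMSE divides by n (not n - 2), matching the formula in the steps above.
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
rmse = math.sqrt(sum(e ** 2 for e in residuals) / n)

sign = "+" if b >= 0 else "-"
print(f"y = {m:.4f}x {sign} {abs(b):.4f}")
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}, RMSE = {rmse:.4f}")
```

Run as written, this should reproduce the figures shown in the results panel for the sample data.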

Formula

m = \dfrac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}, \quad b = \bar{y} - m\bar{x}, \quad r = \dfrac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}

Worked example

For 6 pairs: (1,2), (2,4), (3,5), (4,8), (5,9), (6,12): x-bar = 3.5, y-bar = 6.6667. SS_xx = 17.5, SS_xy = 34.0, slope m = 34.0/17.5 = 1.9429, intercept b = 6.6667 - 1.9429 * 3.5 = -0.1333. Regression equation: y = 1.9429x - 0.1333. Pearson r = 0.9905, R-squared = 0.9810. The line explains 98.1% of the variance in y.

What is a scatter plot and why does it matter?

A scatter plot graphs paired (x, y) data on a coordinate plane. Each point represents one observation, with x on the horizontal axis and y on the vertical. The pattern of the cloud reveals whether a relationship exists between the two variables: a tight upward-slanting band suggests a positive linear relationship, a downward band suggests a negative one, a circular cloud suggests no relationship, and a curved band suggests a nonlinear one. Scatter plots are the essential first step before any regression analysis because they let you see whether fitting a straight line is even appropriate.

How linear regression and the regression equation work

Linear regression finds the single straight line that minimises the sum of squared vertical distances from each data point to the line (the least-squares criterion). The line has the equation y = mx + b, where m is the slope (the amount y changes for each one-unit increase in x) and b is the y-intercept (the predicted y when x is zero). The slope is calculated as SS_xy divided by SS_xx, and the intercept is y-bar minus m times x-bar, where SS_xy is the sum of (xi minus x-bar)(yi minus y-bar) over all n points, and SS_xx is the sum of squared x deviations. Once you have the equation you can substitute any x value to predict the corresponding y.
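
To make the least-squares criterion concrete, the sketch below fits the line to this page's sample data and then checks that nudging the slope or intercept away from the fitted values only increases the sum of squared vertical distances. The helper names `fit_line` and `sse` are hypothetical, introduced for illustration.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = ss_xy / ss_xx
    return m, y_bar - m * x_bar


def sse(xs, ys, m, b):
    """Sum of squared vertical distances from the points to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 8, 9, 12]
m, b = fit_line(xs, ys)
best = sse(xs, ys, m, b)

# Any small change to the slope or the intercept gives a worse (larger) total.
for dm, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert sse(xs, ys, m + dm, b + db) > best

# Predict y for a new x by substituting it into the equation.
x_new = 4.5
y_pred = m * x_new + b
print(f"predicted y at x = {x_new}: {y_pred:.4f}")
```

The prediction uses an x inside the observed range (1 to 6); substituting values far outside that range is extrapolation and is less trustworthy.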

Understanding r, R-squared, and RMSE

The Pearson correlation coefficient r measures the strength and direction of the linear relationship on a scale from -1 to +1. A value of +1 means perfect positive correlation (all points on an upward line), -1 means perfect negative correlation (all points on a downward line), and 0 means no linear pattern. R-squared (the coefficient of determination) equals r squared for a straight-line fit with a single x variable, and it tells you what fraction of the total variation in y is explained by the regression line: an R-squared of 0.80 means the line accounts for 80% of the variation. RMSE (root mean square error) is the square root of the average squared difference between actual y values and the values predicted by the line, averaged over all n points. It is expressed in the same units as y and gives a practical sense of the typical prediction error.
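
The link between these three quantities can be written out with the sums of squares used on this page. The symbol SS_res (the sum of squared residuals) is introduced here for illustration and does not appear elsewhere on the page; the numbers are those of the sample data.

```latex
\begin{aligned}
SS_{res} &= \sum (y_i - \hat{y}_i)^2 = SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}}
          = 67.3333 - \frac{34^2}{17.5} \approx 1.2762 \\
R^2 &= 1 - \frac{SS_{res}}{SS_{yy}} = \frac{SS_{xy}^2}{SS_{xx}\, SS_{yy}} = r^2 \approx 0.9810 \\
RMSE &= \sqrt{\frac{SS_{res}}{n}} = \sqrt{\frac{1.2762}{6}} \approx 0.4612
\end{aligned}
```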

Limitations and common mistakes

Linear regression assumes a straight-line relationship, which you should verify visually before trusting the output. Outliers can pull the slope significantly; one extreme point can change r from strong to weak. Correlation is not causation: a high r only means the two variables move together, not that one causes the other. With a small sample (fewer than 10 points) even a high r can appear by chance. Always check the scatter plot first, look for obvious curvature or clusters, and treat the regression equation as a starting model rather than a definitive truth.

Interpreting the Pearson correlation coefficient (r)

|r| range	Strength	Direction	Typical interpretation
0.90 - 1.00	Very strong	Positive or negative	Near-perfect linear relationship
0.70 - 0.89	Strong	Positive or negative	Reliable linear predictor
0.50 - 0.69	Moderate	Positive or negative	Noticeable trend; other factors matter
0.30 - 0.49	Weak	Positive or negative	Slight trend; poor prediction
0.00 - 0.29	Very weak	Positive or negative	Essentially no linear relationship

These ranges are widely used guidelines. The appropriate threshold depends on the field of study and sample size.
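
As a rough sketch, the guideline table can be turned into a small lookup. The function name `describe_correlation` is hypothetical, and the cut-offs treat each band's lower bound as inclusive, which is an assumption: the table leaves small gaps between bands (for example between 0.89 and 0.90).

```python
def describe_correlation(r):
    """Map r to a (strength, direction) pair using this page's guideline bands."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)
    if size >= 0.90:
        strength = "Very strong"
    elif size >= 0.70:
        strength = "Strong"
    elif size >= 0.50:
        strength = "Moderate"
    elif size >= 0.30:
        strength = "Weak"
    else:
        strength = "Very weak"
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction


print(describe_correlation(0.9905))  # ('Very strong', 'positive')
print(describe_correlation(-0.42))   # ('Weak', 'negative')
```

Because the thresholds are conventions rather than rules, a field with noisier data might reasonably shift them.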

Frequently asked questions

How many data points do I need for linear regression?

You need at least 2 points to fit a line, but with only 2 points r is always exactly 1 or -1 (provided the two points differ in both x and y), because a straight line always passes perfectly through 2 points. Meaningful inference typically requires at least 10 to 20 points, and the more data you have the more reliable the regression coefficients and correlation will be.
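
A quick check of the two-point case, as a sketch. The helper `pearson_r` is a hypothetical name, and the three pairs are made-up values chosen only to show a rising, a falling, and a flat case.

```python
import math


def pearson_r(xs, ys):
    """Pearson r, or None when either variable has no spread."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if ss_xx == 0 or ss_yy == 0:
        return None  # r is undefined: division by zero in the formula
    return ss_xy / math.sqrt(ss_xx * ss_yy)


print(pearson_r([1, 4], [3, 10]))  # rising pair: r is 1
print(pearson_r([1, 4], [10, 3]))  # falling pair: r is -1
print(pearson_r([1, 4], [5, 5]))   # flat pair: r is undefined (None)
```

A perfect r from two points says nothing about the underlying relationship; it is a property of the geometry, not of the data.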

What does a negative slope mean?

A negative slope means that as x increases, the predicted y value decreases. For example, if x is the number of hours of sleep and y is tiredness, a negative slope would indicate that more sleep is associated with less tiredness. A negative slope paired with a negative r confirms a negative linear relationship.

What is the difference between r and R-squared?

Pearson r tells you both the direction (positive or negative) and the strength of the linear relationship, on a scale from -1 to +1. R-squared is r raised to the power of 2, so it is always between 0 and 1 and tells you only the strength: specifically, the proportion of the total variance in y that is explained by the regression line. Use r when you care about direction; use R-squared when you want to compare the explanatory power of different models.
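
The sketch below illustrates the difference using this page's sample y values in their original order and in reverse order, so one dataset rises and the other falls. The helper name `slope_and_r` is an assumption made for the example.

```python
import math


def slope_and_r(xs, ys):
    """Return (slope, Pearson r) for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ss_xy / ss_xx, ss_xy / math.sqrt(ss_xx * ss_yy)


xs = [1, 2, 3, 4, 5, 6]
rising = [2, 4, 5, 8, 9, 12]
falling = rising[::-1]  # the same y values in reverse order

for label, ys in [("rising", rising), ("falling", falling)]:
    m, r = slope_and_r(xs, ys)
    print(f"{label}: slope = {m:+.4f}, r = {r:+.4f}, R^2 = {r * r:.4f}")
```

The slope and r change sign together, while R-squared is identical for both datasets, which is why R-squared alone cannot tell you the direction of the relationship.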

Can I use this calculator for nonlinear data?

This calculator fits a straight line (linear model). If your scatter plot shows a curve, an exponential trend, or a power relationship, the linear equation and r will be misleading. In that case you would need to either transform the data (e.g., take logarithms) or use a nonlinear regression tool. Always look at the scatter plot first to judge whether a straight line is a reasonable model.
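
As an illustration of the logarithm approach, the sketch below uses synthetic data that triples at every step (y = 2 * 3^x, an assumption chosen so the answer is known in advance). A straight line fits the raw values poorly, but fits log(y) against x exactly. The helper name `line_and_r` is hypothetical.

```python
import math


def line_and_r(xs, ys):
    """Return (slope, intercept, Pearson r) for the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = ss_xy / ss_xx
    return m, y_bar - m * x_bar, ss_xy / math.sqrt(ss_xx * ss_yy)


# Synthetic exponential data: 2, 6, 18, 54, 162.
xs = [0, 1, 2, 3, 4]
ys = [2 * 3 ** x for x in xs]

_, _, r_raw = line_and_r(xs, ys)

# Fit a line to (x, ln y) instead; this requires every y to be positive.
log_ys = [math.log(y) for y in ys]
m_log, b_log, r_log = line_and_r(xs, log_ys)

# Back-transform: ln y = b_log + m_log * x  =>  y = exp(b_log) * exp(m_log)^x
scale = math.exp(b_log)
growth = math.exp(m_log)

print(f"r on raw data:    {r_raw:.4f}")
print(f"r on log(y) data: {r_log:.4f}")
print(f"recovered curve:  y = {scale:.4f} * {growth:.4f}^x")
```

Note that r on the raw data can still look fairly high (about 0.87 here) even though the relationship is clearly not linear, which is one more reason to look at the scatter plot first.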

What does RMSE tell me that R-squared does not?

R-squared is a dimensionless ratio between 0 and 1, so it is easy to compare across datasets but does not tell you the actual prediction error in real units. RMSE is in the same units as your y variable, so it gives you a concrete sense of how far off a typical prediction will be. For example, if y is temperature in degrees Celsius and RMSE is 2.3, the regression line is typically about 2.3 degrees off for any given prediction.
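
One way to see the difference is to change the units of y. In the sketch below the sample y values are treated as kilometres and then converted to metres; the units and the helper name `r_squared_and_rmse` are assumptions for illustration only.

```python
import math


def r_squared_and_rmse(xs, ys):
    """Return (R-squared, RMSE) for the least-squares line through the data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = ss_xy / ss_xx
    b = y_bar - m * x_bar
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / ss_yy, math.sqrt(ss_res / n)


xs = [1, 2, 3, 4, 5, 6]
ys_km = [2, 4, 5, 8, 9, 12]          # hypothetical unit: kilometres
ys_m = [y * 1000 for y in ys_km]     # the same measurements in metres

r2_km, rmse_km = r_squared_and_rmse(xs, ys_km)
r2_m, rmse_m = r_squared_and_rmse(xs, ys_m)

print(f"kilometres: R^2 = {r2_km:.4f}, RMSE = {rmse_km:.4f}")
print(f"metres:     R^2 = {r2_m:.4f}, RMSE = {rmse_m:.1f}")
```

R-squared stays the same after the conversion, while RMSE scales by the same factor of 1000 as y, so RMSE only makes sense when quoted with its units.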

How do outliers affect the regression line?

Outliers, especially those far from the center of the x-range (called high-leverage points), can pull the slope strongly in their direction and inflate or deflate r dramatically. If you suspect an outlier is distorting your results, try running the regression with and without it, and look at the scatter plot to see whether the line is being unduly influenced.
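
The with-and-without comparison can be sketched as follows. The extra point (12, 3) is a made-up high-leverage outlier added to this page's sample data purely for illustration, and `slope_and_r` is a hypothetical helper name.

```python
import math


def slope_and_r(points):
    """Return (slope, Pearson r) for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return ss_xy / ss_xx, ss_xy / math.sqrt(ss_xx * ss_yy)


clean = [(1, 2), (2, 4), (3, 5), (4, 8), (5, 9), (6, 12)]
outlier = (12, 3)  # far to the right of the other x values, with a low y

m_clean, r_clean = slope_and_r(clean)
m_all, r_all = slope_and_r(clean + [outlier])

print(f"without outlier: slope = {m_clean:.4f}, r = {r_clean:.4f}")
print(f"with outlier:    slope = {m_all:.4f}, r = {r_all:.4f}")
```

In this example a single added point flattens the slope to nearly zero and takes r from very strong to very weak, which is the kind of shift the with-and-without comparison is meant to expose.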

Written by Dr. Rajiv Menon, PhD · Applied Mathematician · Bengaluru, India

Applied mathematician bridging algebraic theory and computational tools for students, engineers, and everyday problem-solvers.