Scatter Plot Calculator with Linear Regression
Enter your (x, y) data pairs below and this calculator instantly plots the regression line and returns the slope, y-intercept, regression equation, Pearson correlation coefficient (r), coefficient of determination (R-squared), root mean square error (RMSE), and basic descriptive statistics for both variables. Results update as you type.
Formula
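The quantities this page reports can be written compactly as below. This is a sketch in standard least-squares notation rather than the calculator's exact internals: SS_yy (the sum of squared y deviations) is a symbol introduced here for r, and the RMSE divides by n to match the "average squared difference" wording used on this page, although some tools divide by n - 2 instead.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, &
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} y_i \\
SS_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2, &
SS_{xy} &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad
SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
m &= \frac{SS_{xy}}{SS_{xx}}, &
b &= \bar{y} - m\,\bar{x} \\
r &= \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}, &
R^2 &= r^2 \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2}
\end{align*}
```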
Worked example
For 6 pairs: (1,2), (2,4), (3,5), (4,8), (5,9), (6,12): x-bar = 3.5, y-bar = 6.667. SS_xx = 17.5, SS_xy = 34, slope m = 34/17.5 = 1.9429, intercept b = 6.667 - 1.9429 × 3.5 = -0.133. Regression equation: y = 1.9429x - 0.133. Pearson r = 0.9905, R-squared = 0.9810. The line explains 98.1% of the variance in y.
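The arithmetic in this example can be checked with a few lines of plain Python. This is a minimal sketch using only the six pairs listed above; the variable names are illustrative and are not taken from the calculator.

```python
# Recompute the worked example from the six (x, y) pairs.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 8, 9, 12]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sum of squared x deviations and sum of cross-products of deviations.
ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least-squares slope and intercept.
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

print(f"x-bar = {x_bar:.3f}, y-bar = {y_bar:.3f}")
print(f"SS_xx = {ss_xx:.4f}, SS_xy = {ss_xy:.4f}")
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Running it prints the means, the two sums, and the fitted coefficients, which makes it easy to compare against the figures quoted in the example.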
What is a scatter plot and why does it matter?
A scatter plot graphs paired (x, y) data on a coordinate plane. Each point represents one observation, with x on the horizontal axis and y on the vertical. The pattern of the cloud reveals whether a relationship exists between the two variables: a tight upward-slanting band suggests a positive linear relationship, a downward band suggests a negative one, a circular cloud suggests no relationship, and a curved band suggests a nonlinear one. Scatter plots are the essential first step before any regression analysis because they let you see whether fitting a straight line is even appropriate.
How linear regression and the regression equation work
Linear regression finds the single straight line that minimises the sum of squared vertical distances from each data point to the line (the least-squares criterion). The line has the equation y = mx + b, where m is the slope (the amount y changes for each one-unit increase in x) and b is the y-intercept (the predicted y when x is zero). The slope is calculated as SS_xy divided by SS_xx, and the intercept is y-bar minus m times x-bar, where SS_xy is the sum of (xi minus x-bar)(yi minus y-bar) over all n points, and SS_xx is the sum of squared x deviations. Once you have the equation you can substitute any x value to predict the corresponding y.
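The least-squares criterion can be illustrated directly: fit the line, then confirm that nudging the slope or intercept away from the fitted values only increases the sum of squared vertical distances. The sketch below assumes a small sample data set, and the helper names `fit_line`, `sse`, and `predict` are hypothetical, chosen for this illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    return slope, y_bar - slope * x_bar


def sse(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def predict(x, slope, intercept):
    """Predicted y for a given x on the line y = slope * x + intercept."""
    return slope * x + intercept


# Sample data, assumed for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 8, 9, 12]

m, b = fit_line(xs, ys)
best = sse(xs, ys, m, b)

# Try eight nearby lines; none should beat the fitted one.
nudged = [
    sse(xs, ys, m + dm, b + db)
    for dm in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (dm, db) != (0.0, 0.0)
]
print("fitted line has the smallest SSE:", all(best < s for s in nudged))
print("predicted y at x = 7:", round(predict(7, m, b), 4))
```

Checking a handful of neighbouring lines is not a proof of optimality, but it gives a concrete feel for what "minimises the sum of squared distances" means.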
Understanding r, R-squared, and RMSE
The Pearson correlation coefficient r measures the strength and direction of the linear relationship on a scale from -1 to +1. A value of +1 means perfect positive correlation (all points on an upward line), -1 means perfect negative correlation, and 0 means no linear pattern, although a curved relationship can still be present. For a straight-line fit with a single x variable, R-squared (the coefficient of determination) equals r squared, and it tells you what fraction of the total variation in y is explained by the regression line: an R-squared of 0.80 means the line accounts for 80% of the variation. RMSE (root mean square error) is the square root of the average squared difference between the actual y values and the values predicted by the line, so it is expressed in the same units as y and gives a practical sense of the typical prediction error.
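The three statistics can be computed side by side to see how they relate. The following is a minimal sketch on an assumed sample data set. It divides the squared error by n for the RMSE, following the "average squared difference" description, whereas some tools divide by n - 2, so their figure may differ slightly.

```python
import math

# Sample data, assumed for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 8, 9, 12]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_yy = sum((y - y_bar) ** 2 for y in ys)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Pearson correlation coefficient.
r = ss_xy / math.sqrt(ss_xx * ss_yy)

# Fit the line and collect the residuals (actual minus predicted y).
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)

# R-squared as the explained fraction of variation; equals r**2 here.
r_squared = 1 - sse / ss_yy

# RMSE with divisor n (an assumption; see the note above).
rmse = math.sqrt(sse / n)

print(f"r = {r:.4f}, R-squared = {r_squared:.4f}, RMSE = {rmse:.4f}")
```

Computing R-squared from the residuals rather than by squaring r shows that the two routes agree for a one-variable straight-line fit.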
Limitations and common mistakes
Linear regression assumes a straight-line relationship, which you should verify visually before trusting the output. Outliers can pull the slope significantly; one extreme point can change r from strong to weak. Correlation is not causation: a high r only means the two variables move together, not that one causes the other. With a small sample (fewer than 10 points) even a high r can appear by chance. Always check the scatter plot first, look for obvious curvature or clusters, and treat the regression equation as a starting model rather than a definitive truth.
Interpreting the Pearson correlation coefficient (r)
| Absolute value of r | Strength | Direction | Typical interpretation |
|---|---|---|---|
| 0.90 - 1.00 | Very strong | Positive or negative | Near-perfect linear relationship |
| 0.70 - 0.89 | Strong | Positive or negative | Reliable linear predictor |
| 0.50 - 0.69 | Moderate | Positive or negative | Noticeable trend; other factors matter |
| 0.30 - 0.49 | Weak | Positive or negative | Slight trend; poor prediction |
| 0.00 - 0.29 | Very weak | Positive or negative | Essentially no linear relationship |
These ranges are widely used guidelines. The appropriate threshold depends on the field of study and sample size.
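The bands in the table translate naturally into a small lookup. The function below is a hypothetical helper, not part of the calculator; treating r = 0 as having no direction and rejecting values outside -1 to +1 are assumptions made for this sketch.

```python
def describe_r(r):
    """Map a Pearson r to the (strength, direction) labels used in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")

    size = abs(r)
    if size >= 0.90:
        strength = "very strong"
    elif size >= 0.70:
        strength = "strong"
    elif size >= 0.50:
        strength = "moderate"
    elif size >= 0.30:
        strength = "weak"
    else:
        strength = "very weak"

    # Assumption: an r of exactly zero is reported as having no direction.
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction


for value in (0.95, -0.75, 0.55, -0.35, 0.10):
    print(value, describe_r(value))
```

Because the thresholds are guidelines, keeping them in one function makes it simple to adjust the cut-offs for a particular field.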
Frequently asked questions
How many data points do I need for linear regression?
You need at least 2 points to fit a line, but with only 2 points r is always exactly 1 or -1 (provided the points differ in both x and y), because a straight line passes exactly through any 2 points. Meaningful inference typically requires at least 10 to 20 points, and the more data you have the more reliable the regression coefficients and correlation will be.
What does a negative slope mean?
A negative slope means that as x increases, the predicted y value decreases. For example, if x is the number of hours of sleep and y is tiredness, a negative slope would indicate that more sleep is associated with less tiredness. The slope and r always share the same sign, so a negative slope always comes with a negative r; the size of r then tells you how tightly the points follow that downward line.
What is the difference between r and R-squared?
Pearson r tells you both the direction (positive or negative) and the strength of the linear relationship, on a scale from -1 to +1. R-squared is r raised to the power of 2, so it is always between 0 and 1 and tells you only the strength: specifically, the proportion of the total variance in y that is explained by the regression line. Use r when you care about direction; use R-squared when you want to compare the explanatory power of different models.
Can I use this calculator for nonlinear data?
This calculator fits a straight line (linear model). If your scatter plot shows a curve, an exponential trend, or a power relationship, the linear equation and r will be misleading. In that case you would need to either transform the data (e.g., take logarithms) or use a nonlinear regression tool. Always look at the scatter plot first to judge whether a straight line is a reasonable model.
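The logarithm transform mentioned above can be sketched as follows. The data are made up to follow y = 2 · exp(0.5x) exactly, and `fit_and_r` is a hypothetical helper. Fitting a line to (x, ln y) recovers the growth rate as the slope and the starting value as the exponential of the intercept.

```python
import math


def fit_and_r(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    return slope, intercept, r


# Made-up data that follow y = 2 * exp(0.5 * x) exactly.
xs = [0, 1, 2, 3, 4, 5]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Straight-line fit on the raw data: r falls short of 1 because of the curve.
_, _, r_raw = fit_and_r(xs, ys)

# Straight-line fit on (x, ln y): the relationship becomes linear.
log_ys = [math.log(y) for y in ys]
slope_log, intercept_log, r_log = fit_and_r(xs, log_ys)

# Back-transform to the exponential form y = a * exp(k * x).
a = math.exp(intercept_log)
k = slope_log

print(f"raw fit r = {r_raw:.4f}, log fit r = {r_log:.4f}")
print(f"recovered model: y = {a:.3f} * exp({k:.3f} * x)")
```

The same idea works for power relationships by taking logarithms of both x and y, provided all values are positive.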
What does RMSE tell me that R-squared does not?
R-squared is a dimensionless ratio between 0 and 1, so it is easy to compare across datasets but does not tell you the actual prediction error in real units. RMSE is in the same units as your y variable, so it gives you a concrete sense of how far off a typical prediction will be. For example, if y is temperature in degrees Celsius and RMSE is 2.3, the regression line is typically about 2.3 degrees off for any given prediction.
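A quick way to see the difference is to change the units of y. In the sketch below, hypothetical temperature readings are fitted once in Celsius and once in Fahrenheit: R-squared is unchanged while RMSE scales by 1.8. The data and the helper `r_squared_and_rmse` are assumptions for illustration, and RMSE uses divisor n.

```python
import math


def r_squared_and_rmse(xs, ys):
    """Fit a least-squares line and return (R-squared, RMSE with divisor n)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1 - sse / ss_yy, math.sqrt(sse / n)


# Hypothetical readings: hours elapsed versus temperature in degrees Celsius.
hours = [0, 1, 2, 3, 4, 5]
temps_c = [10.0, 12.5, 13.0, 16.5, 17.0, 20.5]

# The same readings expressed in degrees Fahrenheit.
temps_f = [c * 9 / 5 + 32 for c in temps_c]

r2_c, rmse_c = r_squared_and_rmse(hours, temps_c)
r2_f, rmse_f = r_squared_and_rmse(hours, temps_f)

print(f"Celsius:    R-squared = {r2_c:.4f}, RMSE = {rmse_c:.4f}")
print(f"Fahrenheit: R-squared = {r2_f:.4f}, RMSE = {rmse_f:.4f}")
print(f"RMSE ratio = {rmse_f / rmse_c:.4f}")
```

This is why RMSE values are only comparable between models that predict the same quantity in the same units.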
How do outliers affect the regression line?
Outliers, especially those far from the center of the x-range (called high-leverage points), can pull the slope strongly in their direction and inflate or deflate r dramatically. If you suspect an outlier is distorting your results, try running the regression with and without it, and look at the scatter plot to see whether the line is being unduly influenced.
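The with-and-without comparison suggested above can be sketched in a few lines. The base data and the added point (15, 3) are made up for illustration, and `fit_and_r` is a hypothetical helper. The added point sits far to the right of the other x values, so it has high leverage.

```python
import math


def fit_and_r(points):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    return slope, intercept, r


# Six points with a clear upward trend (made-up data).
base = [(1, 2), (2, 4), (3, 5), (4, 8), (5, 9), (6, 12)]

# One extra point far from the centre of the x-range and well below the trend.
outlier = (15, 3)

slope_without, _, r_without = fit_and_r(base)
slope_with, _, r_with = fit_and_r(base + [outlier])

print(f"without outlier: slope = {slope_without:.4f}, r = {r_without:.4f}")
print(f"with outlier:    slope = {slope_with:.4f}, r = {r_with:.4f}")
```

A large gap between the two fits is a signal to investigate the point; it does not by itself justify deleting it.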