Mathematics (Statistics AP), High School

Regression & Correlation

Regression and correlation are fundamental statistical tools used to understand and quantify the relationship between two or more variables. Correlation measures the strength and direction of a linear relationship, while regression models that relationship to make predictions.

This guide covers the correlation coefficient, coefficient of determination, least squares regression, scatter plots, residuals, and extrapolation -- with step-by-step worked examples and a 10-question practice quiz.

1. Introduction

In statistics, we often want to know: Do two variables move together? If so, how strongly? And can we use one variable to predict the other? These are the questions that correlation and regression answer.

Correlation quantifies the linear relationship between two quantitative variables, while regression provides a mathematical equation for prediction. Together, they form the foundation of predictive analytics used in fields from economics to medicine.

Why This Matters

Regression analysis is used everywhere: predicting house prices, modeling stock returns, understanding how study hours affect grades, forecasting sales, and evaluating medical treatments. It is one of the most practical statistical tools you will learn.

Correlation

Measures the strength and direction of a linear relationship. Ranges from -1 to +1.

Regression

Models the relationship as an equation, allowing you to predict Y from X.

2. Key Definitions

Correlation Coefficient (r)

A numerical measure of the strength and direction of the linear relationship between two quantitative variables. Ranges from -1 to +1.

Coefficient of Determination (r²)

The proportion of variance in Y explained by X. Ranges from 0 to 1. A higher value means a better fit.

Slope (b1)

The estimated change in Y for every one-unit increase in X. A positive slope means Y increases as X increases.

Intercept (b0)

The estimated value of Y when X is zero. Should be interpreted cautiously if X = 0 is outside the data range.

Residual

The difference between observed Y and predicted Y-hat: e = Y - Y-hat.

Least Squares

The method that finds the line minimizing the sum of squared residuals (vertical distances from points to the line).

Extrapolation

Predicting Y for X values outside the observed data range. Generally unreliable and risky.

3. Correlation Coefficient (r)

The correlation coefficient (r) quantifies the strength and direction of a linear relationship between two quantitative variables. It always falls between -1 and +1.

Interpreting r Values

r = +1: Perfect positive linear relationship
0.7 to 0.9: Strong positive linear relationship
0.3 to 0.5: Moderate positive linear relationship
r = 0: No linear relationship
-0.7 to -0.9: Strong negative linear relationship
r = -1: Perfect negative linear relationship

Strength vs. Direction

The sign of r tells you the direction (positive or negative). The absolute value tells you the strength. So r = -0.9 is a stronger relationship than r = 0.5, even though it is negative.
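As a rough illustration of sign versus strength, the sketch below computes r with the sums formula listed in this guide's Key Formulas section. The function name `correlation` is a hypothetical helper, and the data are the hours and scores from the guide's worked examples; negating the scores is only a device to produce a downward trend.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r, computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

hours = [2, 3, 4, 5, 6]
scores = [60, 75, 80, 85, 90]

r_up = correlation(hours, scores)                  # upward trend
r_down = correlation(hours, [-s for s in scores])  # same data, flipped downward

print(round(r_up, 2), round(r_down, 2))  # 0.96 -0.96
```

Both values have the same absolute value, so the two relationships are equally strong; only the direction differs.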

4. Coefficient of Determination (r²)

In simple linear regression, the coefficient of determination (r²) is the square of the correlation coefficient. It tells you what proportion of the variation in the dependent variable (Y) is explained by its linear relationship with the independent variable (X); multiply by 100 to state it as a percentage.

How to Interpret r²

Use this template: "[r² x 100] percent of the variation in [Y] can be explained by the linear relationship with [X]."

Example: If r² = 0.64, then 64% of the variation in exam scores can be explained by hours studied. The remaining 36% is due to other factors or random variation.

r² is the "Percentage Explained"

Think of r² as how much of the "story" (variation in Y) is told by X. A higher r² means X is a better linear predictor of Y. But remember -- the remaining percentage is NOT a calculation "error." It reflects other factors and random variation that X does not account for.
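One way to make "proportion of variation explained" concrete is to compare the variation left over after fitting the line with the total variation in Y. The sketch below assumes the regression line Y-hat = 50 + 7x from this guide's worked examples; the names `sst` (total variation) and `sse` (leftover variation) are labels chosen here, not notation used elsewhere in the guide.

```python
hours = [2, 3, 4, 5, 6]
scores = [60, 75, 80, 85, 90]

# Predictions from the guide's worked-example line: Y-hat = 50 + 7x
predicted = [50 + 7 * x for x in hours]

mean_score = sum(scores) / len(scores)

# Total variation of Y around its mean
sst = sum((y - mean_score) ** 2 for y in scores)
# Variation left unexplained by the line (sum of squared residuals)
sse = sum((y - p) ** 2 for y, p in zip(scores, predicted))

r_squared = 1 - sse / sst
print(sst, sse, round(r_squared, 4))  # 530.0 40 0.9245
```

The result matches squaring r (about 0.96) for the same data, up to rounding.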

5. The Regression Line

The Least Squares Regression Line (LSRL) is the straight line that best fits the data by minimizing the sum of squared residuals. Its equation is: Y-hat = b0 + b1 * x.

Calculating the Slope and Intercept

Slope (b1)

b1 = [n(Sum of xy) - (Sum of x)(Sum of y)] / [n(Sum of x²) - (Sum of x)²]

Intercept (b0)

b0 = y-bar - b1 * x-bar

(where y-bar = mean of Y, x-bar = mean of X)
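The two formulas above translate almost line for line into code. This is a minimal sketch, with `least_squares_line` as a hypothetical helper name, applied to the hours and scores data from the guide's worked examples.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line Y-hat = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope: [n(Sum of xy) - (Sum of x)(Sum of y)] / [n(Sum of x^2) - (Sum of x)^2]
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: y-bar - b1 * x-bar
    b0 = sum_y / n - b1 * (sum_x / n)
    return b0, b1

hours = [2, 3, 4, 5, 6]
scores = [60, 75, 80, 85, 90]

b0, b1 = least_squares_line(hours, scores)
print(b0, b1)  # 50.0 7.0
```

Note that the slope formula divides by zero if every x value is identical, in which case no least squares line of this form exists.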

Interpreting the Line

Slope Interpretation

For every one-unit increase in X, the predicted value of Y changes by b1 units. Example: "For every additional hour studied, the predicted exam score increases by 7 points."

Intercept Interpretation

When X = 0, the predicted Y is b0. Be cautious -- if X = 0 is outside your data range, this value may not be meaningful in context.

The Regression Line Always Passes Through (x-bar, y-bar)

This is a key property of the least squares line. The point of means is always on the line, which can help you verify your calculations.
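The property follows directly from the intercept formula. A short derivation, writing x-bar and y-bar as \bar{x} and \bar{y}:

```latex
\begin{aligned}
\hat{y}(\bar{x}) &= b_0 + b_1 \bar{x} \\
                 &= (\bar{y} - b_1 \bar{x}) + b_1 \bar{x} \\
                 &= \bar{y}
\end{aligned}
```

With the worked-example values (x-bar = 4, y-bar = 78, b0 = 50, b1 = 7), the check is 50 + 7(4) = 78.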

6. Scatter Plots

A scatter plot is a graphical representation of paired data that shows the relationship between two quantitative variables. The independent variable (X) is plotted on the horizontal axis, and the dependent variable (Y) on the vertical axis.

What to Look For

Direction

Is the general pattern upward (positive) or downward (negative)?

Form

Is the relationship linear, curved, or no clear pattern?

Strength

How closely do the points cluster around a line or curve?

Outliers

Are there points far from the general pattern? They can heavily influence results.

Always Plot Your Data First

Before computing r or fitting a regression line, always create a scatter plot. It reveals the form of the relationship, potential outliers, and whether a linear model is appropriate. Numbers alone can be misleading.

7. Residuals & Extrapolation

Residuals

A residual is the vertical distance between an observed data point and the regression line: e = Y - Y-hat. Positive residuals mean the point is above the line; negative residuals mean it is below.

Residual Plots

A residual plot graphs residuals (y-axis) against predicted values or X values (x-axis). In a good model, residuals should be randomly scattered around zero with no discernible pattern.

Random scatter: The linear model is appropriate.

Curved pattern: A non-linear model may be more appropriate.

Fanning out: The spread of residuals is not constant (heteroscedasticity).
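Before plotting residuals, it helps to compute them. The sketch below assumes the line Y-hat = 50 + 7x and the data from this guide's worked examples, and lists each residual with its position relative to the line.

```python
hours = [2, 3, 4, 5, 6]
scores = [60, 75, 80, 85, 90]

# Residual = observed - predicted, using Y-hat = 50 + 7x
residuals = [y - (50 + 7 * x) for x, y in zip(hours, scores)]
print(residuals)  # [-4, 4, 2, 0, -2]

for x, e in zip(hours, residuals):
    if e > 0:
        position = "above the line"
    elif e < 0:
        position = "below the line"
    else:
        position = "on the line"
    print(f"x = {x}: residual {e:+d}, point is {position}")

# For a least squares line with an intercept, the residuals sum to zero
print(sum(residuals))  # 0
```

With only five points it is hard to judge a pattern, so treat this as a mechanical illustration rather than a real diagnostic.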

Extrapolation

Extrapolation means predicting Y for X values outside the range of your observed data. This is generally unreliable and dangerous because the linear trend observed in your data may not continue beyond it.

Extrapolation Warning

If your data covers ages 5-15 and you use the regression line to predict height at age 50, the prediction will almost certainly be wrong. The linear growth pattern does not continue indefinitely.
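A simple safeguard is to flag any prediction whose X value falls outside the observed range. The helper below, `predict_with_range_check`, is a hypothetical name, and the line Y-hat = 50 + 7x comes from the guide's worked examples.

```python
def predict_with_range_check(x_new, xs, b0, b1):
    """Return (prediction, is_extrapolation) for the line Y-hat = b0 + b1 * x."""
    is_extrapolation = x_new < min(xs) or x_new > max(xs)
    return b0 + b1 * x_new, is_extrapolation

hours = [2, 3, 4, 5, 6]  # observed X values range from 2 to 6

print(predict_with_range_check(4.5, hours, 50, 7))  # (81.5, False)
print(predict_with_range_check(12, hours, 50, 7))   # (134, True)
```

The second prediction, 134 points for 12 hours of study, would be impossible if the exam were scored out of 100 (an assumption; the guide does not state the maximum score). It shows how a line can give nonsensical values outside its data range.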

8. Worked Examples

Example 1: Calculating r and the Regression Line

Find the correlation coefficient and regression line for 5 students' hours studied (X) and exam scores (Y).

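| Student | X (Hours) | Y (Score) | XY   | X² | Y²    |
|---------|-----------|-----------|------|----|-------|
| 1       | 2         | 60        | 120  | 4  | 3600  |
| 2       | 3         | 75        | 225  | 9  | 5625  |
| 3       | 4         | 80        | 320  | 16 | 6400  |
| 4       | 5         | 85        | 425  | 25 | 7225  |
| 5       | 6         | 90        | 540  | 36 | 8100  |
| Sum     | 20        | 390       | 1630 | 90 | 30950 |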

n = 5

Correlation coefficient:

r = [5(1630) - (20)(390)] / sqrt[(5 x 90 - 400)(5 x 30950 - 152100)]

r = 350 / sqrt(50 x 2650) = 350 / 364.0 = 0.96

Interpretation: Very strong positive linear correlation between hours studied and exam scores.

Example 2: Regression Line and Prediction

Using the same data, find the regression line and predict the score for 4.5 hours of study.

Slope: b1 = [5(1630) - (20)(390)] / [5(90) - (20)²] = 350 / 50 = 7

Means: x-bar = 20/5 = 4, y-bar = 390/5 = 78

Intercept: b0 = 78 - 7(4) = 78 - 28 = 50

Regression equation: Y-hat = 50 + 7x

Prediction for x = 4.5: Y-hat = 50 + 7(4.5) = 50 + 31.5 = 81.5

r² = (0.96)² = 0.92 -- approximately 92% of the variation in exam scores is explained by hours studied.

9. Key Formulas

Correlation Coefficient (r)

r = [n(Sum xy) - (Sum x)(Sum y)] / sqrt[(n Sum x² - (Sum x)²)(n Sum y² - (Sum y)²)]

Regression Line (LSRL)

Y-hat = b0 + b1 * x

Slope (b1)

b1 = [n(Sum xy) - (Sum x)(Sum y)] / [n(Sum x²) - (Sum x)²]

Intercept (b0)

b0 = y-bar - b1 * x-bar

Residual

e = Y - Y-hat (Observed - Predicted)

Coefficient of Determination

r² = (r)²

10. Memory Aids

"Correlation is NOT Causation!"

Repeat this mantra. A strong correlation only means variables move together -- it does NOT mean one causes the other.

r's Range: "Scale of Agreement"

Think of r from -1 to +1 like a scale: -1 is total disagreement (negative), +1 is total agreement (positive), 0 is no agreement.

r² = "Percentage Explained"

How much of the story (variation in Y) is told by X. Higher r² = better predictor.

Residuals: "Actual Minus Predicted"

e = Y - Y-hat. Positive means the line underestimated the actual value; negative means it overestimated.

11. Common Mistakes

Assuming Correlation Implies Causation

The most common mistake. A strong correlation only shows association, not cause-and-effect. There could be lurking variables or reverse causation.

Extrapolating Beyond the Data

Using the regression line to predict Y for X values outside your observed range is risky. The linear trend may not continue beyond the data.

Ignoring Outliers

Outliers can heavily influence both r and the regression line, potentially distorting the true relationship. Always inspect your scatter plot.
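To see how much a single point can matter, the sketch below adds one made-up outlier to the hours and scores data from the guide's worked examples. The extra point (10 hours, score 40) is purely hypothetical and chosen for illustration, and `correlation` is a helper name introduced here.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r, computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

hours = [2, 3, 4, 5, 6]
scores = [60, 75, 80, 85, 90]
r_clean = correlation(hours, scores)

# Hypothetical outlier: a student who studied 10 hours but scored 40
r_outlier = correlation(hours + [10], scores + [40])

print(round(r_clean, 2), round(r_outlier, 2))  # 0.96 -0.46
```

One unusual point is enough here to turn a very strong positive correlation into a moderate negative one, which is why the scatter plot should be inspected first.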

Using Linear Regression for Non-Linear Data

If the scatter plot shows a curved relationship, a linear model will be a poor fit. Check the residual plot for patterns before trusting your model.

Confusing b1 with r

r is a standardized measure of strength and direction (-1 to 1). b1 is the actual change in Y per unit change in X, and its value depends on the units of measurement.
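The difference shows up when the units of X change. In this sketch the study time from the guide's worked examples is re-expressed in minutes; `slope_and_r` is a hypothetical helper that returns both quantities from the sums formulas.

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Return (b1, r) computed from the raw-sums formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = n * sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y
    sxx = n * sum(x * x for x in xs) - sum_x ** 2
    syy = n * sum(y * y for y in ys) - sum_y ** 2
    return sxy / sxx, sxy / sqrt(sxx * syy)

hours = [2, 3, 4, 5, 6]
scores = [60, 75, 80, 85, 90]
minutes = [60 * h for h in hours]  # same study time, different units

b1_hours, r_hours = slope_and_r(hours, scores)
b1_minutes, r_minutes = slope_and_r(minutes, scores)

print(round(b1_hours, 4), round(r_hours, 4))      # 7.0 0.9615
print(round(b1_minutes, 4), round(r_minutes, 4))  # 0.1167 0.9615
```

The slope drops from 7 points per hour to about 0.117 points per minute, while r is unchanged because it is unit-free.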

Quick Revision

  • Correlation (r) measures the strength and direction of a linear relationship. Ranges from -1 to +1.
  • r² tells you what percentage of the variation in Y is explained by X.
  • Regression line: Y-hat = b0 + b1*x. Slope (b1) = change in Y per unit X. Intercept (b0) = predicted Y when X = 0.
  • Least squares minimizes the sum of squared residuals to find the best-fitting line.
  • Residual: e = Y - Y-hat. Positive = above the line, negative = below.
  • Scatter plots reveal direction, form, strength, and outliers. Always plot before computing.
  • Extrapolation is dangerous -- do not predict beyond your data range.
  • Correlation does NOT imply causation.
  • Outliers can heavily influence both r and the regression line.

Frequently Asked Questions

What is the main difference between correlation and regression?
Correlation measures the strength and direction of a linear relationship between two variables. Regression models that relationship with an equation, allowing you to make predictions.
Can I use regression if my variables are not linearly related?
Linear regression is specifically for linear relationships. If your data shows a curve, you may need non-linear regression techniques or transformations (like logarithmic or polynomial).
What does a negative correlation mean?
A negative correlation means that as one variable increases, the other tends to decrease. For example, as hours spent watching TV increase, exam scores might decrease.
Is a high r-value always good?
A high absolute r-value indicates a strong linear relationship, which is often desirable for prediction. However, it does not guarantee the model is appropriate -- an outlier could inflate r. Always examine the scatter plot.
What if r = 0?
r = 0 indicates no linear relationship. However, there might still be a non-linear relationship (e.g., a parabolic curve). Always visualize your data with a scatter plot.
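The last answer can be checked numerically. The sketch below uses made-up points lying exactly on the parabola y = x², symmetric around x = 0, and a hypothetical `correlation` helper based on the sums formula.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r, computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # [4, 1, 0, 1, 4]: Y is completely determined by X

r_parabola = correlation(xs, ys)
print(r_parabola)  # 0.0
```

Y depends perfectly on X here, yet r is exactly 0 because the relationship has no linear component.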

Practice Quiz

Test your knowledge: select the correct answer for each question.

1. Which of the following values for the correlation coefficient (r) indicates the strongest linear relationship?

2. The coefficient of determination (r²) measures:

3. If the slope of a regression line is -3, it means:

4. A scatter plot shows points tightly clustered around a downward-sloping line. The correlation coefficient (r) would likely be:

5. What is the primary danger of extrapolation in regression?

6. Which of the following statements is TRUE?

7. If r² = 0.75, what does this mean?

8. The formula Y-hat = b0 + b1 * x represents the:

9. An outlier in a scatter plot can:

10. If a residual for a data point is positive, it means:

Final Study Advice

  1. Practice computing r and the regression line by hand to build intuition for the formulas.
  2. Always create a scatter plot before interpreting r or fitting a line.
  3. Use r² to communicate results -- saying "64% of the variation is explained" is more intuitive than "r = 0.8."
  4. Check residual plots for patterns before trusting your linear model.
  5. Remember: a high r does not mean the model is correct. Outliers and non-linearity can mislead.
