Correlation and regression both answer the same broad question — do these two things move together? — but they answer it in different ways. Correlation gives you a single number that measures how strongly two quantitative variables track each other, and in which direction. Regression goes a step further and draws the line, so you can actually predict one variable from the other. Neither of them, on its own, tells you that one thing causes the other — and that gap is where most of the trouble lives.
I taught this near the end of nearly every stats course, and it’s where students are most likely to over-read their results. So let me lay out what each tool genuinely tells you, and then be very clear about the line you can’t cross with either.
Correlation: how strongly do they move together?
The correlation coefficient, written r, is a number between −1 and +1 that captures a linear relationship between two quantitative variables.
- The sign tells you direction. Positive r: as one goes up, the other tends to go up. Negative r: as one goes up, the other tends to go down.
- The size tells you strength. Close to +1 or −1 is a strong, tight relationship. Close to 0 means little or no linear relationship.
The word “linear” matters. An r near zero only means there’s no straight-line pattern. Two variables can be tightly related in a curved way and still show a correlation near zero, so a low r is not the same as “no relationship.” Accident rates against driver age, for instance, make a U-shape (high for the youngest and oldest drivers, low in the middle), and a straight-line correlation reads that as almost nothing, even though the pattern is real and strong.
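To see this with numbers, here’s a minimal Python sketch. The data are made up for illustration (the ages and accident rates are not real figures), and pearson_r is a small helper defined here, not a library function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: points that sit exactly on a straight line
hours = [0, 1, 2, 3, 4, 5]
scores = [50 + 4 * h for h in hours]
r_line = pearson_r(hours, scores)

# Hypothetical data: a symmetric U-shape (high at both ends, low in the middle)
ages = [20, 30, 40, 50, 60, 70, 80]
accident_rate = [11, 6, 3, 2, 3, 6, 11]
r_u = pearson_r(ages, accident_rate)

print(f"straight line: r = {r_line:.2f}")
print(f"U-shape:       r = {r_u:.2f}")
```

The straight-line data gives an r of essentially 1, while the U-shaped data gives an r of essentially 0, even though the accident rate is completely determined by age in this toy example. That’s why it pays to plot the data before trusting the number.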
Regression: drawing the line
Where correlation gives you one summary number, regression fits an actual line through the data, usually written as predicted y = a + b·x.
- b is the slope, the heart of the interpretation. It’s the predicted change in y for each one-unit increase in x.
- a is the intercept, the predicted value of y when x is zero. Sometimes it’s useful, sometimes it’s meaningless (if x can’t realistically be zero, don’t read much into it).
So if a regression of exam score on hours studied gives a slope of 4, you’d say: each additional hour of study is associated with about 4 more points, on average. Notice the hedge — “associated with,” not “causes.” We’ll come back to that.
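Here’s a minimal sketch of how that slope comes out of the data, using the usual least-squares formulas. The five students below are invented so the arithmetic lands on a slope of 4, and fit_line is a helper written for this example rather than a library call.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for: predicted y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope: predicted change in y per unit of x
    a = mean_y - b * mean_x     # intercept: predicted y when x is zero
    return a, b

# Hypothetical data: hours studied and exam scores for five students
hours = [1, 2, 3, 4, 5]
scores = [54, 57, 64, 65, 70]

a, b = fit_line(hours, scores)
predicted = a + b * 3.5   # prediction for a student who studied 3.5 hours

print(f"predicted score = {a:.1f} + {b:.1f} * hours")
print(f"prediction at 3.5 hours: {predicted:.1f}")
```

With these numbers the fitted line is predicted score = 50 + 4 × hours, so 3.5 hours of study predicts a score of 64. The individual students don’t sit exactly on the line, which is why the slope describes an average tendency and not a guarantee for any one person.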
r-squared: how much does x explain?
You’ll also meet r² (r-squared), which is just the correlation squared, usually read as a percentage. It tells you the proportion of the variation in y that’s explained by its linear relationship with x. An r of 0.8, for example, gives an r² of 0.64, meaning about 64% of the variation in y is accounted for by the line on x, and the rest is down to other things. It’s a quick honesty check on how much your line is really doing.
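If “proportion of variation explained” feels abstract, this sketch computes it two ways on a small invented data set (hypothetical hours and scores): once by squaring r, and once by comparing the leftover scatter around the line with the total scatter in y. For a straight-line fit with one predictor, the two should agree.

```python
from math import sqrt

# Hypothetical data: hours studied and exam scores for five students
hours = [1, 2, 3, 4, 5]
scores = [54, 57, 64, 65, 70]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in scores)   # total variation in y

# Route 1: square the correlation coefficient
r = sxy / sqrt(sxx * syy)
r_squared = r ** 2

# Route 2: 1 minus (variation left over around the fitted line / total variation)
b = sxy / sxx
a = mean_y - b * mean_x
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, scores))
explained = 1 - ss_res / syy

print(f"r squared:           {r_squared:.3f}")
print(f"explained variation: {explained:.3f}")
```

Both routes give about 0.964 here, so roughly 96% of the variation in these made-up scores is accounted for by the line. Real data is rarely this tidy.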
The line you cannot cross: correlation isn’t causation
This is the single most important idea on the page. A strong correlation, even a beautiful regression line, does not establish that one variable causes the other. Three things can produce a relationship that isn’t cause-and-effect:
- A lurking variable. Ice cream sales and drowning deaths rise together, but neither causes the other — hot weather drives both.
- Reverse causation. Maybe you assumed x causes y when it’s really y nudging x.
- Plain coincidence. With enough variables, some will track each other by chance alone.
Observational data can show you that two things move together. It takes a controlled experiment — not a regression line — to earn the word “causes.” Keeping that straight is most of what separates a careful reading from a misleading one.
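The lurking-variable case is easy to simulate. In this sketch everything is invented: temperature drives both ice cream sales and drownings, and the two never influence each other. The coefficients, noise levels, and seed are arbitrary assumptions, and pearson_r and residuals are helpers defined here. The “adjusted” figure is a partial correlation: it removes the part of each variable that temperature predicts and correlates what’s left.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def residuals(xs, ys):
    """What is left of ys after subtracting its least-squares line on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return [y - (a + b * x) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated days: temperature is the lurking variable behind both outcomes
temps = [rng.uniform(10, 35) for _ in range(500)]
ice_cream = [10 * t + rng.gauss(0, 30) for t in temps]   # depends on temperature only
drownings = [0.2 * t + rng.gauss(0, 1) for t in temps]   # depends on temperature only

r_raw = pearson_r(ice_cream, drownings)
r_adjusted = pearson_r(residuals(temps, ice_cream), residuals(temps, drownings))

print(f"ice cream vs drownings:          r = {r_raw:.2f}")
print(f"after accounting for temperature: r = {r_adjusted:.2f}")
```

The raw correlation comes out strong and positive, and it mostly vanishes once temperature is accounted for. The catch is that this only works when you know what the lurking variable is and have measured it, which is exactly what observational data often can’t promise.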
There’s a cartoon I love about exactly this. One person says, “I used to think correlation implied causation; then I took a statistics class, and now I don’t.” The other replies, “Sounds like the class helped.” First person: “Well, maybe.” That “well, maybe” is the lesson: once you’ve really got it, you stop assuming causation is hiding behind every correlation.
The mistakes I see most often
Two, and they’re both about over-reaching. The first is the causation slip above — treating a slope as if it proved cause and effect. The second is extrapolation: using the line to predict far outside the range of the data you actually have. A line fit to students who studied 0 to 10 hours tells you nothing reliable about someone who studied 40. The relationship you measured lives inside the data you collected; step outside it and the line is just a guess wearing a lab coat.
The fix is a single distinction. Predicting inside the range of your data is interpolation, and it’s fair game. Predicting outside it is extrapolation, and it’s where lines lie to you. Think of something growing fast — a crop, the price of a collectible. The early data looks like a rocket, so the line predicts the rocket forever. But crops run out of room and water, and buyers run out of money; sooner or later the curve flattens to a cap the straight line never saw coming. Stay inside your data, and watch your step at the edges.
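Here’s a small sketch of that flattening curve. The growth rule is an assumption chosen for illustration (a plant that climbs quickly and levels off at a cap of 100), and fit_line is a helper defined here. The line is fit only to the first five days, then asked about a day inside that range and a day far outside it.

```python
from math import exp

CAP = 100.0  # hypothetical maximum height the plant can reach

def true_height(day):
    """Assumed growth curve: fast at first, flattening toward CAP."""
    return CAP * (1 - exp(-day / 10))

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for: predicted y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# We only observe the early, rocket-like part of the curve
days = [0, 1, 2, 3, 4]
heights = [true_height(d) for d in days]
a, b = fit_line(days, heights)

inside, outside = 2.5, 40   # one day within the observed range, one far beyond it
line_inside, line_outside = a + b * inside, a + b * outside

print(f"day {inside}: line says {line_inside:.1f}, truth is {true_height(inside):.1f}")
print(f"day {outside}: line says {line_outside:.1f}, truth is {true_height(outside):.1f}")
```

Inside the observed range the line is off by less than one unit. At day 40 it predicts a height several times the cap, while the curve itself has nearly flattened out just below 100. Nothing in the first five days of data could have warned the line about that.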
Let’s make it click
If regression output looks like a wall of numbers you’re supposed to nod at, you’re not alone — and it’s usually one missing piece of intuition, not the whole topic, that makes it feel that way. Sorting out which number means what is exactly the kind of thing a single focused session fixes.