# What do normal residuals mean and what does this tell me about my data?

Pretty basic question:

What does it mean when the residuals from a linear regression are normally distributed? In particular, what does this tell me about the original data that went into the regression?

I’m totally stumped, thanks guys

Linear regression models the conditional expected value of your outcome. That means: if you knew the true values of the regression parameters (say $\beta_0$ and $\beta_1$), then for a given value of your predictor $X$, plugging it into the equation

$$E[Y|X] = \beta_0 + \beta_1 X$$

gives you the expected value of $Y$ over all (possible) observations that have this value of $X$.

However, you don’t really expect any single $Y$ value for that given $X$ value to be exactly equal to the (conditional) mean. Not because your model is wrong, but because there are some effects you have not accounted for (e.g. measurement error). So the $Y$ values for a given $X$ value will fluctuate around the mean value (i.e. geometrically: around the point on the regression line for that $X$).

The normality assumption, now, says that the difference between each $Y$ and its matching $E[Y|X]$ follows a normal distribution with mean zero. This means that, if you have an $X$ value, you can sample a $Y$ value by first calculating $\beta_0 + \beta_1 X$ (i.e. again $E[Y|X]$, the point on the regression line), next sampling $\epsilon$ from that normal distribution, and adding the two:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

In short: this normal distribution represents the variability in your outcome that remains once you account for the variability explained by the model.
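To make the sampling recipe concrete, here is a minimal simulation sketch. The parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the uniform range for $X$ are all made up for illustration, and it assumes NumPy is available. It generates each $Y$ as "point on the line plus normal noise", fits a line by least squares, and looks at the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Assumed "true" parameters, chosen only for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0

n = 5000
x = rng.uniform(0, 10, size=n)        # predictor values (arbitrary range)
mean_y = beta0 + beta1 * x            # E[Y|X]: the point on the regression line
eps = rng.normal(0.0, sigma, size=n)  # normal noise with mean zero
y = mean_y + eps                      # sampled outcomes

# Fit a straight line by least squares; polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

print("estimated slope and intercept:", slope, intercept)
print("residual mean and std:", residuals.mean(), residuals.std())
```

With a sample this large, the estimated slope and intercept should land close to the assumed values, and the residuals should be centred on zero with a spread close to `sigma`. A histogram of `residuals` would then look roughly bell-shaped.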

Note: in most datasets, you don’t have multiple $Y$ values for any given $X$ (unless your predictor set is categorical), but this normality goes for the whole population, not just the observations in your dataset.

Note: I’ve done the reasoning for linear regression with one predictor, but the same goes for more: just replace “line” with “hyperplane” in the above.
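As a rough illustration of the multi-predictor case, the sketch below repeats the same idea with two predictors. The coefficient vector `beta`, the noise level `sigma`, and the way the predictors are generated are hypothetical choices, and NumPy is assumed. The only change from the one-predictor case is that $E[Y|X]$ is now a point on a plane rather than on a line.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Hypothetical true coefficients: [intercept, coefficient for x1, coefficient for x2]
beta = np.array([1.0, 2.0, -0.5])
sigma = 0.5
n = 5000

# Design matrix: a column of ones for the intercept plus two predictors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])

# Each outcome is the point on the (hyper)plane plus normal noise with mean zero
y = X @ beta + rng.normal(0.0, sigma, size=n)

# Least-squares fit of all coefficients at once
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

print("estimated coefficients:", beta_hat)
print("residual std:", residuals.std())
```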