# Why divide by $n-2$ for residual standard errors

I was just watching a lecture on statistics in which someone calculated something called the residual standard error. It looked a lot like finding the average of the squared residuals, the residuals being the differences between the predictions of your model and the actual values. For a linear fit, the prediction is $\hat{y}(x_i)=mx_i+b$ and the actual value is $y_i$, so the residual is $r_i = y_i - \hat{y}(x_i)$. The residual standard error is $\sqrt{\frac{1}{n-2}\sum_i r_i^2}$. Why is dividing by $n-2$ necessary, rather than dividing by $n$?

Update: I have a better idea. If there are only two data points, then the residuals would all be zero. So you could not estimate the error with only two points. But this still does not explain why dividing by $n-2$ is a good idea. It only explains why the formula is undefined for $n=2$.
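The observation in the update can be checked directly. Here is a minimal sketch (the two data points are made up for illustration): with $n=2$, the least-squares line passes exactly through both points, so every residual is zero.

```python
import numpy as np

# With only two points, the fitted line passes exactly through both,
# so every residual is zero and sum(r^2)/(n-2) becomes 0/0.
x = np.array([0.0, 1.0])
y = np.array([1.3, 2.7])          # arbitrary made-up measurements

m, b = np.polyfit(x, y, 1)        # least-squares line through 2 points
r = y - (m * x + b)               # residuals: numerically zero
print(r)
```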

You use the residuals to estimate the distribution of the errors (see https://en.wikipedia.org/wiki/Errors_and_residuals), but the two are different things:

• Errors are the randomness that the ‘true’ model includes.
• Residuals are the differences that you ‘observe’ between a model fit and the measurements.

The residuals do not behave exactly like the errors. When you fit a model, you fit to the model plus the error terms. This means the fitting has a tendency to absorb part of the error terms in addition to the model, and this in effect shrinks the residuals relative to the true errors (i.e. residuals < errors; in this particular case, $E\left[\sum_i r_i^2\right] = (n-2)\sigma^2$, where $\sigma^2$ is the variance of the errors).
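A quick simulation illustrates this shrinkage. The parameters here are made up for illustration: a true line $y = 3x + 1$ with Gaussian noise of standard deviation $\sigma = 2$, so the true error variance is $4$. Averaging $\sum_i r_i^2 / n$ over many fits underestimates $\sigma^2$, while averaging $\sum_i r_i^2 / (n-2)$ recovers it.

```python
import numpy as np

# Made-up parameters: true line y = 3x + 1, noise sigma = 2,
# so the true error variance is sigma^2 = 4.
rng = np.random.default_rng(0)
n, sigma, trials = 10, 2.0, 20000
x = np.linspace(0.0, 1.0, n)

rss = np.empty(trials)                  # residual sum of squares per trial
for t in range(trials):
    y = 3.0 * x + 1.0 + rng.normal(0.0, sigma, n)
    m, b = np.polyfit(x, y, 1)          # least-squares line
    r = y - (m * x + b)                 # residuals
    rss[t] = np.sum(r ** 2)

biased = float(np.mean(rss / n))          # shrunk: about sigma^2 * (n-2)/n
unbiased = float(np.mean(rss / (n - 2)))  # about sigma^2
print(biased, unbiased)
```

Dividing by $n$ gives roughly $\sigma^2 (n-2)/n = 3.2$ here, while dividing by $n-2$ gives roughly $4.0$.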

The more parameters the model has (the more degrees of freedom it has to fit, and thereby absorb, part of the error terms), the less the residuals will resemble the true distribution of the errors.
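This effect of extra parameters can also be simulated. The setup below is made up for illustration: the true model is the line $y = 2x$ with noise $\sigma = 1$, but we deliberately fit polynomials with more and more parameters. With $p$ fitted parameters, $E\left[\sum_i r_i^2\right] = (n-p)\,\sigma^2$, so each extra parameter eats further into the error terms.

```python
import numpy as np

# Made-up setup: true model is the line y = 2x with noise sigma = 1,
# fitted with polynomials of increasing degree (2, 4, 6 parameters).
rng = np.random.default_rng(1)
n, sigma, trials = 20, 1.0, 5000
x = np.linspace(0.0, 1.0, n)

mean_rss = {}
for deg in (1, 3, 5):
    total = 0.0
    for _ in range(trials):
        y = 2.0 * x + rng.normal(0.0, sigma, n)
        coef = np.polyfit(x, y, deg)
        r = y - np.polyval(coef, x)
        total += np.sum(r ** 2)
    # With p = deg + 1 parameters, E[sum r_i^2] = (n - p) * sigma^2.
    mean_rss[deg] = total / trials
print(mean_rss)
```

With $n = 20$ and $\sigma^2 = 1$, the mean residual sums of squares come out near $18$, $16$, and $14$ for degrees $1$, $3$, and $5$ respectively.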

So the expression $\frac{\sum_i r_i^2}{n-2}$ is written in terms of the ‘residual’ terms $r_i$, but it is meant to estimate the variance of the ‘error’ terms. To do that without bias, it needs to divide by $n-2$ instead of $n$, because the residuals are systematically smaller than the errors.