# Where do the assumptions for linear regression come from?

I already know that there are several assumptions when using a linear regression model, but I cannot understand why some of them exist. They are:

1. independent errors
2. normal distribution of errors
3. homoscedasticity

Why can’t I simply use the least squares method without these assumptions?

I want to know how the $R^2$, slope, and p-value are affected if some assumptions are invalid.
For example: “if the independent error assumption is false, the p-value is less than its true value.”

> Why can’t I simply use the least squares method without these assumptions?

You can.

However, inference (such as the calculation of standard errors, confidence intervals and p-values) relies on those assumptions.

You can compute a least squares line without them holding… it just won’t necessarily be the best thing to do.

You can break each of those assumptions and derive something other than least squares which might make more sense.

For example, dependence might lead you to ARIMA models or mixed effects models.

Non-normal errors might lead you to GLMs (or a number of other things).

Heteroscedasticity might lead you to GLMs, weighted regression, or heteroscedasticity-consistent inference.
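As a rough illustration of the last two options, here is a minimal NumPy sketch comparing classical standard errors, White-type (HC0) heteroscedasticity-consistent standard errors, and weighted least squares. The data-generating process (true intercept 1, slope 2, error standard deviation growing with `x`) and all variable names are assumptions made up for the example, and the weighted fit assumes the error variances are known, which is rarely true in practice.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(0, 10, size=n)
sigma = 0.5 * x + 0.1                      # assumed: error SD grows with x
y = 1.0 + 2.0 * x + rng.normal(0, sigma)   # hypothetical true line
X = np.column_stack([np.ones(n), x])       # design matrix with intercept

# Ordinary least squares fit
XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y
resid = y - X @ beta_ols

# Classical standard errors (rely on constant error variance)
s2 = resid @ resid / (n - 2)
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# HC0 "sandwich" standard errors: (X'X)^-1 X' diag(e^2) X (X'X)^-1
meat = X.T @ (X * resid[:, None] ** 2)
se_hc0 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# Weighted least squares with weights 1/sigma^2 (variances assumed known)
w = 1.0 / sigma**2
beta_wls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))

print("OLS coefficients:      ", beta_ols)
print("WLS coefficients:      ", beta_wls)
print("classical SE:          ", se_classical)
print("HC0 SE:                ", se_hc0)
```

In this particular setup the classical intercept standard error should come out noticeably larger than the HC0 one, because the errors are smallest near `x = 0`; other variance patterns can push the comparison the other way.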

As for where they come from –

• The independence assumption is basically something that holds approximately in many cases, and assuming exact independence makes life (much) easier.

• Normality is a good approximation to errors in some cases (if you have many sources of small, independent errors, where none dominate, for example, the overall error will tend to be approximately normal), and again makes life easier (least squares is maximum likelihood there).

The Gauss-Markov theorem is relevant here: among linear unbiased estimators, least squares has the smallest variance when the errors are uncorrelated with equal variance, and this does not require normality. So, at least for cases where not all linear estimators are bad, it encourages us that we might still consider least squares when those assumptions don’t all hold.

• Constant variance is another simplifying assumption that is sometimes true.

When you take all three together, the kinds of inference mentioned above become very tractable. And sometimes, those assumptions are reasonable.
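To make the remark about maximum likelihood concrete, here is a short sketch, assuming the model $y_i = x_i^\top\beta + \varepsilon_i$ with independent $\varepsilon_i \sim N(0,\sigma^2)$ (the symbols $x_i$, $\beta$ and $\sigma^2$ are notation introduced here for illustration). The log-likelihood is

$$\ell(\beta,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top\beta\right)^2 .$$

For any fixed $\sigma^2$, maximizing $\ell$ over $\beta$ is the same as minimizing $\sum_i (y_i - x_i^\top\beta)^2$, which is the least squares criterion. Each assumption shows up in the formula: independence gives a plain sum over observations, normality gives the squared-error form, and constant variance gives the single $\sigma^2$ that weights every term equally.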

If sample sizes are large and no points are unduly influential, normality is probably the least critical; inference-wise you can get by with a little non-normality quite happily, as long as you’re not trying to construct prediction intervals.

Historically speaking, you might find this interesting:

http://en.wikipedia.org/wiki/Least_squares#History

and perhaps also this (if you can access it).

Edit:

> whether slope, p-value, or $R^2$ is still valid if some assumption is invalid

I’ll have to make some assumptions about what you mean by ‘valid’.

The Wikipedia article on OLS mentions some details on consistency and optimality in the second paragraph. Later in the same article it discusses various assumptions and their violation.

This paper discusses the consistency of least squares slope estimates under various conditions, but if you don’t know things like the difference between the different types of convergence it might not help much.

For the effect of contravening the assumption of equal variances, see here.

The distribution of p-values relies on all the assumptions, but as the sample size gets very large (under some conditions I’m not going to go into here), the CLT gives you approximate normality of the parameter estimates even when the errors aren’t normal; as a result, mild non-normality in particular won’t necessarily be an issue if the samples are reasonably large. The p-values do rely on the equal variance assumption (see the above link on heteroskedasticity), and on the independence assumption.
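The CLT point can be checked with a small simulation. The sketch below is only an illustration under assumed settings: strongly skewed errors (a shifted exponential), a hypothetical true line with intercept 1 and slope 2, and `n = 200` observations per sample. It compares the skewness of the raw errors with the skewness of the fitted slopes across many simulated samples.

```python
import numpy as np

def skewness(v):
    """Sample skewness: third central moment over the cubed standard deviation."""
    d = v - v.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

rng = np.random.default_rng(0)
n, reps = 200, 4000                 # assumed sample size and number of replications
x = rng.uniform(-1, 1, size=n)      # one fixed design, reused in every replication
xc = x - x.mean()

# Skewed, mean-zero errors: exponential(1) shifted down by 1 (an assumed choice)
errors = rng.exponential(1.0, size=(reps, n)) - 1.0
y = 1.0 + 2.0 * x + errors          # hypothetical true intercept 1 and slope 2

# Least squares slope for each replication: sum(xc * y) / sum(xc^2)
slopes = (y @ xc) / np.sum(xc**2)

error_skew = skewness(errors.ravel())
slope_skew = skewness(slopes)
print("skewness of errors:         ", error_skew)
print("skewness of slope estimates:", slope_skew)
```

The errors should show skewness near 2, while the slope estimates should be close to symmetric, which is the sense in which the sampling distribution is approximately normal despite non-normal errors.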

On $R^2$ – if you think of $R^2$ as estimating a population quantity, then being based on variance it’s critically impacted by violation of the equal variance and independence assumptions. On the other hand $R^2$ isn’t generally a particularly important quantity.

Major Edit 2:

> Sorry for the unclear question. I want to know some conclusion like “if the independent error assumption is false, the p-value is less than its true value.” Or whether this kind of conclusion exists.

The problem with breaking independence is that there are infinitely many ways that errors can be dependent, and the direction of effects on things like p-values can be complex. There’s no single simple rule unless the domain is restricted somewhat. If you specify particular forms and directions of dependence, some conclusions are possible.

For example, when the errors are positively autocorrelated, the usual least squares estimates of the slope standard errors tend to be too small, which biases t-ratios away from 0 and hence makes p-values lower (more significant) than they should be.
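A small simulation can make the autocorrelation example concrete. The sketch below assumes AR(1) errors with coefficient `rho` and a linear time trend as the regressor (both are choices made for illustration; the size of the effect depends on how the regressor itself behaves over time). It compares the actual spread of the slope estimates across replications with the average standard error that the usual least squares formula reports.

```python
import numpy as np

def simulate_slope_se(rho, n=100, reps=2000, seed=0):
    """Return (empirical SD of slope estimates, mean conventional SE)."""
    rng = np.random.default_rng(seed)
    x = np.arange(n, dtype=float)          # assumed regressor: a time trend
    xc = x - x.mean()
    sxx = np.sum(xc**2)

    # AR(1) errors e[t] = rho * e[t-1] + u[t], started from the stationary distribution
    u = rng.standard_normal((reps, n))
    e = np.empty_like(u)
    e[:, 0] = u[:, 0] / np.sqrt(1.0 - rho**2)
    for t in range(1, n):
        e[:, t] = rho * e[:, t - 1] + u[:, t]

    y = e                                   # true intercept and slope are both 0
    b = (y @ xc) / sxx                      # least squares slopes
    resid = y - y.mean(axis=1, keepdims=True) - np.outer(b, xc)
    s2 = np.sum(resid**2, axis=1) / (n - 2)
    conventional_se = np.sqrt(s2 / sxx)     # formula that assumes independent errors
    return b.std(ddof=1), conventional_se.mean()

sd_indep, se_indep = simulate_slope_se(rho=0.0)
sd_ar, se_ar = simulate_slope_se(rho=0.7)
print("rho = 0.0: empirical SD / mean SE =", sd_indep / se_indep)
print("rho = 0.7: empirical SD / mean SE =", sd_ar / se_ar)
```

With independent errors the ratio should sit near 1; with `rho = 0.7` it should be well above 1, meaning the reported standard errors understate the real variability of the slope.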

Similarly, the direction of effect of heteroskedasticity depends on specific details of the nature of the departure.

If you have particular kinds of deviation from assumptions in mind you can investigate the impact on variances/standard errors and hence on things like p-values and $R^2$ very easily via the use of simulation (though in many cases you can also get a fair way with algebra).
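As one possible shape for such a simulation, the sketch below estimates how often the usual t-test for the slope rejects a true null hypothesis at a nominal 5% level under three assumed error-variance patterns. The function name `rejection_rate`, the three patterns, and the critical value 1.984 (approximately the two-sided 5% point of a t distribution with 98 degrees of freedom) are all choices made for this example.

```python
import numpy as np

def rejection_rate(sd_fn, n=100, reps=4000, crit=1.984, seed=1):
    """Share of replications where |t| for the slope exceeds crit when the true slope is 0."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=n)          # same design for every pattern (same seed)
    xc = x - x.mean()
    sxx = np.sum(xc**2)

    sd = sd_fn(x)                           # error SD for each observation
    y = rng.standard_normal((reps, n)) * sd # true intercept and slope are both 0

    b = (y @ xc) / sxx
    resid = y - y.mean(axis=1, keepdims=True) - np.outer(b, xc)
    s2 = np.sum(resid**2, axis=1) / (n - 2)
    t = b / np.sqrt(s2 / sxx)               # conventional t-ratio
    return np.mean(np.abs(t) > crit)

rate_const = rejection_rate(lambda x: np.ones_like(x))     # equal variances
rate_inc = rejection_rate(lambda x: np.abs(x))             # noisier far from the centre of x
rate_dec = rejection_rate(lambda x: 1.0 - np.abs(x))       # noisier near the centre of x

print("constant variance:          ", rate_const)
print("variance larger at extremes:", rate_inc)
print("variance larger in centre:  ", rate_dec)
```

Under these assumptions the first rate should be near 0.05, the second clearly above it, and the third clearly below it, which illustrates that the direction of the effect depends on the form of the heteroskedasticity.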

(Just as a general piece of advice, you may notice that many of your questions are directly answered in the relevant stats articles on Wikipedia. It would be worth your time to read through these articles and some of the articles they link to.)