# Why Normality assumption in linear regression

My question is very simple: why do we choose the normal as the distribution the error term follows in the assumptions of linear regression? Why don't we choose others, like the uniform, the t, or whatever?

We do choose other error distributions, and in many cases you can do so fairly easily; if you are using maximum likelihood estimation, changing the error distribution changes the loss function. This is certainly done in practice.

Laplace (double-exponential) errors correspond to least absolute deviations regression / $L_1$ regression (which numerous posts on site discuss). Regressions with t-errors are occasionally used (in some cases because they're more robust to gross errors), though they can have a disadvantage: the likelihood (and therefore the negative of the loss) can have multiple modes.
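As a minimal sketch of the Laplace case: under double-exponential errors the negative log-likelihood is (up to constants) the sum of absolute residuals, so maximum likelihood is $L_1$ regression. The data and variable names below are illustrative, not from any particular source.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data with Laplace (double-exponential) errors
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.laplace(scale=1.0, size=50)

def l1_loss(beta):
    """Negative log-likelihood under Laplace errors, up to constants:
    the sum of absolute residuals (the L1 loss)."""
    a, b = beta
    return np.abs(y - (a + b * x)).sum()

# Nelder-Mead copes with the non-smooth L1 objective
res = minimize(l1_loss, x0=[0.0, 0.0], method="Nelder-Mead")
a_hat, b_hat = res.x
```

The fitted slope and intercept should land near the true values (0.5 and 2.0), but the point is the loss function: swapping the assumed error density swapped the sum of squared residuals for a sum of absolute residuals.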

Uniform errors correspond to an $L_\infty$ loss (minimize the maximum deviation); such regression is sometimes called Chebyshev approximation (though beware, since there's another thing with essentially the same name). Again, this is sometimes done (indeed, for simple regression and smallish data sets with bounded errors of constant spread the fit is often easy enough to find by hand, directly on a plot, though in practice you can use linear programming methods, or other algorithms; indeed, $L_\infty$ and $L_1$ regression problems are duals of each other, which can sometimes lead to convenient shortcuts for some problems).
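To illustrate the linear-programming route mentioned above: the minimax fit solves "minimize $t$ subject to $|y_i - (a + bx_i)| \le t$", which is linear in $(a, b, t)$. A sketch using `scipy.optimize.linprog`, with made-up data:

```python
import numpy as np
from scipy.optimize import linprog

# Simulated data with bounded uniform errors
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 1.0 + 0.8 * x + rng.uniform(-0.5, 0.5, 30)

# Variables: intercept a, slope b, band half-width t.
# Minimize t subject to |y_i - (a + b x_i)| <= t for all i,
# written as two one-sided linear constraints per point.
n = len(x)
c = [0.0, 0.0, 1.0]
A_ub = np.vstack([
    np.column_stack([np.ones(n), x, -np.ones(n)]),    #  a + b*x - t <= y
    np.column_stack([-np.ones(n), -x, -np.ones(n)]),  # -a - b*x - t <= -y
])
b_ub = np.concatenate([y, -y])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None), (None, None), (0, None)])
a_hat, b_hat, half_width = res.x
```

The optimal `half_width` is the half-width of the narrowest band containing all the points, and the line $(a, b)$ runs down its center, exactly the hand construction described below.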

In fact, here’s an example of a “uniform error” model fitted to data by hand:

It’s easy to identify (by sliding a straightedge toward the data) that the four marked points are the only candidates for being in the active set; three of them will actually form the active set (and a little checking soon identifies which three lead to the narrowest band that encompasses all the data). The line at the center of that band (marked in red) is then the maximum likelihood estimate of the line.

Many other choices of model are possible and quite a few have been used in practice.

Note that if you have additive, independent, constant-spread errors with a density of the form $k\exp(-c\,g(\varepsilon))$, maximizing the likelihood will correspond to minimizing $\sum_i g(e_i)$, where $e_i$ is the $i$th residual.
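To see why, write out the log-likelihood for a sample of $n$ observations with residuals $e_1, \dots, e_n$:

$$\log L = \sum_{i=1}^n \log\left[k\exp(-c\,g(e_i))\right] = n\log k - c\sum_{i=1}^n g(e_i).$$

Since $n\log k$ is constant and $c > 0$, maximizing the likelihood is the same as minimizing $\sum_i g(e_i)$. Taking $g(\varepsilon) = \varepsilon^2$ gives least squares (normal errors); $g(\varepsilon) = |\varepsilon|$ gives $L_1$ regression (Laplace errors).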

However, there are a variety of reasons that least squares is a popular choice, many of which don’t require any assumption of normality.