Bias towards natural numbers in the case of least squares

Why do we seek to minimize $x^2$ instead of minimizing $|x|^{1.95}$ or $|x|^{2.05}$?
Are there reasons why the exponent should be exactly two, or is it simply a convention that has the advantage of simplifying the math?

Answer

This question is quite old, but I have an answer that doesn’t appear here, and one that gives a compelling reason why (under some reasonable assumptions) squared error is correct, while any other power is incorrect.

Say we have some data $D = \langle(\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\ldots,(\mathbf{x}_n,y_n)\rangle$ and want to find the linear (or whatever) function $f$ that best predicts the data, in the sense that the probability density $p_f(D)$ of observing this data is maximal with respect to $f$ (this is called maximum likelihood estimation). If we assume that each $y_i$ is given by $f(\mathbf{x}_i)$ plus an independent, normally distributed error term with mean zero and standard deviation $\sigma$, then
$$p_f(D) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(y_i - f(\mathbf{x}_i))^2}{2\sigma^2}}.$$
This is equivalent to
$$\frac{1}{\sigma^n(2\pi)^{n/2}}e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2}.$$
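The factor in front of the exponential does not depend on $f$, and the exponential decreases as the sum in its exponent grows. So maximizing $p_f(D)$ is accomplished by minimizing $\sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2$, that is, the sum of the squared error terms.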
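
As a quick numerical sanity check of this equivalence, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the straight-line model $f(x) = ax + b$, the synthetic data, the noise level `sigma`, and the helper names. It fits the line by ordinary least squares and then confirms that moving the parameters away from that fit lowers the Gaussian log-likelihood.

```python
import math
import random


def sum_squared_errors(a, b, xs, ys):
    """Sum of squared residuals for the line f(x) = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def gaussian_log_likelihood(a, b, xs, ys, sigma):
    """log p_f(D) under independent N(0, sigma^2) errors."""
    n = len(xs)
    const = -n * math.log(sigma * math.sqrt(2 * math.pi))
    return const - sum_squared_errors(a, b, xs, ys) / (2 * sigma ** 2)


def least_squares_line(xs, ys):
    """Closed-form least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Synthetic data (hypothetical): a line plus Gaussian noise.
random.seed(0)
sigma = 0.5
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, sigma) for x in xs]

a_hat, b_hat = least_squares_line(xs, ys)
best_ll = gaussian_log_likelihood(a_hat, b_hat, xs, ys, sigma)

# Perturbing the least-squares fit in any direction lowers the likelihood.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.05, -0.05)]:
    perturbed_ll = gaussian_log_likelihood(a_hat + da, b_hat + db, xs, ys, sigma)
    assert perturbed_ll < best_ll

print("least-squares fit:", a_hat, b_hat)
print("log-likelihood at the fit:", best_ll)
```

The log-likelihood is a constant minus the sum of squared errors divided by $2\sigma^2$, so the two objectives rank every candidate $f$ in exactly the opposite order, whatever the value of $\sigma$.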

Attribution
Source: Link, Question Author: Christian, Answer Author: Community
