Why do we seek to minimize `x^2` instead of minimizing `|x|^1.95` or `|x|^2.05`?

Are there reasons why the exponent should be exactly two, or is it simply a convention that has the advantage of simplifying the math?

**Answer**

This question is quite old, but I have an answer that doesn't appear here, and one that gives a compelling reason why (under some reasonable assumptions) squared error is correct, while any other power is incorrect.

Say we have some data $D = \langle(\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\dots,(\mathbf{x}_n,y_n)\rangle$ and want to find the linear (or whatever) function $f$ that best predicts the data, in the sense that the probability density $p_f(D)$ of observing this data should be maximal with respect to $f$ (this is called *maximum likelihood estimation*). If we assume that the data are given by $f$ plus a normally distributed error term with standard deviation $\sigma$, then

$$p_f(D) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(y_i - f(\mathbf{x}_i))^2}{2\sigma^2}}.$$

This is equivalent to

$$\frac{1}{\sigma^n(2\pi)^{n/2}}e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2}.$$

Since the exponential is a decreasing function of the sum in the exponent, maximizing $p_f(D)$ is accomplished by minimizing $\sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2$, that is, the sum of the squared error terms.
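As a small numeric sanity check of this equivalence, here is a sketch (the data, the constant model, and the `best_constant` helper are hypothetical illustrations, not from the original answer): for a constant model $f(\mathbf{x}) = c$ with Gaussian noise, minimizing $\sum_i |y_i - c|^p$ with $p = 2$ recovers the sample mean, which is exactly the Gaussian maximum-likelihood estimate, while $p = 1$ recovers the median instead.

```python
# Sketch: minimize sum_i |y_i - c|^p numerically for different exponents p
# and compare against the closed-form Gaussian MLE (the sample mean).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
# Hypothetical data: a constant "true" value plus Gaussian noise.
y = rng.normal(loc=3.0, scale=1.0, size=1000)

def best_constant(p):
    """The c minimizing sum_i |y_i - c|^p, found numerically."""
    return minimize_scalar(lambda c: np.sum(np.abs(y - c) ** p),
                           bracket=(y.min(), y.max())).x

# p = 2 matches the sample mean (the Gaussian MLE); p = 1 matches the
# median (the MLE under Laplace noise); exponents like 1.95 or 2.05
# land on estimates that are neither.
print(best_constant(2.0), y.mean())
print(best_constant(1.0), np.median(y))
```

The point is that the exponent 2 is not arbitrary: it is tied to the Gaussian noise assumption, and a different noise model would single out a different exponent.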

**Attribution**
*Source: Link, Question Author: Christian, Answer Author: Community*