Very often when I investigate new statistical methods and concepts, I run into the squared difference (or the mean squared error, or a plethora of other epithets). Just as an example, Pearson’s r is determined by the mean squared difference of the points from the regression line. For ANOVA, you’re looking at sums of squares, and so on.
Now, I understand that by squaring everything, you make sure that data with outliers really get penalized. However, why is the exponent exactly 2? Why not 2.1, or e, or pi, or whatever? Is there some special reason why 2 is used, or is it just a convention? I suspect that the explanation might have something to do with the bell curve, but I’m not quite sure.
Answer
A decision-theoretic approach to statistics provides a deep explanation. It says that squaring differences is a proxy for a wide range of loss functions which (whenever they might be justifiably adopted) lead to considerable simplification in the possible statistical procedures one has to consider.
Unfortunately, explaining what this means and indicating why it is true takes a lot of setting up. The notation can quickly become incomprehensible. What I aim to do here, then, is just to sketch the main ideas, with little elaboration. For fuller accounts see the references.
A standard, rich model of data \mathbf x posits that they are a realization of a (real, vector-valued) random variable \mathbf X whose distribution F is known only to be an element of some set \Omega of distributions, the states of nature. A statistical procedure is a function t of \mathbf x taking values in some set of decisions D, the decision space.
For instance, in a prediction or classification problem \mathbf x would consist of the union of a “training set” and a “test set” of data, and t would map \mathbf x into a set of predicted values for the test set. The set of all possible predicted values would be D.
A full theoretical discussion of procedures has to accommodate randomized procedures. A randomized procedure chooses among two or more possible decisions according to some probability distribution (which depends on the data \mathbf x). It generalizes the intuitive idea that when the data do not seem to distinguish between two alternatives, you subsequently “flip a coin” to decide on a definite alternative. Many people dislike randomized procedures, objecting to making decisions in such an unpredictable manner.
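The “coin flip” idea can be made concrete with a toy sketch. The decision rule and the tie-breaking band below are invented purely for illustration:

```python
import random

# A sketch of a randomized procedure: when the data do not distinguish the
# two alternatives (here: a statistic x falling inside a small "tie band",
# an invented device for illustration), pick a decision at random.
def randomized_rule(x, tie_band=0.1, rng=random.Random(0)):
    if abs(x) < tie_band:                        # data don't favor either side
        return rng.choice(["accept", "reject"])  # "flip a coin"
    return "accept" if x < 0 else "reject"

print(randomized_rule(-1.0))   # clear-cut data: a deterministic decision
print(randomized_rule(0.05))   # ambiguous data: decided by the coin flip
```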
The distinguishing feature of decision theory is its use of a loss function W. For any state of nature F \in \Omega and decision d \in D, the loss
W(F,d)
is a numeric value representing how “bad” it would be to make decision d when the true state of nature is F: small losses are good, large losses are bad. In a hypothesis testing situation, for instance, D has the two elements “accept” and “reject” (the null hypothesis). The loss function emphasizes making the right decision: it is set to zero when the decision is correct and otherwise is some constant w. (This is called a “0-1 loss function”: all bad decisions are equally bad and all good decisions are equally good.) Specifically, W(F,\text{ accept})=0 when F is in the null hypothesis and W(F,\text{ reject})=0 when F is in the alternative hypothesis.
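In code, a 0-1 loss is just a comparison. This minimal sketch assumes a hypothetical test of H0: \mu = 0, chosen only to make the idea concrete:

```python
# A sketch of a 0-1 loss for a test of H0: mu = 0 (the hypothesis and the
# constant w are illustrative choices, not from any particular problem).
def zero_one_loss(mu_true, decision, w=1.0):
    """Return 0 for a correct decision, w for an incorrect one."""
    null_is_true = (mu_true == 0)
    correct = (decision == "accept") if null_is_true else (decision == "reject")
    return 0.0 if correct else w

print(zero_one_loss(0.0, "accept"))  # 0.0: correct acceptance
print(zero_one_loss(0.0, "reject"))  # 1.0: Type I error
print(zero_one_loss(1.5, "accept"))  # 1.0: Type II error
```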
When using procedure t, the loss for the data x when the true state of nature is F can be written W(F, t(x)). This makes the loss W(F, t(X)) a random variable whose distribution is determined by (the unknown) F.
The expected loss of a procedure t is called its risk, r_t. The expectation uses the true state of nature F, which therefore will appear explicitly as a subscript of the expectation operator. We will view the risk as a function of F and emphasize that with the notation:
r_t(F) = \mathbb{E}_F(W(F, t(X))).
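To make the definition concrete, here is a minimal Monte Carlo sketch of the risk. All choices are illustrative: quadratic loss, the sample mean as the procedure t, and F taken to be Normal with unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(t, mu, loss, n=10, reps=100_000):
    """Monte Carlo estimate of r_t(F) = E_F[W(F, t(X))] when F is N(mu, 1)."""
    X = rng.normal(mu, 1.0, size=(reps, n))   # reps independent samples of size n
    return loss(mu, t(X)).mean()              # average loss over the draws

sq_loss = lambda mu, d: (d - mu) ** 2         # quadratic loss W(mu, d)
t_mean = lambda X: X.mean(axis=1)             # the procedure: the sample mean

# For the sample mean of n iid N(mu, 1) draws, the true risk is 1/n for every mu.
print(risk(t_mean, mu=3.0, loss=sq_loss))     # ≈ 0.1 when n = 10
```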
Better procedures have lower risk. Thus, comparing risk functions is the basis for selecting good statistical procedures. Since rescaling all risk functions by a common (positive) constant would not change any comparisons, the scale of W makes no difference: we are free to multiply it by any positive value we like. In particular, upon multiplying W by 1/w we may always take w=1 for a 0-1 loss function (justifying its name).
To continue the hypothesis testing example, which illustrates a 0-1 loss function, these definitions imply the risk of any F in the null hypothesis is the chance that the decision is “reject,” while the risk of any F in the alternative is the chance that the decision is “accept.” The maximum value (over all F in the null hypothesis) is the test size, while the part of the risk function defined on the alternative hypothesis is the complement of the test power (\text{power}_t(F) = 1 - r_t(F)). In this way we see how the entirety of classical (frequentist) hypothesis testing theory amounts to a particular way to compare risk functions for a special kind of loss.
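This correspondence between risk and size/power can be checked by simulation. A sketch, assuming (purely for illustration) a one-sided 5% z-test of H0: \mu \le 0 with known unit variance and n = 25:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, z_crit = 25, 200_000, 1.6449   # one-sided 5% z-test (assumed setup)

def reject_rate(mu):
    """Chance the test rejects when the data are iid N(mu, 1)."""
    xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    return (np.sqrt(n) * xbar > z_crit).mean()

# Under 0-1 loss, the risk at the null boundary mu = 0 is the rejection
# chance (the size), and the risk at an alternative mu is the acceptance
# chance, i.e. one minus the power.
size = reject_rate(0.0)    # ≈ 0.05
power = reject_rate(0.5)   # ≈ 0.80 for this alternative and sample size
print(round(size, 3), round(power, 3))
```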
By the way, everything presented so far is perfectly compatible with all mainstream statistics, including the Bayesian paradigm. In addition, Bayesian analysis introduces a “prior” probability distribution over \Omega and uses this to simplify the comparison of risk functions: the potentially complicated function r_t can be replaced by its expected value with respect to the prior distribution. Thus all procedures t are characterized by a single number r_t; a Bayes procedure (which usually is unique) minimizes r_t. The loss function still plays an essential role in computing r_t.
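A sketch of this Bayesian simplification, under assumed choices of a N(0, \tau^2) prior on a Normal mean \mu and squared loss: each procedure's risk function collapses to one number, and the Bayes (shrinkage) procedure attains a smaller value of it than the sample mean does:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, tau = 10, 200_000, 1.0

# Monte Carlo Bayes risk: draw mu from the N(0, tau^2) prior, then the
# sampling distribution of x-bar given mu, and average the squared loss
# over both. (Prior and shrinkage factor are illustrative choices.)
mu = rng.normal(0.0, tau, size=reps)
xbar = rng.normal(mu, 1.0 / np.sqrt(n))       # x-bar | mu ~ N(mu, 1/n)
shrink = n * tau**2 / (n * tau**2 + 1)        # the Bayes estimator's factor

bayes_risk_mean = ((xbar - mu) ** 2).mean()            # ≈ 1/n = 0.100
bayes_risk_bayes = ((shrink * xbar - mu) ** 2).mean()  # ≈ 1/(n + 1/tau^2) ≈ 0.091
print(round(bayes_risk_mean, 3), round(bayes_risk_bayes, 3))
```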
There is some (unavoidable) controversy surrounding the use of loss functions. How does one pick W? It is essentially unique for hypothesis testing, but in most other statistical settings many choices are possible. They reflect the values of the decision-maker. For example, if the data are physiological measurements of a medical patient and the decisions are “treat” or “do not treat,” the physician must consider, and weigh in the balance, the consequences of either action. How the consequences are weighed may depend on the patient’s own wishes, their age, their quality of life, and many other things. Choice of a loss function can be fraught and deeply personal. Normally it should not be left to the statistician!
One thing we would like to know, then, is how the choice of best procedure would change when the loss is changed. It turns out that in many common, practical situations a certain amount of variation can be tolerated without changing which procedure is best. These situations are characterized by the following conditions:

1. The decision space is a convex set (often an interval of numbers). This means that any value lying between any two decisions is also a valid decision.

2. The loss is zero when the best possible decision is made and otherwise increases (to reflect discrepancies between the decision that is made and the best one that could be made for the true, but unknown, state of nature).

3. The loss is a differentiable function of the decision (at least locally near the best decision). This implies it is continuous (it does not jump the way a 0-1 loss does), but it also implies that it changes relatively little when the decision is close to the best one.
When these conditions hold, some complications involved in comparing risk functions go away. The differentiability and convexity of W allow us to apply Jensen’s Inequality to show that
(1) We don’t have to consider randomized procedures [Lehmann, corollary 6.2].
(2) If one procedure t is considered to have the best risk for one such W, it can be improved into a procedure t^{*} which depends only on a sufficient statistic and has at least as good a risk function for all such W [Kiefer, p. 151].
As an example, suppose \Omega is the set of Normal distributions with mean \mu (and unit variance). This identifies \Omega with the set of all real numbers, so (abusing notation) I will also use “\mu” to identify the distribution in \Omega with mean \mu. Let X be an iid sample of size n from one of these distributions. Suppose the objective is to estimate \mu. This identifies the decision space D with all possible values of \mu (any real number). Letting \hat\mu designate an arbitrary decision, the loss is a function
W(\mu, \hat\mu) \ge 0
with W(\mu, \hat\mu)=0 if and only if \mu=\hat\mu. The preceding assumptions imply (via Taylor’s Theorem) that
W(\mu, \hat\mu) = w_2 (\hat\mu - \mu)^2 + o\left((\hat\mu - \mu)^2\right)

for some positive constant w_2. (The little-o notation “o(y^2)” means any function f for which f(y)/y^2 \to 0 as y \to 0.) As previously noted, we are free to rescale W to make w_2=1. For this family \Omega, the mean of X, written \bar X, is a sufficient statistic. The previous result (quoted from Kiefer) says any estimator of \mu that is good for one such W, even though it could be an arbitrary function of the n variables (x_1, \ldots, x_n), can be converted into an estimator depending only on \bar x which is at least as good for all such W.
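The improvement can be seen numerically. In this sketch (the sample size and the true \mu are arbitrary choices), an estimator that uses only the first observation is compared, under quadratic loss, with the corresponding estimator based on the sufficient statistic \bar x:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, n, reps = 2.0, 10, 100_000
X = rng.normal(mu, 1.0, size=(reps, n))

# An arbitrary estimator of mu that uses only the first observation...
risk_first = ((X[:, 0] - mu) ** 2).mean()        # risk ≈ 1
# ...versus its improvement, which depends only on the sufficient statistic
# x-bar (here the conditional expectation of x_1 given x-bar IS x-bar,
# a Rao-Blackwell-style step):
risk_mean = ((X.mean(axis=1) - mu) ** 2).mean()  # risk ≈ 1/n = 0.1

print(round(risk_first, 3), round(risk_mean, 3))
```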
What has been accomplished in this example is typical: the hugely complicated set of possible procedures, which originally consisted of possibly randomized functions of n variables, has been reduced to a much simpler set of procedures consisting of nonrandomized functions of a single variable (or at least of fewer variables, in cases where sufficient statistics are multivariate). And this can be done without worrying about precisely what the decision-maker’s loss function is, provided only that it is convex and differentiable.
What is the simplest such loss function? The one that ignores the remainder term, of course, making it purely a quadratic function. Other loss functions in this same class include powers of z = \hat\mu - \mu that are greater than 2 (such as the 2.1, e, and \pi mentioned in the question), \exp(z) - 1 - z, and many more.
The blue (upper) curve plots 2(\exp(z) - 1 - z) while the red (lower) curve plots z^2. Because the blue curve also has a minimum at 0, is differentiable, and is convex, many of the nice properties of statistical procedures enjoyed by quadratic loss (the red curve) will apply to the blue loss function, too (even though globally the exponential function behaves differently than the quadratic function).
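The agreement behind this picture can be checked numerically: the Taylor series of 2(\exp(z) - 1 - z) is z^2 + z^3/3 + \cdots, so the ratio of the two losses tends to 1 as z \to 0. A quick sketch:

```python
import math

# Near z = 0 the "blue" loss 2(exp(z) - 1 - z) matches the quadratic loss
# z^2 to second order; the ratio of the two approaches 1 as z shrinks.
ratios = {z: 2 * (math.exp(z) - 1 - z) / z**2 for z in (0.5, 0.1, 0.01)}
for z, r in ratios.items():
    print(z, round(r, 4))   # the ratio tends to 1 as z -> 0
```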
These results (although obviously limited by the conditions that were imposed) help explain why quadratic loss is ubiquitous in statistical theory and practice: to a limited extent, it is an analytically convenient proxy for any convex differentiable loss function.
Quadratic loss is by no means the only or even the best loss to consider. Indeed, Lehmann writes that
Convex loss functions have been seen to lead to a number of simplifications of estimation problems. One may wonder, however, whether such loss functions are likely to be realistic. If W(F, d) represents not just a measure of inaccuracy but a real (for example, financial) loss, one may argue that all such losses are bounded: once you have lost all, you cannot lose any more. …
… [F]ast-growing loss functions lead to estimators that tend to be sensitive to the assumptions made about [the] tail behavior [of the assumed distribution], and these assumptions typically are based on little information and thus are not very reliable.
It turns out that the estimators produced by squared error loss often are uncomfortably sensitive in this respect.
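Lehmann's warning is easy to reproduce in simulation. This sketch compares the squared-error risks of the sample mean and the sample median when the assumed Normal tails are replaced by Student-t tails with 2 degrees of freedom (an illustrative heavy-tailed alternative; the true center is 0 in both cases):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 20, 50_000

# Under Normal data the sample mean beats the sample median in squared-error
# risk, but changing only the tail assumption to t(2) makes the mean's risk
# blow up while the median's barely moves.
X_norm = rng.normal(0.0, 1.0, size=(reps, n))
X_t = rng.standard_t(2, size=(reps, n))

mse = lambda est: float((est ** 2).mean())
mse_mean_norm = mse(X_norm.mean(axis=1))       # ≈ 1/n = 0.05
mse_med_norm = mse(np.median(X_norm, axis=1))  # ≈ pi/(2n) ≈ 0.079
mse_mean_t = mse(X_t.mean(axis=1))             # large and unstable
mse_med_t = mse(np.median(X_t, axis=1))        # stays near 0.1

print(mse_mean_norm, mse_med_norm, mse_mean_t, mse_med_t)
```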
[Lehmann, Section 1.6; with some changes of notation.]
Considering alternative losses opens up a rich set of possibilities: quantile regression, M-estimators, robust statistics, and much more can all be framed in this decision-theoretic way and justified using alternative loss functions. For a simple example, see Percentile Loss Functions.
References
Jack Carl Kiefer, Introduction to Statistical Inference. Springer-Verlag, 1987.
E. L. Lehmann, Theory of Point Estimation. Wiley, 1983.
Attribution
Source: Link, Question Author: Speldosa, Answer Author: Nick Cox