L2 loss, together with L0 and L1 loss, is one of three very common “default” loss functions used when summarising a posterior by the point that minimises the posterior expected loss. One reason for this is perhaps that they are relatively easy to compute (at least for 1d-distributions): L0 results in the mode, L1 in the median and L2 in the mean. When teaching, I can come up with scenarios where L0 and L1 are reasonable loss functions (and not just “default”), but I’m struggling to come up with a scenario where L2 would be a reasonable loss function. So my question:
For pedagogical purposes, what would be an example of when L2 is a good loss function for computing a minimum posterior loss?
For L0 it is easy to come up with scenarios from betting. Say you have calculated a posterior over the total number of goals in an upcoming soccer game and you are going to make a bet where you win $$$ if you correctly guess the number of goals and lose otherwise. Then L0 is a reasonable loss function.
My L1 example is a bit contrived. You are meeting a friend who will arrive at one of many airports and then travel to you by car. The problem is that you don’t know which airport (and can’t call your friend because she is up in the air). Given a posterior over which airport she might land in, where is a good place to position yourself so that the distance between her and you will be small when she arrives? Here, the point that minimizes the expected L1 loss seems reasonable, under the simplifying assumption that her car will travel at constant speed directly to your location. That is, a one hour wait is twice as bad as a 30 min wait.
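To make the mode/median/mean claim concrete, here is a minimal sketch that brute-forces the guess with the smallest posterior expected loss under each of the three losses. The posterior over goals is made up for illustration, and the names `goals`, `post`, `expected_loss` and `best` are hypothetical helpers, not taken from any library.

```python
import numpy as np

# Hypothetical posterior over the total number of goals (made-up numbers).
goals = np.arange(7)  # possible outcomes 0..6
post = np.array([0.10, 0.30, 0.25, 0.15, 0.10, 0.06, 0.04])  # sums to 1

def expected_loss(guess, loss):
    """Posterior expected loss of a single point guess."""
    return float(np.sum(post * loss(guess, goals)))

losses = {
    "L0": lambda a, x: (a != x).astype(float),  # 1 if wrong, 0 if exactly right
    "L1": lambda a, x: np.abs(a - x),           # absolute error
    "L2": lambda a, x: (a - x) ** 2,            # squared error
}

# Candidate guesses 0.00, 0.01, ..., 6.00 (integers are represented exactly).
grid = np.arange(601) / 100.0

# For each loss, pick the candidate with the smallest expected loss.
best = {
    name: float(grid[np.argmin([expected_loss(a, f) for a in grid])])
    for name, f in losses.items()
}

# Reference summaries computed directly from the posterior.
mode = goals[np.argmax(post)]
median = goals[np.searchsorted(np.cumsum(post), 0.5)]
mean = float(np.sum(goals * post))

print(best, mode, median, mean)
```

With these made-up numbers the three minimisers should coincide with the mode (1), the median (2) and the mean (2.19) respectively, up to the 0.01 resolution of the grid.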
L2 is “easy.” It’s what you get by default if you do standard matrix methods like linear regression, SVD, etc. Until we had computers, L2 was the only game in town for a lot of problems, which is why everyone uses ANOVA, t-tests, etc. It’s also easier to get an exact answer using L2 loss with many fancier methods like Gaussian processes than it is to get an exact answer using other loss functions.
Relatedly, a 2nd-order Taylor approximation reproduces the L2 loss exactly, which isn’t the case for most loss functions (e.g. cross-entropy). This makes optimization easy with 2nd-order methods like Newton’s method. Lots of methods for dealing with other loss functions still use methods for L2 loss under the hood for the same reason (e.g. iteratively reweighted least squares, integrated nested Laplace approximations).
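To spell out the Taylor point for a single observation, take the squared loss of a guess a against a fixed target y (the symbols a, a_0 and y are notation introduced here for illustration). Expanding around any starting point a_0 leaves no remainder, because the third derivative is zero:

```latex
\ell(a) = (a - y)^2, \qquad \ell'(a) = 2(a - y), \qquad \ell''(a) = 2, \qquad \ell'''(a) = 0

\ell(a) = \ell(a_0) + \ell'(a_0)(a - a_0) + \tfrac{1}{2}\ell''(a_0)(a - a_0)^2
        = (a_0 - y)^2 + 2(a_0 - y)(a - a_0) + (a - a_0)^2

a_1 = a_0 - \frac{\ell'(a_0)}{\ell''(a_0)} = a_0 - (a_0 - y) = y
```

The last line is a single Newton step, which lands on the minimizer regardless of where it starts.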
L2 is closely related to Gaussian distributions, and the Central Limit Theorem makes Gaussian distributions common. If your data-generating process is (conditionally) Gaussian, then minimizing L2 loss gives the most efficient estimator.
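One way to see the link, assuming n independent observations y_1, ..., y_n from a Gaussian with mean μ and known variance σ² (notation introduced here for illustration), is to write out the negative log-likelihood:

```latex
-\log p(y_1, \dots, y_n \mid \mu, \sigma^2)
  = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \mu)^2
```

The first term does not depend on μ, so maximizing the likelihood over μ is the same as minimizing the summed L2 loss.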
L2 loss decomposes nicely, because of the law of total variance. That makes certain graphical models with latent variables especially easy to fit.
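As a sketch of what that decomposition looks like, write Y for the quantity of interest, Z for a latent variable and a for a point guess (all three symbols are notation introduced here, not part of the original answer):

```latex
\operatorname{Var}(Y) = \mathbb{E}\big[\operatorname{Var}(Y \mid Z)\big] + \operatorname{Var}\big(\mathbb{E}[Y \mid Z]\big)

\mathbb{E}\big[(Y - a)^2\big] = \operatorname{Var}(Y) + \big(\mathbb{E}[Y] - a\big)^2
```

The first identity is the law of total variance. The second splits the expected L2 loss into a part that no guess can remove and a part that vanishes at a = E[Y], which is also why L2 loss leads to the mean.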
L2 penalizes terrible predictions disproportionately. This can be good or bad, but it’s often pretty reasonable. An hour-long wait might be four times as bad as a 30-minute wait, on average, if it causes lots of people to miss their appointments.