# Why does shrinkage really work, what’s so special about 0?

There is already a post on this site discussing the same issue:
Why does shrinkage work?

But even though the answers are popular, I don’t believe the gist of the question is really addressed. It is pretty clear that introducing some bias in estimation brings a reduction in variance and may improve estimation quality. However:

1) Why is the damage done by introducing bias smaller than the gain in variance?

2) Why does it always work? For example, in the case of Ridge Regression: the existence theorem.

3) What’s so interesting about 0 (the origin)? Clearly we can shrink anywhere we like (e.g. the Stein estimator), but will it work as well as the origin?

4) Why do various universal coding schemes prefer a lower number of bits around the origin? Are these hypotheses simply more probable?

Answers with references to proven theorems or established results are expected.

1) Why is the damage done by introducing bias smaller than the gain in variance?

It doesn’t have to be smaller; it just usually is. Whether the tradeoff is worth it depends on the loss function. But the things we care about in real life are often similar to squared error (e.g. we care more about one big error than about two errors half its size).

As a counterexample, imagine that for college admissions we shrink people’s SAT scores a bit towards the mean SAT of their demographic (however defined). If done properly, this will reduce the variance and mean squared error of estimates of (some sort of) ability of the person while introducing bias. Most people would IMHO argue that such a tradeoff is unacceptable.
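The tradeoff in the example above can be sketched in a few lines of Python. All numbers here (the true-score distribution, the noise level, the shrinkage factor of 0.6) are illustrative assumptions, not real SAT figures: shrinking noisy scores toward the group mean lowers mean squared error against the latent "true" scores even though every shrunken estimate is biased, which is exactly why the objection is about fairness rather than accuracy.

```python
# Toy simulation: noisy "test scores" shrunk toward the group mean.
# All numbers (true-score distribution, noise level, shrinkage factor 0.6)
# are illustrative assumptions.
import random

random.seed(0)
n = 10_000
true_scores = [random.gauss(500, 100) for _ in range(n)]       # latent ability
observed = [t + random.gauss(0, 80) for t in true_scores]      # noisy measurement
group_mean = sum(observed) / n
shrunk = [group_mean + 0.6 * (x - group_mean) for x in observed]

def mse(est):
    return sum((e - t) ** 2 for e, t in zip(est, true_scores)) / n

print(mse(observed))   # raw scores: roughly the noise variance (80^2)
print(mse(shrunk))     # smaller, despite every estimate being biased
```

With these (assumed) variances the optimal shrinkage factor is near 0.6, but any factor moderately below 1 would still beat the raw scores under squared error.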

2) Why does it always work?

3) What’s so interesting about 0 (the origin)? Clearly we can shrink anywhere we like (e.g. the Stein estimator), but will it work as well as the origin?

I think this is because we usually shrink coefficients or effect estimates. There are reasons to believe most effects are not large (see e.g. Andrew Gelman’s take). One way to put it is that a world where everything influences everything else with a strong effect would be a violent, unpredictable world. Since our world is predictable enough to let us live long lives and build semi-stable civilizations, it follows that most effects are not large.

Since most effects are not large, it is useful to wrongly shrink the few really big ones while also correctly shrinking the loads of negligible effects.

I believe this is just a property of our world and you probably could construct self-consistent worlds where shrinkage isn’t practical (most likely by making mean-squared error an impractical loss function). It just doesn’t happen to be the world we live in.
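A toy simulation of such a world, under arbitrary assumptions chosen only for illustration (a 95%/5% mixture of tiny and large true effects, unit observation noise, and a flat 0.5 shrinkage factor), shows the claimed pattern: uniform shrinkage toward 0 hurts the few big effects but wins overall in total squared error.

```python
# Toy "world where most effects are small": 95% of true effects are near 0,
# 5% are large. The mixture weights and the flat 0.5 shrinkage factor are
# arbitrary assumptions for illustration.
import random

random.seed(1)
true_effects = [random.gauss(0, 0.1) if random.random() < 0.95 else random.gauss(0, 5)
                for _ in range(20_000)]
observed = [b + random.gauss(0, 1) for b in true_effects]   # unit-noise estimates
shrunk = [0.5 * x for x in observed]                        # shrink everything toward 0

def sse(est):
    return sum((e - b) ** 2 for e, b in zip(est, true_effects))

# Shrinkage wins overall despite doing worse on the rare large effects.
print(sse(shrunk) < sse(observed))
```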

On the other hand, when we think of shrinkage as a prior distribution in Bayesian analysis, there are cases where shrinkage to 0 is actively harmful in practice.

One example is the length scale in Gaussian processes, where 0 is problematic: the recommendation in Stan’s manual is to use a prior that puts negligible mass close to zero, i.e. effectively “shrinking” small values away from zero. Similarly, recommended priors for the dispersion of the negative binomial distribution effectively shrink away from zero. Last but not least, whenever the normal distribution is parametrized by precision (as in INLA), it is useful to use inverse-gamma or other prior distributions that shrink away from zero.
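A quick sketch of why such priors are "zero-avoiding" (the shape and scale values below are arbitrary, not a recommended choice): the inverse-gamma density contains a factor exp(-beta/x) that sends it to 0 extremely fast as x approaches 0 from above, so the prior puts essentially no mass near zero.

```python
# The inverse-gamma density has a factor exp(-beta/x) that crushes all mass
# near zero, making it a "zero-avoiding" prior. Shape/scale values here are
# arbitrary illustrations.
import math

def invgamma_pdf(x, alpha=2.0, beta=1.0):
    return (beta ** alpha / math.gamma(alpha)) * x ** (-alpha - 1) * math.exp(-beta / x)

print(invgamma_pdf(0.01))   # astronomically small: essentially no mass near zero
print(invgamma_pdf(0.5))    # comfortably positive away from zero
```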

4) Why do various universal coding schemes prefer a lower number of bits around the origin? Are these hypotheses simply more probable?

Now this is way out of my depth, but Wikipedia says that in a universal coding scheme we expect (by definition) $P(i) \geq P(i + 1)$ for all positive $i$, so this property seems to be a simple consequence of the definition and not specifically related to shrinkage (or am I missing something?).
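This can be made concrete with the Elias gamma code, a standard universal code for positive integers: its codeword for $i$ is $2\lfloor \log_2 i \rfloor + 1$ bits long, so smaller integers never receive longer codewords, matching the definitional requirement that $P(i) \geq P(i + 1)$.

```python
# Elias gamma code: a standard universal code for positive integers.
# Codeword length is 2*floor(log2(i)) + 1 bits, so smaller integers never get
# longer codewords, consistent with requiring P(i) >= P(i+1).
def elias_gamma_len(i: int) -> int:
    return 2 * i.bit_length() - 1   # floor(log2(i)) == i.bit_length() - 1

lengths = [elias_gamma_len(i) for i in range(1, 17)]
print(lengths)  # [1, 3, 3, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7, 9]
```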