This question follows up on this one: Why does ridge estimate become better than OLS by adding a constant to the diagonal?
Here is my question:
As far as I know, ridge regularization uses the ℓ2 norm (Euclidean distance). But why do we use the square of this norm? (A direct application of the ℓ2 norm would result in the square root of the sum of the squared betas.)
As a comparison, we don't do this for the lasso, which uses the ℓ1 norm to regularize. But there it is the "real" ℓ1 norm (just the sum of the absolute values of the betas, not the square of this sum).
Can someone help me clarify this?
There are lots of penalized approaches with all kinds of different penalty functions now (ridge, lasso, MCP, SCAD). The question of why a penalty takes a particular form basically comes down to "what advantages/disadvantages does such a penalty provide?".
Properties of interest might be:
1) Nearly unbiased estimators (note that all penalized estimators will be biased)
2) Sparsity (note that ridge regression does not produce sparse results, i.e. it does not shrink coefficients all the way to zero)
3) Continuity (to avoid instability in model prediction)
These are just a few of the properties one might want in a penalty function.
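To see the sparsity point (2) concretely, here is a minimal numerical sketch, not from the original answer. It assumes an orthonormal design, in which case both estimators act coordinate-wise on the OLS estimates: ridge rescales each coefficient by 1/(1+λ), while the lasso applies soft-thresholding. The specific numbers are made up for illustration.

```python
import numpy as np

# Hypothetical OLS estimates and penalty strength (illustrative values only).
b_ols = np.array([3.0, 0.5, -0.2, 1.5])
lam = 0.6

# Orthonormal-design closed forms (a standard textbook result):
#   ridge shrinks every coefficient toward zero but never exactly to zero;
#   lasso soft-thresholds, setting small coefficients exactly to zero.
b_ridge = b_ols / (1.0 + lam)
b_lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

print("ridge:", b_ridge)  # all four coefficients shrunk, all nonzero
print("lasso:", b_lasso)  # the two small coefficients are exactly zero
```

Under this setup, every ridge coefficient stays nonzero no matter how small the OLS estimate is, while the lasso zeroes out any coefficient whose magnitude falls below λ, which is exactly the sparsity contrast described above.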
It is also a lot easier to work with a sum in derivations and theoretical work: e.g. ‖β‖₂² = ∑ᵢ βᵢ² and ‖β‖₁ = ∑ᵢ |βᵢ|. Imagine if we had √(∑ᵢ βᵢ²) or (∑ᵢ |βᵢ|)² instead. Taking derivatives (which is necessary to show theoretical results like consistency, asymptotic normality, etc.) would be a pain with penalties like that.
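One concrete payoff of the squared ℓ2 penalty: the gradient of the ridge objective ‖y − Xβ‖² + λ‖β‖₂² is linear in β, so setting it to zero gives the closed-form solution β̂ = (XᵀX + λI)⁻¹Xᵀy. A small sketch (synthetic data, made up for illustration) that computes this and checks the gradient really vanishes at the solution:

```python
import numpy as np

# Synthetic regression problem (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=50)
lam = 1.0

# Gradient of ||y - Xb||^2 + lam * ||b||_2^2 w.r.t. b is
#   -2 X'(y - Xb) + 2*lam*b,
# which is linear in b, so the minimizer solves (X'X + lam*I) b = X'y.
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Verify the first-order condition holds at the closed-form solution.
grad = -2 * X.T @ (y - X @ b_ridge) + 2 * lam * b_ridge
print("max |gradient| at solution:", np.abs(grad).max())
```

With √(∑βᵢ²) or (∑|βᵢ|)² as the penalty, no such linear system exists: the gradient would involve the norm itself (and, for the ℓ1-based versions, nondifferentiable points), which is the "pain" the answer refers to.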