Why regularize all parameters in the same way?

My question relates to regularization in linear regression and logistic regression. I’m currently doing week 3 of Andrew Ng’s Machine Learning course on Coursera. I understand how overfitting can be a common problem, and I have some intuition for how regularization can reduce overfitting. My question is: can we improve our models by regularizing different parameters in different ways?


Example:

Let’s say we’re trying to fit w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4. This question is about why we penalize high w_1 values in the same way that we penalize high w_2 values.

If we know nothing about how our features (x_1, x_2, x_3, x_4) were constructed, it makes sense to treat them all in the same way when we do regularization: a high w_1 value should yield as much “penalty” as a high w_3 value.
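
For reference, the shared-penalty cost discussed in the course looks roughly like the following minimal NumPy sketch; the function and variable names are mine, and leaving the intercept w_0 out of the penalty follows the usual convention rather than anything stated in this post:

```python
import numpy as np

def ridge_cost(w, X, y, lam):
    """Mean squared error plus one shared L2 penalty (hypothetical helper).

    w   : parameters [w0, w1, w2, w3, w4]; w0 is the intercept
    X   : design matrix whose first column is all ones
    y   : target vector
    lam : a single regularization strength shared by every coefficient
    """
    m = len(y)
    residuals = X @ w - y
    fit_term = residuals @ residuals / (2 * m)
    # The intercept w0 is conventionally excluded; every other
    # coefficient is penalized identically with the same lambda.
    penalty = lam * np.sum(w[1:] ** 2) / (2 * m)
    return fit_term + penalty
```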

But let’s say we have additional information: let’s say we only had 2 features originally: x_1 and x_2. A line was underfitting our training set and we wanted a more squiggly-shaped decision boundary, so we constructed x_3 = x_1^2 and x_4 = x_2^3. Now we can have more complex models, but the more complex they get, the more we risk overfitting our model to the training data. So we want to strike a balance between minimizing the cost function and minimizing our model complexity. Well, the parameters that represent higher exponentials (x_3, x_4) are drastically increasing the complexity of our model. So shouldn’t we penalize more for high w_3, w_4 values than we penalize for high w_1, w_2 values?
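
The differential penalty the question proposes can be written down directly by replacing the single λ with one penalty weight per coefficient; a minimal sketch, with the specific weights chosen purely for illustration:

```python
import numpy as np

def weighted_ridge_cost(w, X, y, lam_vec):
    """Like the shared-penalty cost above, but with one strength per coefficient."""
    m = len(y)
    residuals = X @ w - y
    fit_term = residuals @ residuals / (2 * m)
    # lam_vec[j] scales the penalty on w[j + 1]; the intercept stays unpenalized.
    penalty = np.sum(lam_vec * w[1:] ** 2) / (2 * m)
    return fit_term + penalty

# Illustrative choice only: penalize the engineered terms x_3 = x_1^2 and
# x_4 = x_2^3 ten times harder than the original features x_1 and x_2.
lam_vec = np.array([1.0, 1.0, 10.0, 10.0])
```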

Answer

Well, the parameters that represent higher exponentials (x_3, x_4) are drastically increasing the complexity of our model. So shouldn’t we penalize more for high w_3, w_4 values than we penalize for high w_1, w_2 values?

The reason we say that adding quadratic or cubic terms increases model complexity is that it leads to a model with more parameters overall. We don’t expect a quadratic term to be in and of itself more complex than a linear term. The one thing that’s clear is that, all other things being equal, a model with more covariates is more complex.

For the purposes of regularization, one generally rescales all the covariates to have equal mean and variance so that, a priori, they are treated as equally important. If some covariates do in fact have a stronger relationship with the dependent variable than others, then, of course, the regularization procedure won’t penalize those covariates as strongly, because they’ll have greater contributions to the model fit.
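
Concretely, this usually amounts to “standardize the columns, then fit with one shared penalty.” A small scikit-learn sketch on made-up data (the data and the alpha value are illustrative, not from the answer):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: two raw features plus higher powers of them.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([x1, x2, x1 ** 2, x2 ** 3])
y = 1.5 * x1 - 2.0 * x2 + rng.normal(scale=0.1, size=100)

# Standardizing first puts every column on the same scale, so the single
# alpha in Ridge treats each rescaled coefficient equally a priori.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)
```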

But what if you really do think a priori that one covariate is more important than another, and you can quantify this belief, and you want the model to reflect it? Then what you probably want to do is use a Bayesian model and adjust the priors for the coefficients to match your preexisting belief. Not coincidentally, some familiar regularization procedures can be construed as special cases of Bayesian models. In particular, ridge regression is equivalent to a normal prior on the coefficients, and lasso regression is equivalent to a Laplacian prior.
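
To make that correspondence concrete: minimizing a ridge-style cost is the same as maximizing a posterior with independent normal priors on the coefficients, and giving a coefficient a tighter prior is the same as penalizing it more. A minimal NumPy sketch of that objective, with illustrative numbers that are not from the answer:

```python
import numpy as np

def neg_log_posterior(w, X, y, noise_sd, prior_sd):
    """Negative log posterior (up to constants) for a linear model with
    Gaussian noise and an independent normal prior on each coefficient."""
    residuals = X @ w - y
    neg_log_lik = np.sum(residuals ** 2) / (2 * noise_sd ** 2)
    # A smaller prior_sd[j] shrinks w[j] harder -- the Bayesian analogue
    # of giving that coefficient a larger ridge penalty.
    neg_log_prior = np.sum(w ** 2 / (2 * prior_sd ** 2))
    return neg_log_lik + neg_log_prior

# Illustrative priors: tighter on the higher-order terms, so w_3 and w_4
# are shrunk more strongly than w_1 and w_2 (intercept w_0 left nearly flat).
prior_sd = np.array([100.0, 1.0, 1.0, 0.3, 0.3])
```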

Attribution
Source: Link, Question Author: Atte Juvonen, Answer Author: Kodiologist
