Can anyone provide an intuitive view on why it is better to have smaller beta?
For LASSO I can understand that: there is a feature-selection component. Fewer features make the model simpler and therefore less likely to overfit.
However, for ridge, all the features (factors) are kept. Only the coefficient values are smaller (in the L2-norm sense). How does this make the model simpler?
Can anyone provide an intuitive view on this?
TL;DR – Same principle applies to both LASSO and Ridge
Fewer features make the model simpler and therefore less likely to overfit
The same intuition applies to ridge regression: we prevent the model from over-fitting the data, but instead of targeting small, potentially spurious coefficients (which LASSO reduces to exactly zero), we target the largest coefficients, which may be overstating the case for their respective variables.
The L2 penalty generally prevents the model from placing “too much” importance on any one variable, because large coefficients are penalized more than small ones.
This might not seem like it “simplifies” the model, but it does a similar task of preventing the model from over-fitting to the data at hand.
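To see why the L2 penalty discourages concentrating weight on one variable, here is a minimal illustration (toy numbers, not from the original post): two coefficient vectors with the same total weight, one concentrated and one spread out.

```python
def l2_penalty(betas):
    """Ridge's L2 penalty: the sum of squared coefficients."""
    return sum(b ** 2 for b in betas)

concentrated = [2.0, 0.0]  # all the weight on one variable
spread = [1.0, 1.0]        # same total weight, shared across two variables

print(l2_penalty(concentrated))  # 4.0
print(l2_penalty(spread))        # 2.0
```

Because squaring grows faster than linearly, the concentrated vector pays double the penalty for the same total weight, so the minimizer is nudged toward spreading importance across variables rather than betting heavily on any single one.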
An example to build intuition
Take a concrete example – you might be trying to predict hospital readmissions based on patient characteristics.
In this case, you might have a relatively rare variable (such as an uncommon disease) that happens to be highly correlated with readmission in your training set. In a dataset of 10,000 patients, you might only see this disease 10 times, with 9 readmissions (an extreme example, to be sure).
As a result, its coefficient might be massive relative to the coefficients of other variables. Because ridge minimizes the MSE plus the L2 penalty, this coefficient is a prime candidate to be "shrunk" towards a smaller value: the variable is rare (so shrinking it costs little in MSE) while its coefficient is extreme (so shrinking it saves a lot of penalty).
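A rough sketch of this effect in code (a hypothetical `ridge_coef` helper, using the single-predictor closed form `beta = sum(x*y) / (sum(x^2) + lam)` with the intercept ignored, and toy data mirroring the 10-in-10,000 example above):

```python
def ridge_coef(x, y, lam):
    """One-feature ridge solution: sum(x*y) / (sum(x^2) + lam).
    lam = 0 recovers the ordinary least-squares coefficient."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi ** 2 for xi in x) + lam)

# Rare-disease indicator: present for 10 of 10,000 patients, 9 readmitted.
x = [1.0] * 10 + [0.0] * 9990
y = [1.0] * 9 + [0.0] * 1 + [0.0] * 9990

print(ridge_coef(x, y, lam=0.0))   # 0.9  -- the "massive" unpenalized coefficient
print(ridge_coef(x, y, lam=10.0))  # 0.45 -- halved by a modest penalty
```

Because the indicator is rare, `sum(x^2)` is only 10, so even a modest `lam` is large relative to it and shrinks the coefficient dramatically; a common variable with `sum(x^2)` in the thousands would barely move under the same penalty.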