I already have some idea of the pros and cons of ridge regression and the LASSO.
For the LASSO, the L1 penalty term yields a sparse coefficient vector, which can be viewed as a feature selection method. However, the LASSO has some limitations. If several features are highly correlated, the LASSO tends to select only one of them. In addition, for problems where $p > n$, the LASSO will select at most $n$ parameters ($n$ and $p$ are the number of observations and parameters, respectively). These limitations often make the LASSO empirically suboptimal in terms of predictive performance compared to ridge regression.
Ridge regression offers better predictive performance in general. However, it is not as interpretable as the LASSO.
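For reference, the two estimators compared above are usually written as below. The notation ($y$ for the response vector, $X$ for the $n \times p$ feature matrix, $\beta$ for the coefficients, $\lambda \ge 0$ for the penalty strength) is assumed here for illustration and does not appear in the original post; scaling conventions for the loss term vary between textbooks.

$$
\hat\beta^{\text{ridge}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} \beta_j^2,
\qquad
\hat\beta^{\text{lasso}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|.
$$

The only difference is the penalty: squared coefficients are shrunk smoothly toward zero, while the absolute-value penalty can set coefficients exactly to zero.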
The above explanation can often be found in textbooks in machine learning/data mining. However, I am still confused about two things:
If we normalize the features (say to the range 0 to 1, or to zero mean and unit variance) and run ridge regression, we can still get an idea of feature importance by sorting the coefficients by absolute value (the most important feature has the largest absolute coefficient). Though we are not selecting features explicitly, interpretability isn’t lost using ridge regression. At the same time, we can still achieve high predictive power. Then why do we need the LASSO? Am I missing something here?
Is the LASSO preferred due to its feature selection nature? To my understanding, the reasons why we need feature selection are the ability to generalize and ease of computation.
For ease of computation, we don’t want to feed all 1 million features into our model if we are performing some NLP tasks, so we drop some obviously useless features first to reduce the computational cost. However, for the LASSO, we can only know the feature selection result (the sparse vector) after we feed all the data into our model, so we don’t benefit from the LASSO in terms of reducing computational cost. We can only make prediction a little faster as now we only feed the subset of features (say 500 out of 1 million) into our model to generate predicted results.
If the LASSO is preferred for its ability to generalize, then we can also achieve the same goal using ridge regression (or any other kind of regularization). Why do we need the LASSO (or elastic nets) again? Why can’t we just stick to ridge regression?
Could someone please shed some light on this? Thanks!
If you order 1 million ridge-shrunk, scaled, but non-zero features, you will have to make some kind of decision: you will look at the k best predictors, but what is k? The LASSO solves this problem in a principled, objective way, because at every step on the regularization path (and often you would settle on one point via, e.g., cross-validation), only m coefficients are non-zero.
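To make this concrete, here is a minimal sketch on synthetic data, not a definitive implementation. Everything in it is an assumption for illustration: the data (10 standardized features, of which only the first two affect the response), the penalty values, and the helper names `lasso_cd` and `ridge`. The LASSO is fitted with a hand-rolled coordinate descent to keep the example dependency-light; in practice a library routine such as scikit-learn's `Lasso` or `LassoCV` would be the usual choice.

```python
import numpy as np


def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/(2n))*||y - Xb||^2 + alpha*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: leave feature j out of the current fit.
            r_j = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r_j / n
            # Soft-thresholding gives an exact zero when |rho| <= alpha.
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return b


def ridge(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# Synthetic data (assumed): only features 0 and 1 carry signal.
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit variance
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)
y = y - y.mean()

# Ridge shrinks every coefficient but leaves all of them non-zero,
# so a ranking by |coefficient| still needs a hand-picked cutoff.
ridge_coef = ridge(X, y, lam=10.0)
n_nonzero_ridge = int(np.count_nonzero(ridge_coef))

# Each point on the LASSO path comes with its own count of non-zero terms.
alphas = [4.0, 1.0, 0.5, 0.1, 0.001]
nonzero_per_alpha = {
    a: int(np.count_nonzero(lasso_cd(X, y, a))) for a in alphas
}

print("ridge non-zero coefficients:", n_nonzero_ridge)
for a, m in nonzero_per_alpha.items():
    print(f"lasso alpha={a}: {m} non-zero coefficients")
```

Choosing the penalty (for example by cross-validation) then fixes the number of selected features as a by-product, whereas the ridge ranking leaves the cutoff open.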
Very often, you will train a model on some data and then later apply it to data not yet collected. For example, you could fit your model on 50,000,000 emails and then use that model on every new email. True, you will fit it on the full feature set for the first 50,000,000 emails, but for every following email, you will deal with a much sparser, faster, and more memory-efficient model. You also won’t even need to collect the information for the dropped features, which may be hugely helpful if the features are expensive to extract, e.g. via genotyping.
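The prediction-time saving can be sketched as follows. The coefficient values, the intercept, the feature count, and the helper name `predict_sparse` are all hypothetical; the coefficient vector simply stands in for what an L1-penalized fit might return.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 1000

# Hypothetical fitted coefficients: only 5 of 1000 are non-zero,
# as an L1-penalized fit might produce.
coef = np.zeros(n_features)
support = np.array([3, 42, 137, 500, 999])
coef[support] = [1.5, -0.7, 2.0, 0.3, -1.1]
intercept = 0.25

# Keep only what is needed at prediction time.
selected = np.flatnonzero(coef)
sparse_coef = coef[selected]


def predict_sparse(x_selected):
    """Predict from the selected features only."""
    return intercept + x_selected @ sparse_coef


# The full feature vector is built here only to check equivalence;
# in deployment only the selected features would need to be collected.
x_new = rng.standard_normal(n_features)
full_pred = intercept + x_new @ coef
sparse_pred = predict_sparse(x_new[selected])

print(np.isclose(full_pred, sparse_pred), len(selected), "of", n_features)
```

The deployed model stores and evaluates only the selected coefficients, which is where the memory and data-collection savings come from.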
Another perspective on the L1/L2 problem, espoused by e.g. Andrew Gelman, is that you often have some intuition about what your problem may be like. In some circumstances, it is possible that reality is truly sparse. Maybe you have measured millions of genes, but it is plausible that only 30,000 of them actually determine dopamine metabolism. In such a situation, L1 arguably fits the problem better.
In other cases, reality may be dense. For example, in psychology, “everything correlates (to some degree) with everything” (Paul Meehl). Preferences for apples vs. oranges probably do correlate with political leanings somehow – and even with IQ. Regularization might still make sense here, but true zero effects should be rare, so L2 might be more appropriate.