I obtained three reduced models from an original full model using
- forward selection
- backward elimination
- L1 penalization technique (LASSO)
For the models obtained using forward selection/backward elimination, I obtained the cross validated estimate of prediction error using
R. For the model selected via LASSO, I used
The prediction error for LASSO was lower than the ones obtained for the others. So the model obtained via LASSO seems to be better in terms of its predictive capacity and variability. Is this a general phenomenon that always occurs, or is it problem specific? If it is general, what is the theoretical reasoning behind it?
The LASSO and forward/backward model selection both have strengths and limitations. No sweeping recommendation can be made. Simulation can always be used to explore the question for a particular setting.
Both can be understood in terms of dimensionality, with p the number of model parameters and n the number of observations. If you were able to fit models using backward model selection, you probably didn’t have p≫n. In that case, the “best fitting” model is the one using all parameters… when validated internally! This is simply a matter of overfitting.
Overfitting is remedied using split sample cross validation (CV) for model evaluation. Since you didn’t describe this, I assume you didn’t do it. Unlike stepwise model selection, LASSO uses a tuning parameter to penalize the size of the coefficients (their L1 norm), which in turn limits the number of parameters in the model. You can fix the tuning parameter, or use a more complicated iterative process to choose its value. In practice, LASSO is usually run the latter way: the value is chosen with CV so as to minimize the MSE of prediction. I am not aware of any implementation of stepwise model selection that uses such sophisticated techniques; even the BIC as a criterion would suffer from internal validation bias. By my account, that automatically gives LASSO an advantage over “out-of-the-box” stepwise model selection.
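To make the role of the tuning parameter concrete, here is the usual way the LASSO estimate and the CV choice of the penalty are written. The symbols are my own notation, not the asker's: λ is the tuning parameter, K the number of folds, F_k the k-th fold, and the 1/(2n) scaling is one common convention (other software scales the loss differently).

$$
\hat\beta(\lambda) = \arg\min_{\beta} \; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1,
\qquad
\hat\lambda = \arg\min_{\lambda} \; \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|F_k|} \sum_{i \in F_k} \bigl( y_i - x_i^\top \hat\beta^{(-k)}(\lambda) \bigr)^2
$$

Here \hat\beta^{(-k)}(\lambda) is the LASSO fit with fold k held out. Larger λ shrinks more coefficients exactly to zero, which is how the penalty ends up controlling the number of parameters.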
Lastly, stepwise model selection can have different criteria for including/excluding different regressors. If you use the p-values for the specific model parameters’ Wald test or the resultant model R^2, you will not do well, mostly because of internal validation bias (again, could be remedied with CV). I find it surprising that this is still the way such models tend to be implemented. AIC or BIC are much better criteria for model selection.
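As a rough illustration of what "remedied with CV" could look like, here is a minimal sketch in Python with scikit-learn (not the R code from the question, which I haven't seen). The data are simulated, and the fixed choice of 5 selected features is an arbitrary assumption. The point is only that the selection step sits inside the pipeline, so it is redone within every outer fold and both methods are scored on data they never saw.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data (an assumption for illustration): 20 predictors, 5 informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Outer folds are used only to estimate prediction error.
outer = KFold(n_splits=5, shuffle=True, random_state=0)

# LASSO: the penalty is tuned by an inner 5-fold CV on each training split.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))

# Forward selection: features are chosen by inner CV score on each training
# split, then an ordinary least squares model is refit on the chosen subset.
# n_features_to_select=5 is arbitrary here; it could itself be tuned.
forward = make_pipeline(
    StandardScaler(),
    SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                              direction="forward", cv=5),
    LinearRegression(),
)

def cv_mse(model):
    """Mean squared prediction error averaged over the outer folds."""
    scores = cross_val_score(model, X, y, cv=outer,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

results = {"lasso": cv_mse(lasso), "forward": cv_mse(forward)}
print(results)
```

Which method comes out ahead depends on the simulated data, so I would not read anything general into a single run.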
There are a number of problems with each method. Stepwise model selection’s problems are much better understood, and far worse than those of LASSO. The main problem I see with your question is that you are using feature selection tools to evaluate prediction. They are distinct tasks. LASSO is better for feature selection or sparse model selection. Ridge regression may give better prediction since it uses all variables.
LASSO’s great strength is that it can estimate models in which p≫n, as can forward (but not backward) stepwise regression. In both cases, these models can be effective for prediction only when there is a handful of very powerful predictors. If an outcome is better predicted by many weak predictors, then ridge regression or bagging/boosting will outperform both forward stepwise regression and LASSO by a long shot. LASSO is much faster than forward stepwise regression.
There is obviously a great deal of overlap between feature selection and prediction, but I would never tell you how well a wrench serves as a hammer. In general, for prediction with a small number of nonzero model coefficients and p≫n, I would prefer LASSO over forward stepwise model selection.