Why is best subset selection not favored in comparison to lasso?

I’m reading about best subset selection in *The Elements of Statistical Learning*.
If I have 3 predictors $x_1,x_2,x_3$, I create $2^3=8$ subsets:

  1. Subset with no predictors
  2. Subset with predictor $x_1$
  3. Subset with predictor $x_2$
  4. Subset with predictor $x_3$
  5. Subset with predictors $x_1,x_2$
  6. Subset with predictors $x_1,x_3$
  7. Subset with predictors $x_2,x_3$
  8. Subset with predictors $x_1,x_2,x_3$

Then I test all these models on the test data to choose the best one.
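For concreteness, here is a minimal sketch of this procedure on synthetic data (plain NumPy OLS fits, a single held-out test set, and variable names of my own choosing):

```python
# Minimal sketch of exhaustive best subset selection for p = 3 predictors:
# fit OLS on every subset and keep the one with the lowest test MSE.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0])   # the second predictor's true coefficient is zero
y = X @ beta_true + rng.normal(scale=1.0, size=n)

X_train, X_test = X[:70], X[70:]
y_train, y_test = y[:70], y[70:]

best_subset, best_mse = None, np.inf
for k in range(p + 1):                    # subset sizes 0, 1, 2, 3
    for subset in combinations(range(p), k):
        cols = list(subset)
        # Intercept-only design if the subset is empty.
        A_train = np.column_stack([np.ones(len(X_train))] + [X_train[:, j] for j in cols])
        A_test = np.column_stack([np.ones(len(X_test))] + [X_test[:, j] for j in cols])
        coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
        mse = np.mean((y_test - A_test @ coef) ** 2)
        if mse < best_mse:
            best_subset, best_mse = subset, mse

print("chosen subset:", best_subset, "test MSE:", round(best_mse, 3))
```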

Now my question is: why is best subset selection not favored in comparison to, e.g., the lasso?

If I compare the thresholding functions of best subset and lasso, I see that best subset sets some of the coefficients to zero, just like the lasso.
But the other coefficients (the non-zero ones) still take their OLS values, so they are unbiased. With the lasso, on the other hand, some of the coefficients are set to zero and the others (the non-zero ones) carry some bias.
The figure below shows it better:
[Figure: thresholding functions of best subset (hard thresholding) and lasso (soft thresholding) plotted against the OLS estimate]

In the picture, part of the red line in the best subset case lies on top of the gray one; the other part lies on the x-axis, where the coefficients are set to zero. The gray line marks the unbiased (OLS) solutions. In the lasso, some bias is introduced by $\lambda$. From this figure it looks like best subset is better than the lasso! So what are the disadvantages of using best subset?
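Writing out the thresholding rules for the orthonormal-design case (the comparison ESL gives in Table 3.4, if I remember the table number correctly, with $\hat\beta_{(M)}$ the $M$-th largest OLS coefficient in absolute value) makes the contrast in the figure explicit:

$$
\hat\beta_j^{\text{best subset}} \;=\; \hat\beta_j \cdot \mathbf{1}\!\left[\,|\hat\beta_j| \ge |\hat\beta_{(M)}|\,\right]
\qquad \text{(hard thresholding: keep at the OLS value or drop),}
$$

$$
\hat\beta_j^{\text{lasso}} \;=\; \operatorname{sign}(\hat\beta_j)\,\bigl(|\hat\beta_j| - \lambda\bigr)_{+}
\qquad \text{(soft thresholding: drop or shrink toward zero).}
$$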

Answer

In subset selection, the nonzero parameters will only be unbiased if you have chosen a superset of the correct model, i.e., if you have removed only predictors whose true coefficient values are zero. If your selection procedure led you to exclude a predictor with a true nonzero coefficient, all coefficient estimates will be biased. This defeats your argument if you will agree that selection is typically not perfect.

Thus, to make “sure” of an unbiased model estimate, you should err on the side of including more, or even all potentially relevant predictors. That is, you should not select at all.

Why is this a bad idea? Because of the bias-variance tradeoff. Yes, your large model will be unbiased, but it will have a large variance, and the variance will dominate the prediction (or other) error.

Therefore, it is better to accept that parameter estimates will be biased but have lower variance (regularization), rather than hope that our subset selection has only removed true zero parameters so we have an unbiased model with larger variance.
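To make this concrete, here is a toy simulation sketch (entirely synthetic data and settings chosen arbitrarily by me, using scikit-learn) in which the unbiased full OLS fit loses to the biased but lower-variance lasso in out-of-sample error once there are many predictors and few observations:

```python
# Toy Monte Carlo comparison: unbiased full OLS vs. biased, lower-variance lasso.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV

rng = np.random.default_rng(1)
n, p = 50, 30                                    # few observations, many predictors
beta_true = np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)])  # only 3 true signals

ols_mse, lasso_mse = [], []
for _ in range(100):
    X = rng.normal(size=(n, p))
    y = X @ beta_true + rng.normal(scale=2.0, size=n)
    X_test = rng.normal(size=(1000, p))
    y_test = X_test @ beta_true + rng.normal(scale=2.0, size=1000)

    ols = LinearRegression().fit(X, y)           # unbiased, high variance
    lasso = LassoCV(cv=5).fit(X, y)              # biased, lower variance

    ols_mse.append(np.mean((y_test - ols.predict(X_test)) ** 2))
    lasso_mse.append(np.mean((y_test - lasso.predict(X_test)) ** 2))

print(f"mean test MSE  OLS: {np.mean(ols_mse):.2f}   lasso: {np.mean(lasso_mse):.2f}")
```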

Since you write that you assess both approaches using cross-validation, this mitigates some of the concerns above. One issue remains for Best Subset: it constrains some parameters to be exactly zero and lets the others float freely at their OLS values, so there is a discontinuity in the estimate at the point where a predictor enters or leaves the model. The lasso has no such discontinuity: as we tweak $\lambda$ across the value $\lambda_0$ at which a predictor $p$ enters or leaves the model, its coefficient estimate changes continuously. Suppose that cross-validation outputs an “optimal” $\lambda$ that is close to $\lambda_0$, so we are essentially unsure whether $p$ should be included or not. In this case, I would argue that it makes more sense to constrain the parameter estimate $\hat{\beta}_p$ via the lasso to a small (absolute) value, rather than either completely exclude it, $\hat{\beta}_p=0$, or let it float freely, $\hat{\beta}_p=\hat{\beta}_p^{\text{OLS}}$, as Best Subset does.
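To illustrate the continuity point, here is a small sketch using scikit-learn's `lasso_path` on synthetic data (note that scikit-learn calls the penalty $\lambda$ "alpha"): each coefficient shrinks smoothly to zero as the penalty grows, so an "optimal" penalty near $\lambda_0$ pins $\hat{\beta}_p$ to a small value instead of forcing the all-or-nothing choice that Best Subset makes.

```python
# Sketch of the lasso coefficient path on synthetic data: as the penalty
# increases, each coefficient shrinks continuously until it hits exactly zero,
# with no jump like Best Subset's keep-at-OLS-or-drop behaviour.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.5, 0.0]) + rng.normal(size=n)

# lasso_path returns the penalties (largest first) and the coefficients at each one.
alphas, coefs, _ = lasso_path(X, y, n_alphas=20)
for a, c in zip(alphas, coefs.T):
    print(f"lambda = {a:6.3f}   coefficients = {np.round(c, 3)}")
```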

This may be helpful: Why does shrinkage work?

Attribution
Source : Link , Question Author : Ville , Answer Author : Stephan Kolassa
