Understanding bootstrapping for validation and model selection

I think I understand the fundamentals of bootstrapping, but I’m not sure how I can use it for model selection or to avoid overfitting.

For model selection, for example, would you just choose the model that yields the lowest error (maybe variance?) across its bootstrap samples?

Are there any texts that discuss how to use bootstrapping for model selection or validation?

EDIT: See this thread and the answer by @mark999 for more context behind this question.

Answer

First you have to decide whether you really need model selection at all, or whether you just need to fit a model. In the majority of situations (depending on dimensionality), fitting a single flexible, comprehensive model is preferable.

The bootstrap is a great way to estimate the performance of a model. The simplest quantity to estimate is the variance of a statistic. More to your original point, the bootstrap can estimate the likely future performance of a given modeling procedure on new data not yet realized.
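
A common concrete recipe for that last use is the optimism-corrected bootstrap (in the same spirit as validate() in the answer author's R rms package). The sketch below is a minimal Python illustration assuming scikit-learn; the synthetic dataset, logistic model, AUC metric, and 300 resamples are placeholder choices, not part of the original answer.

```python
# Minimal sketch: optimism-corrected bootstrap estimate of a modeling
# procedure's future performance. Data, model, and metric are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Placeholder data; in practice X, y are your own dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def fit_and_score(X_fit, y_fit, X_eval, y_eval):
    """Run the full modeling procedure on (X_fit, y_fit); return AUC on (X_eval, y_eval)."""
    model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    return roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])

apparent = fit_and_score(X, y, X, y)        # performance of the model on its own training data

B = 300                                     # roughly 300 resamples, as the answer suggests
rng = np.random.default_rng(0)
optimism = []
for _ in range(B):
    idx = rng.integers(0, len(y), len(y))                        # bootstrap sample (rows drawn with replacement)
    apparent_b = fit_and_score(X[idx], y[idx], X[idx], y[idx])   # apparent AUC on the resample
    original_b = fit_and_score(X[idx], y[idx], X, y)             # same fit scored on the original data
    optimism.append(apparent_b - original_b)

corrected = apparent - np.mean(optimism)    # optimism-corrected estimate of future performance
print(f"apparent AUC: {apparent:.3f}  optimism-corrected AUC: {corrected:.3f}")
```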

If you use resampling (bootstrap or cross-validation) both to choose model tuning parameters and to estimate model performance, you will need a double bootstrap or nested cross-validation.
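
For intuition, here is a minimal sketch of nested cross-validation using scikit-learn: the inner loop chooses the tuning parameter, and the outer loop scores the entire tune-then-fit procedure. The model, grid, and data are illustrative assumptions; a double bootstrap applies the same idea with resampling at both levels.

```python
# Minimal sketch of nested cross-validation: GridSearchCV is the inner loop
# (picks the tuning parameter C), cross_val_score is the outer loop
# (estimates performance of the whole tuning-plus-fitting procedure).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},   # inner resampling chooses C
    cv=5,
)
outer_scores = cross_val_score(inner, X, y, cv=10, scoring="roc_auc")
print(f"estimated future AUC: {outer_scores.mean():.3f} (+/- {outer_scores.std():.3f})")
```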

In general, the bootstrap requires fewer model fits (often around 300) than cross-validation: 10-fold cross-validation should be repeated 50-100 times for stability, which amounts to 500-1,000 fits.
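
For comparison with the ~300 bootstrap refits above, a minimal sketch of repeated 10-fold cross-validation with scikit-learn (50 repeats, i.e. 500 fits); again the data and model are placeholders.

```python
# Minimal sketch: repeated 10-fold cross-validation for a stable performance
# estimate. 10 folds x 50 repeats = 500 model fits; all numbers illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=50, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
print(len(scores), scores.mean())  # 500 fold-level scores; average them for a stable estimate
```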

Some simulation studies may be found at http://biostat.mc.vanderbilt.edu/rms

Attribution
Source: Link, Question Author: Amelio Vazquez-Reina, Answer Author: Frank Harrell
