# What is the procedure for “bootstrap validation” (a.k.a. “resampling cross-validation”)?

“Bootstrap validation”/“resampling cross-validation” is new to me, but was discussed by the answer to this question. I gather it involves two types of data: the real data and simulated data, where a given set of simulated data is generated from the real data by resampling-with-replacement until the simulated data has the same size as the real data. I can think of two approaches to using such data types: (1) fit the model once, evaluate it many times on many simulated data sets; (2) fit the model many times using each of many simulated data sets, each time evaluate it against the real data. Which (if either) is best?

Short answer: Both validation techniques involve training and testing a number of models.

Long answer about how to do it best: That of course depends. But here are some thoughts that I use to guide my decisions about resampling validation. I’m a chemometrician, so these strategies and also the terms are more or less closely related to analytical-chemical problems.

To explain my thoughts a bit, I think of validation as measuring model quality, and of training as measuring model parameters; this leads to a quite powerful analogy to every other kind of measurement.

There are two different points of view to these approaches with respect to validation:

1. A traditional point of view for resampling validation is: the resampled data set (sometimes called surrogate data set or subset) is practically the same as the original (real) data set.
Therefore, a “surrogate model” fit to the surrogate data set is practically the same as the model fit with the whole real data set. But some samples are left out of the surrogate data set, so the model is independent of these. Thus, I take those left-out or out-of-bootstrap samples as an independent validation set for the surrogate model and use the result as an approximation for the whole-data-model.
However, the surrogate model often is not really equivalent to the whole-data-model: fewer samples were used for training (even for the bootstrap, the number of different samples is smaller). As long as the learning curve is increasing, the surrogate model is on average a bit worse than the whole-data-model. This is the well-known pessimistic bias of resampling validation (if you end up with an optimistic bias, that is usually an indicator that the left-out/oob test set was not independent of the model).

2. The second point of view is that the resampled data set is a perturbed version of the whole data set. Examining how the surrogate models (or their predictions for the left-out/oob samples) differ from the whole-data-model then tells something about model stability with respect to the training data.
From this perspective, the surrogate models are something like repeated measurements. Say your task is to measure the content of some mineral of a whole train of ore. The ore is not homogeneous. So you take physical samples from different locations and then look at the overall content and its variation across the train. Similarly, if you think your model may not be stable, you can look at the overall performance and variation of the surrogate models.
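Both points of view can be read off the same resampling loop. The following is a minimal sketch, not a definitive implementation: the toy data, the straight-line model and the names `out_of_bootstrap`, `fit_line` and `predict_line` are assumptions made up for illustration.

```python
import numpy as np

def out_of_bootstrap(X, y, fit, predict, n_iter=200, seed=0):
    """Return one out-of-bootstrap RMSE per surrogate model."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_iter):
        train = rng.integers(0, n, size=n)        # draw n indices with replacement
        oob = np.setdiff1d(np.arange(n), train)   # cases that were never drawn
        if oob.size == 0:                         # nothing left to test on, skip
            continue
        model = fit(X[train], y[train])           # surrogate model
        resid = predict(model, X[oob]) - y[oob]   # tested on unseen cases only
        errors.append(np.sqrt(np.mean(resid ** 2)))
    return np.array(errors)

def fit_line(X, y):
    # Assumed stand-in model: ordinary least-squares straight line.
    return np.polyfit(X, y, deg=1)

def predict_line(coef, X):
    return np.polyval(coef, X)

# Hypothetical toy data: a noisy straight line with 40 cases.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=40)
y = 2.0 * X + 1.0 + rng.normal(scale=1.0, size=40)

rmse = out_of_bootstrap(X, y, fit_line, predict_line)
print("mean oob RMSE:", rmse.mean())                      # view 1: performance estimate
print("sd across surrogate models:", rmse.std(ddof=1))    # view 2: stability
```

The mean over the surrogate models is used as the (slightly pessimistic) approximation of the whole-data-model's performance, while the spread across surrogate models corresponds to the "repeated measurements" reading.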

If you take that thought further, your approach (1) tells something about how much predictions of the same model vary for different samples of size $n$.
Your approach (2) is closer to the usual approaches. But as Momo already wrote, validation usually wants to measure the performance for unknown cases. Thus you need to take care that the testing is not done with cases that are already known to the model. In other words, only the left-out cases are tested. That is repeated many times (each model leaves out a different set of cases) in order to (a) measure and (b) average out as well as possible the variations due to the finite (small) sample sizes (for both testing and training).
I usually resample cases, e.g. one case = all measurements of one patient. Then the out-of-bag cases are all patients of which no measurements occur in the training data. This is useful if you know that measurements of one case are more similar to each other than to measurements of other cases (or at least you cannot exclude this possibility).
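Resampling at the level of cases rather than rows could look roughly like this. It is only a sketch: the function name `bootstrap_by_case` and the patient labels are hypothetical.

```python
import numpy as np

def bootstrap_by_case(case_ids, rng):
    """Resample whole cases with replacement.

    Returns row indices for the training set and for the out-of-bootstrap set.
    """
    case_ids = np.asarray(case_ids)
    cases = np.unique(case_ids)
    drawn = rng.choice(cases, size=cases.size, replace=True)
    # A case that is drawn twice contributes all of its rows twice.
    train_rows = np.concatenate([np.flatnonzero(case_ids == c) for c in drawn])
    # Out-of-bootstrap: rows of cases that were never drawn.
    oob_rows = np.flatnonzero(~np.isin(case_ids, drawn))
    return train_rows, oob_rows

# Hypothetical example: 5 patients with 3 measurements each.
patient = np.repeat(["p1", "p2", "p3", "p4", "p5"], 3)
rng = np.random.default_rng(0)
train_rows, oob_rows = bootstrap_by_case(patient, rng)

print("training patients:", sorted(set(patient[train_rows])))
print("out-of-bootstrap patients:", sorted(set(patient[oob_rows])))
```

Because the draw is over patients, no patient can end up with some measurements in training and others in testing, which is the property described above.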

Note that resampling validation allows you to measure performance for unknown samples. If in addition you want to measure the performance for unknown future samples (instrumental drift!), then you need a test set that is measured “in the future”, i.e. a certain time after all training samples were measured. In analytical chemistry, this is needed e.g. if you want to find out how often you need to redo the calibration of your instrument (for each determination, daily, weekly, monthly, …).

Bootstrap vs. cross validation terminology:

• resampling with replacement is often called bootstrap,
• resampling without replacement cross-validation.

Both can have some kind of stratification. Historically, the splitting for cross validation (at least in chemometrics) has often been done in a non-random fashion, e.g. a 3-fold cross validation of the form abcabc..abc (data set sorted wrt. the outcome) for calibration/regression if you have very few cases (physical samples), and you want to make sure that your whole data range is covered.
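The non-random abcabc..abc assignment can be sketched as follows. The helper name `systematic_folds` and the outcome values are made up for illustration.

```python
import numpy as np

def systematic_folds(y, k=3):
    """Assign folds in the pattern 0,1,2,0,1,2,... along the sorted outcome."""
    y = np.asarray(y)
    order = np.argsort(y, kind="stable")   # positions of cases from lowest to highest y
    folds = np.empty(len(y), dtype=int)
    folds[order] = np.arange(len(y)) % k   # a, b, c, a, b, c, ... in sorted order
    return folds

# Hypothetical outcome values for 9 physical samples.
y = np.array([0.1, 5.0, 2.2, 9.7, 3.3, 7.4, 1.0, 8.1, 4.6])
folds = systematic_folds(y, k=3)

for f in range(3):
    # Each fold contains low, medium and high outcome values.
    print(f, np.sort(y[folds == f]))
```

With these values every fold gets one case from the low, middle and high part of the range, so each training set still spans the whole data range.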

Both techniques are usually repeated/iterated a number of times. Again for historical reasons and at least in chemometrics, k-fold cross validation often means training and testing k models (each tested with the 1/kth of the data that was not involved in training). If such a random splitting is repeated, people call it iterated or repeated cross validation.
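As a sketch of what "iterated" means here, the following generator produces a fresh random $k$-fold split in every repetition. The name `repeated_kfold` and the sizes used are assumptions chosen for the example.

```python
import numpy as np

def repeated_kfold(n, k, n_repeats, seed=0):
    """Yield (repeat, train_idx, test_idx) for iterated k-fold cross validation."""
    rng = np.random.default_rng(seed)
    for r in range(n_repeats):
        perm = rng.permutation(n)                 # new random split in each repeat
        for test in np.array_split(perm, k):      # k roughly equal test parts
            train = np.setdiff1d(perm, test)      # the rest is used for training
            yield r, train, test

# 10 cases, 5 folds, 3 repetitions: 15 surrogate models in total.
splits = list(repeated_kfold(n=10, k=5, n_repeats=3))
print(len(splits))
```

Within one repetition every case is tested exactly once; across repetitions the same case is predicted by several different surrogate models.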

Also, the number of unique samples can (approximately) be chosen: for cross-validation via the $k$ of $k$-fold or the $n$ of leave-$n$-out cross validation. For bootstrap, you can draw more or less than $n$ samples into the subsample (this is rarely done).
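As a rough worked version of the bootstrap case: if $m$ cases are drawn with replacement from $n$ (the symbol $m$ is introduced here only for illustration), a given case is missed by all draws with probability $(1 - 1/n)^m$, so the expected number of distinct cases in the subsample is

$$
\mathbb{E}[\text{distinct cases}] = n\left[1 - \left(1 - \tfrac{1}{n}\right)^{m}\right],
\qquad
m = n:\quad 1 - \left(1 - \tfrac{1}{n}\right)^{n} \;\xrightarrow{\,n\to\infty\,}\; 1 - e^{-1} \approx 0.632 .
$$

So the standard bootstrap trains on roughly 63 % of the distinct cases and leaves roughly 37 % out of bootstrap, whereas $k$-fold cross validation trains on $(k-1)/k$ of the cases, e.g. 80 % for $k = 5$.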

• Note that the bootstrap is not appropriate for some model fitting techniques that first remove duplicate measurements.
• Some variants of the bootstrap exist, e.g. .632-bootstrap and .632+-bootstrap

Bootstrap resampling is said to be better (faster convergence, fewer iterations needed) than iterated $k$-fold cross validation. In a study for the kind of data I deal with, however, we found little overall difference: out-of-bootstrap had less variance but more bias than iterated $k$-fold cross validation.