# Should repeated cross-validation be used to assess predictive models?

I came across this 2012 article by Gitte Vanwinckelen and Hendrik Blockeel calling into question the utility of repeated cross-validation, which has become a popular technique for reducing the variance of cross-validation.

The authors demonstrated that while repeated cross-validation does decrease the variance of the cross-validation estimate, the same sample dataset is being resampled every time. The mean of the repeated cross-validation estimates therefore converges to a biased estimate of the true predictive accuracy, and hence, they argue, is not useful.

Should repeated cross-validation be used despite these limitations?

The argument that the paper seems to be making appears strange to me.

According to the paper, the goal of CV is to estimate $\alpha_2$, the expected predictive performance of the model on new data, given that the model was trained on the observed dataset $S$. When we conduct $k$-fold CV, we obtain an estimate $\hat A$ of this number. Because of the random partitioning of $S$ into $k$ folds, this is a random variable $\hat A \sim f(A)$ with mean $\mu_k$ and variance $\sigma^2_k$. In contrast, $n$-times-repeated CV yields an estimate with the same mean $\mu_k$ but smaller variance $\sigma^2_k/n$.
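To make the role of the random partitioning concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration and none of it comes from the paper: the toy 1-D two-class data, the nearest-class-mean classifier, the sample size of 60, and the helper names `make_data`, `kfold_accuracy` and `repeated_cv`. It holds one dataset $S$ fixed and compares the spread of $\hat A$ across partitions for plain 10-fold CV and for 10-times-repeated 10-fold CV.

```python
import random
import statistics

def make_data(n, rng):
    """Toy dataset: two overlapping 1-D Gaussian classes with balanced labels."""
    return [(rng.gauss(1.0 if i % 2 else -1.0, 1.5), i % 2) for i in range(n)]

def count_correct(train, test):
    """Nearest-class-mean classifier; returns the number of correct test predictions."""
    means = {c: statistics.fmean([x for x, y in train if y == c]) for c in (0, 1)}
    correct = 0
    for x, y in test:
        pred = min((0, 1), key=lambda c: abs(x - means[c]))
        correct += (pred == y)
    return correct

def kfold_accuracy(data, k, rng):
    """One run of k-fold CV using a fresh random partition of the same data."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        test = [data[i] for i in fold]
        correct += count_correct(train, test)
    return correct / len(data)

def repeated_cv(data, k, n_rep, rng):
    """Average of n_rep independent k-fold CV runs on the same data."""
    return statistics.fmean([kfold_accuracy(data, k, rng) for _ in range(n_rep)])

rng = random.Random(0)
data = make_data(60, rng)  # the single observed dataset S, held fixed

# Spread of the estimate over many random partitions of the same S.
single = [kfold_accuracy(data, 10, rng) for _ in range(100)]
repeated = [repeated_cv(data, 10, 10, rng) for _ in range(100)]
print("sd of single 10-fold CV:       ", statistics.pstdev(single))
print("sd of 10x repeated 10-fold CV: ", statistics.pstdev(repeated))
```

Both sets of estimates are centred on the same $\mu_k$ for this particular $S$; only the spread around it should differ.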

Obviously, $\alpha_2\ne \mu_k$. This bias is something we have to accept.

However, the expected error $\mathbb E\big[|\alpha_2-\hat A|^2\big]$ will be larger for smaller $n$, and will be the largest for $n=1$, at least under reasonable assumptions about $f(A)$, e.g. when $\hat A\mathrel{\dot\sim} \mathcal N(\mu_k,\sigma^2_k/n)$. In other words, repeated CV yields a more precise estimate of $\mu_k$, which is a good thing because it in turn gives a more precise estimate of $\alpha_2$.
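The dependence on $n$ can be written out with the usual bias-variance decomposition. This is a sketch that assumes only what was stated above, namely that the $n$-times-repeated estimate $\hat A$ has mean $\mu_k$ and variance $\sigma^2_k/n$:

$$
\mathbb E\big[(\alpha_2-\hat A)^2\big]
= (\alpha_2-\mu_k)^2 + \operatorname{Var}(\hat A)
= (\alpha_2-\mu_k)^2 + \frac{\sigma^2_k}{n}.
$$

The first term is the bias that repetition cannot remove; the second term shrinks as $n$ grows and is largest at $n=1$.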

Therefore, repeated CV is strictly more precise than non-repeated CV.

The authors do not argue with that! Instead they claim, based on the simulations, that

> reducing the variance [by repeating CV] is, in many cases, not very useful, and essentially a waste of computational resources.

This just means that $\sigma^2_k$ in their simulations was pretty low; and indeed, the lowest sample size they used was $200$, which is probably big enough to yield small $\sigma^2_k$. (The difference in estimates obtained with non-repeated CV and 30-times-repeated CV is always small.) With smaller sample sizes one can expect larger between-repetitions variance.

CAVEAT: Confidence intervals!

Another point that the authors are making is that

> the reporting of confidence intervals [in repeated cross-validation] is

It seems that they are referring to confidence intervals for the mean across CV repetitions. I fully agree that this is a meaningless thing to report! The more times CV is repeated, the smaller this CI will be, but nobody is interested in the CI around our estimate of $\mu_k$! We care about the CI around our estimate of $\alpha_2$.

The authors also report CIs for the non-repeated CV, and it’s not entirely clear to me how these CIs were constructed. I guess these are the CIs for the means across the $k$ folds. I would argue that these CIs are also pretty much meaningless!

Take a look at one of their examples: the accuracy on the adult dataset with the NB algorithm and a sample size of 200. They get 78.0% with non-repeated CV, CI (72.26, 83.74), 79.0% (77.21, 80.79) with 10-times-repeated CV, and 79.1% (78.07, 80.13) with 30-times-repeated CV. All of these CIs are useless, including the first one. The best estimate of $\mu_k$ is 79.1%. This corresponds to roughly 158 successes out of 200, which yields a 95% binomial confidence interval of (72.8, 84.5), broader even than the first one reported. If I wanted to report some CI, this is the one I would report.
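For reference, a binomial interval of this kind can be computed in a few lines. The sketch below uses the Wilson score interval, which is my assumption: the text above does not say which binomial method was used, and different methods (Wilson, Clopper-Pearson, normal approximation) give slightly different endpoints, so the output need not match (72.8, 84.5) to the last digit. The function name `wilson_interval` is just an illustrative choice.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z defaults to the 95% level)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# 158 correct predictions out of 200, as in the example above.
lo, hi = wilson_interval(158, 200)
print(f"95% interval: {100 * lo:.1f}% to {100 * hi:.1f}%")
```

Note that the width of this interval depends on the sample size of 200 and not on how many times CV was repeated.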

MORE GENERAL CAVEAT: variance of CV.

You wrote that repeated CV

> has become a popular technique for reducing the variance of cross-validation.

One should be very clear what one means by the “variance” of CV. Repeated CV reduces the variance of the estimate of $\mu_k$. Note that in the case of leave-one-out CV (LOOCV), when $k=N$, this variance is equal to zero, because there is only one way to partition $N$ points into $N$ folds. Nevertheless, it is often said that LOOCV actually has the highest variance of all possible $k$-fold CVs. See e.g. the thread "Variance and bias in cross-validation: why does leave-one-out CV have higher variance?"

Why is that? This is because LOOCV has the highest variance as an estimate of $\alpha_1$, which is the expected predictive performance on new data of a model built on a new dataset of the same size as $S$. This is a completely different issue.
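The earlier remark that the between-repetition variance is zero for LOOCV is easy to check numerically. The following is a small sketch under assumed conditions: toy 1-D data, a nearest-class-mean classifier, and a hypothetical helper `loocv_accuracy`. Reshuffling the data before LOOCV only changes the order in which points are held out, so every "repetition" returns the same estimate.

```python
import random

def loocv_accuracy(data, order):
    """LOOCV with a nearest-class-mean classifier, holding out points in `order`."""
    correct = 0
    for i in order:
        x_i, y_i = data[i]
        means = {}
        for c in (0, 1):
            xs = [x for j, (x, y) in enumerate(data) if j != i and y == c]
            means[c] = sum(xs) / len(xs)
        pred = min((0, 1), key=lambda c: abs(x_i - means[c]))
        correct += (pred == y_i)
    return correct / len(data)

rng = random.Random(0)
# Toy dataset: two overlapping 1-D Gaussian classes with balanced labels.
data = [(rng.gauss(1.0 if i % 2 else -1.0, 1.5), i % 2) for i in range(40)]

estimates = set()
for _ in range(5):
    order = list(range(len(data)))
    rng.shuffle(order)  # a new "repetition" of LOOCV
    estimates.add(loocv_accuracy(data, order))

print(len(estimates))  # number of distinct LOOCV estimates across repetitions
```

This says nothing about the variance of LOOCV as an estimate of $\alpha_1$, which would require drawing new datasets rather than reshuffling the same one.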