Is leave-one-out cross validation (LOOCV) known to systematically overestimate error?

Let’s assume that we want to build a regression model that predicts the temperature in a building. We start from a very simple model in which we assume that the temperature depends only on the weekday.

Now we want to use cross-validation to check whether our hypothesis is valid. For each weekday, we calculate the mean temperature using the whole data set. However, when we do leave-one-out validation, we take one observation and calculate the mean without this particular observation. As a result, whenever an observation goes up, the corresponding prediction (the mean calculated from the remaining values) goes down. So we have an anticorrelation between observations and predictions, and it should obviously decrease the measured accuracy of the model.
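To make the effect concrete, here is a minimal sketch for a single weekday, assuming NumPy and made-up temperatures (the function name, the seed and the numbers are illustrative, not from real data). Since the leave-one-out mean is (sum - y_i) / (n - 1), it is an exactly linear, decreasing function of the held-out value y_i, so within one weekday the correlation between observation and prediction is -1 up to rounding.

```python
import numpy as np


def loo_mean_predictions(y):
    """Predict each value from the mean of all the other values."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Mean of the remaining n - 1 values, computed for every i at once.
    return (y.sum() - y) / (n - 1)


rng = np.random.default_rng(0)
# Hypothetical temperatures observed on one weekday (assumption).
y = rng.normal(loc=20.0, scale=2.0, size=30)
n = y.size

pred = loo_mean_predictions(y)
corr = np.corrcoef(y, pred)[0, 1]

# Squared error against the full-sample mean vs. the leave-one-out mean.
in_sample_mse = np.mean((y - y.mean()) ** 2)
loo_mse = np.mean((y - pred) ** 2)

print("correlation between observation and LOO prediction:", corr)
print("in-sample MSE:", in_sample_mse)
print("leave-one-out MSE:", loo_mse)
print("ratio:", loo_mse / in_sample_mse, "vs (n / (n - 1))**2 =", (n / (n - 1)) ** 2)
```

In this setup the leave-one-out residual is n / (n - 1) times the in-sample residual, so the gap between the two error figures shrinks as n grows.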

So, my question is: is this a known effect, and how do I deal with it?


This effect occurs not only in leave-one-out but in k-fold cross-validation (CV) in general. Your training and validation sets are not independent, because any observation allocated to your validation set obviously influences your training set (since it is taken out of it).
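As a rough illustration of the k-fold case, the sketch below splits made-up data into k equally sized folds and compares the mean of each held-out fold with the mean of the corresponding training part. It assumes NumPy; the data, the seed and k = 5 are arbitrary choices for illustration. With equal fold sizes the training mean is a linear, decreasing function of the held-out mean, so the correlation across folds comes out as -1 up to rounding.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical temperatures (assumption), 30 values so that 5 folds have equal size.
y = rng.normal(loc=20.0, scale=2.0, size=30)

k = 5
# Shuffle the indices and cut them into k folds of 6 observations each.
folds = np.array_split(rng.permutation(y.size), k)

held_out_means = []
train_means = []
for idx in folds:
    train_mask = np.ones(y.size, dtype=bool)
    train_mask[idx] = False  # everything outside the fold is training data
    held_out_means.append(y[idx].mean())
    train_means.append(y[train_mask].mean())

fold_corr = np.corrcoef(held_out_means, train_means)[0, 1]
print("correlation between held-out mean and training mean:", fold_corr)
```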

To what extent this is the case depends on your data and your predictor. To give a very simple example using your daily-temperature task with leave-one-out: if your data contained only a single value repeated n times, then your mean predictor would predict the correct value in all n folds. And if you used a predictor that takes the maximum value from the training set (with the true value likewise being the maximum of the data), then your model would be correct in n - 1 folds, provided the maximum is unique (only the fold that removes the maximum value from the training set would be predicted incorrectly). That is, there are predictors and datasets for which leave-one-out is more or less suitable.
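The two toy cases above can be checked with a short sketch. It assumes NumPy, and the helper name `count_correct_folds` and the data values are made up for illustration.

```python
import numpy as np


def count_correct_folds(y, predictor, targets):
    """Count leave-one-out folds where predictor(training part) hits the target."""
    y = np.asarray(y, dtype=float)
    targets = np.asarray(targets, dtype=float)
    correct = 0
    for i in range(y.size):
        train = np.delete(y, i)  # training set without observation i
        if np.isclose(predictor(train), targets[i]):
            correct += 1
    return correct


# Case 1: the same value n times, mean predictor, target = held-out value.
constant = np.full(7, 21.0)
mean_correct = count_correct_folds(constant, np.mean, targets=constant)

# Case 2: maximum predictor, target = maximum of the whole data set.
varied = np.array([18.0, 19.5, 21.0, 22.5, 25.0])
max_targets = np.full(varied.size, varied.max())
max_correct = count_correct_folds(varied, np.max, targets=max_targets)

print("mean predictor on constant data:", mean_correct, "of", constant.size)
print("max predictor on varied data:", max_correct, "of", varied.size)
```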

Specifically, your mean estimator has two properties:

  1. It depends on all examples in the training set. In real-world, non-trivial datasets (unlike my example above), it will therefore predict a different value in each fold. A maximum predictor, for example, would not show this behavior.
  2. It is sensitive to outliers: removing an extremely high or low value in one of the folds will have a relatively large impact on your prediction. A median predictor, for example, would not show this behavior to the same extent.
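Both properties can be seen in a small sketch that compares the leave-one-out predictions of a mean, a median and a maximum predictor. It assumes NumPy; the helper `loo_predictions` and the data, including the outlier of 35.0, are invented for illustration.

```python
import numpy as np


def loo_predictions(y, predictor):
    """Apply the predictor to the data with each observation left out in turn."""
    y = np.asarray(y, dtype=float)
    return np.array([predictor(np.delete(y, i)) for i in range(y.size)])


# Hypothetical temperatures with one deliberately extreme value (assumption).
y = np.array([19.0, 20.0, 20.5, 21.0, 21.5, 22.0, 35.0])

mean_preds = loo_predictions(y, np.mean)
median_preds = loo_predictions(y, np.median)
max_preds = loo_predictions(y, np.max)

# Property 1: how many different predictions appear across the folds.
n_distinct_mean = len(np.unique(mean_preds))
n_distinct_max = len(np.unique(max_preds))

# Property 2: how far the predictions move between folds.
mean_range = mean_preds.max() - mean_preds.min()
median_range = median_preds.max() - median_preds.min()

print("distinct mean predictions:", n_distinct_mean)
print("distinct max predictions:", n_distinct_max)
print("range of mean predictions:", mean_range)
print("range of median predictions:", median_range)
```

On this toy data the mean prediction changes in every fold and moves most when the outlier is left out, while the median moves much less and the maximum changes only in the single fold that removes it.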

This means your mean predictor is somewhat unstable by design. You can either accept this (especially if the observed variance is not large) or choose a different predictor. However, as pointed out earlier, this also depends on your dataset: if your dataset is small and has high variance, the instability of the mean predictor increases. So having a sufficiently large dataset with proper pre-processing (potentially removing outliers) could be another way to approach this. Also, I’d keep in mind that there is no perfect method for measuring accuracy.

The paper "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection" (Kohavi, 1995) is a good starting point for this topic. It focuses on classification, but it is still a good read for more details and pointers to further reading.

Source: Link, Question Author: Roman, Answer Author: Sammy
