# Mean(scores) vs Score(concatenation) in cross validation

### TLDR:

My dataset is pretty small (120 samples). While doing 10-fold cross validation, should I:

1. Collect the outputs from each test fold, concatenate them into a vector, and then compute the error on this full vector of predictions (120 samples)?

2. Or should I instead compute the error on the outputs I get on each fold (with 12 samples per fold), and then get my final error estimate as the average of the 10 fold error estimates?

Are there any scientific papers that argue the differences between these techniques?

### Background: Potential Relationship to Macro/Micro scores in multi-label classification:

I think this question may be related to the difference between micro and macro averages that are often used in multi-label classification tasks (e.g. with 5 labels).

In the multi-label setting, micro average scores are computed by making an aggregated contingency table of true positive, false positive, true negative, false negative for all 5 classifier predictions on 120 samples. This contingency table is then used to compute the micro precision, micro recall and micro f-measure. So when we have 120 samples and five classifiers, the micro measures are computed on 600 predictions (120 samples * 5 labels).

When using the macro variant, one computes the measures (precision, recall, etc.) independently on each label and then averages these per-label measures.

The idea behind the difference between micro and macro estimates may be extended to what can be done in a K-fold setting in a binary classification problem. For 10-fold cross validation we can either average over the 10 per-fold values (macro measure) or concatenate the predictions of the 10 folds and compute the micro measures.

### Background – Expanded example:

The following example illustrates the question. Let’s say we have 10 folds with 12 test samples each:

• Fold 1: TP = 4, FP = 0, TN = 8, Precision = 1.0
• Fold 2: TP = 4, FP = 0, TN = 8, Precision = 1.0
• Fold 3: TP = 4, FP = 0, TN = 8, Precision = 1.0
• Fold 4: TP = 0, FP = 12, Precision = 0
• Fold 5 .. Fold 10: all have the same TP = 0, FP = 12 and Precision = 0

where I used the following notation:

TP = # of True Positives,
FP = # of False Positives,
TN = # of True Negatives

The results are:

• Average precision across 10 folds = 3/10 = 0.3
• Precision on the concatenation of the predictions of the 10 folds = TP/(TP+FP) = 12/(12+84) = 0.125

Note that the values 0.3 and 0.125 are very different!
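The two numbers can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the example: the `precision` helper is written for this illustration, and returning 0.0 when nothing is predicted positive is an assumed convention that the example never actually triggers.

```python
# Per-fold confusion counts from the example: (TP, FP).
# Folds 1-3 have TP=4, FP=0; folds 4-10 have TP=0, FP=12.
folds = [(4, 0)] * 3 + [(0, 12)] * 7


def precision(tp, fp):
    """Precision = TP / (TP + FP).

    Assumed convention: 0.0 if no case was predicted positive.
    """
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


# Option 2 from the question: precision per fold, then the plain average.
mean_of_fold_precisions = sum(precision(tp, fp) for tp, fp in folds) / len(folds)

# Option 1 from the question: pool all predictions, then one precision.
total_tp = sum(tp for tp, _ in folds)
total_fp = sum(fp for _, fp in folds)
pooled_precision = precision(total_tp, total_fp)

print(mean_of_fold_precisions)  # 0.3
print(pooled_precision)         # 0.125
```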

The described difference is IMHO bogus.

You’ll observe it only if the distribution of truly positive cases (i.e. the reference method says it is a positive case) is very unequal over the folds (as in the example) and the number of relevant test cases (the denominator of the performance measure we’re talking about; for precision, the cases predicted positive, TP + FP) is not taken into account when averaging the fold averages.

If you weight the first three fold averages with $\frac{4}{12} = \frac{1}{3}$ (as only 4 of the 12 test cases in each of these folds are relevant for the calculation of the precision), and the last 7 fold averages with 1 (all test cases relevant for the precision calculation), the weighted average is exactly the same as you’d get from pooling the predictions of the 10 folds and then calculating the precision.
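Written out, with $p_i$ for the precision of fold $i$ and $w_i$ for its weight (both symbols are notation introduced here, not taken from the question), the weighting works as follows. Folds 1 to 3 have $w_i = \frac{1}{3}$ and $p_i = 1$; the seven folds 4 to 10 have $w_i = 1$ and $p_i = 0$:

$$
\frac{\sum_{i=1}^{10} w_i \, p_i}{\sum_{i=1}^{10} w_i}
= \frac{3 \cdot \frac{1}{3} \cdot 1 + 7 \cdot 1 \cdot 0}{3 \cdot \frac{1}{3} + 7 \cdot 1}
= \frac{1}{8} = 0.125
$$

More generally, since $w_i = \frac{TP_i + FP_i}{12}$ and $p_i = \frac{TP_i}{TP_i + FP_i}$, the constant 12 cancels and the weighted average reduces to the pooled precision:

$$
\frac{\sum_i w_i \, p_i}{\sum_i w_i}
= \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}
= \frac{12}{96} = 0.125
$$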

Yes, you should run iterations of the whole $k$-fold cross validation procedure.
From that, you can get an idea of the stability of the predictions of your models:

• How much do the predictions change if the training data is perturbed by exchanging a few training samples?
• I.e., how much do the predictions of different “surrogate” models vary for the same test sample?
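A minimal sketch of such a stability check is given below. Everything in it is illustrative: the data are synthetic, the classifier is a toy threshold rule, and the function names are made up for this example. The point is only the bookkeeping: each repetition of the $k$-fold procedure uses a new random split, and for every sample we record what each surrogate model predicted when that sample was in the test fold.

```python
import random


def fit_threshold(xs, ys):
    """Toy classifier: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2


def repeated_kfold_predictions(xs, ys, k=10, repeats=20, seed=0):
    """For each sample, collect its test prediction from every repetition."""
    rng = random.Random(seed)
    n = len(xs)
    preds = [[] for _ in range(n)]
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                  # new random fold assignment
        for f in range(k):
            test = idx[f::k]              # every k-th shuffled index
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            thr = fit_threshold([xs[i] for i in train],
                                [ys[i] for i in train])
            for i in test:
                preds[i].append(int(xs[i] > thr))
    return preds


# Synthetic data (hypothetical): 120 samples, two overlapping 1-D classes.
data_rng = random.Random(42)
ys = [i % 2 for i in range(120)]
xs = [data_rng.gauss(1.5 * y, 1.0) for y in ys]

preds = repeated_kfold_predictions(xs, ys)

# A sample is "unstable" if different surrogate models disagree about it.
unstable = sum(1 for p in preds if len(set(p)) > 1)
print(f"{unstable} of {len(preds)} samples change prediction across repetitions")
```

If no sample (or hardly any) changes its prediction between repetitions, the surrogate models are stable in the sense used here.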

You were asking for scientific papers:

### Underestimating variance
Ultimately, your data set has finite (n = 120) sample size, regardless of how many iterations of bootstrap or cross validation you do.

• You have (at least) 2 sources of variance in the resampling (cross validation and out-of-bootstrap) validation results:
  • variance due to the finite number of (test) samples
  • variance due to instability of the predictions of the surrogate models
• If your models are stable, then
  • iterations of $k$-fold cross validation are not needed (they don’t improve the performance estimate: the average over each run of the cross validation is the same).
  • However, the performance estimate is still subject to variance due to the finite number of test samples.
  • If your data structure is “simple” (i.e. one single measurement vector for each statistically independent case), you can assume that the test results are the results of a Bernoulli process (coin-throwing) and calculate the finite-test-set variance.
• Out-of-bootstrap looks at the variance between each surrogate model’s predictions. That is possible with the cross validation results as well, but it is uncommon. If you do this, you’ll see variance due to finite sample size in addition to the instability. However, keep in mind that some pooling has (usually) taken place already: for cross validation usually $\frac{n}{k}$ results are pooled, and for out-of-bootstrap a varying number of left-out samples are pooled.

This makes me personally prefer cross validation (for the moment), as it is easier to separate instability from finite test sample sizes.
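To make the finite-test-set variance under the Bernoulli assumption concrete, here is a rough sketch. The observed proportion of 0.80 is a hypothetical value chosen for illustration, and the interval uses the simple normal approximation, which is an assumption of this sketch and can be poor for small denominators or proportions near 0 or 1.

```python
import math


def proportion_std_error(p_hat, n_test):
    """Standard error of a proportion from n_test independent Bernoulli trials."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_test)


n_test = 120              # every sample is tested once per run of k-fold CV
observed_accuracy = 0.80  # hypothetical value, for illustration only

se = proportion_std_error(observed_accuracy, n_test)

# Normal-approximation (Wald) interval; an assumption of this sketch.
low = observed_accuracy - 1.96 * se
high = observed_accuracy + 1.96 * se

print(f"standard error = {se:.4f}")
print(f"approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

Note that `n_test` has to be the denominator of the measure in question: all 120 cases for accuracy, but only the cases predicted positive for precision. No number of additional cross validation iterations reduces this part of the variance.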