When using k-fold CV to select among regression models, I usually compute the CV error separately for each model, together with its standard error (SE), and I select the simplest model within 1 SE of the model with the lowest CV error (the one-standard-error rule, see for example here). However, I’ve recently been told that this way I’m overestimating the variability, and that in the specific case of selecting between two models A and B, I should really proceed differently:

- for each fold $K$ of length $N_K$, compute the pointwise differences between the two models’ predictions. Then compute the mean square difference for the fold: $MSD_K=\sqrt{\frac{\sum_{i=1}^{N_K}\left(\hat{y}_{Ai}-\hat{y}_{Bi}\right)^2}{N_K}}$
- average $MSD_K$ across folds as usual, and use this CV difference error (together with its standard error) as an estimator of the generalization error.
Questions:

- Does this make sense to you? I know there are theoretical reasons behind the use of CV error as an estimator of generalization error (I don’t know what these reasons are, but I know they exist!). I have no idea if there are theoretical reasons behind the use of this “difference” CV error.
- I don’t know if this can be generalized to the comparison of more than two models. Computing the differences for all pairs of models seems risky (multiple comparisons?): what would you do if you had more than two models?
EDIT: my formula is totally wrong; the correct metric is described here and it’s much more complicated. Well, I’m happy I asked here before blindly applying the formula! I thank @Bay for helping me understand with his/her illuminating answer. The correct measure described is quite experimental, so I will stick to my trusted workhorse, the CV error!

**Answer**

The $MSD_K$ is an odd measure of generalization error, since the holdout set doesn’t even come into the picture. All it will tell you is how similar the two models’ predictions are to each other, but nothing about how well either one actually predicts the test data points.

For example, I could come up with a dumb pair of predictors:

$$\hat{y}_A(x,\theta)=1+\frac{\langle x,1\rangle}{\theta}$$

$$\hat{y}_B(x,\theta):=1+\frac{\langle x,1\rangle}{\theta^2}$$

In this case, tuning on cross validation would tell me to set $\theta$ as large as possible, since that would drive down the $MSD_K$, but I doubt these models would be good predictors.
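To make this concrete, here is a minimal sketch. It assumes the two predictors are read as $1+\langle x,1\rangle/\theta$ and $1+\langle x,1\rangle/\theta^2$, and it uses a made-up held-out fold (the data-generating line and all function names are hypothetical, chosen only for illustration). The point is that the difference measure never looks at `ys`, so it can shrink while the actual prediction error stays large.

```python
import math
import random


def y_hat_a(x, theta):
    # Assumed reading of model A: 1 + <x, 1> / theta, where <x, 1> is the sum of the features.
    return 1.0 + sum(x) / theta


def y_hat_b(x, theta):
    # Assumed reading of model B: 1 + <x, 1> / theta^2.
    return 1.0 + sum(x) / theta ** 2


def msd(xs, theta):
    # Root of the mean squared difference between the two models' predictions.
    # Note that the observed responses are not an argument of this function.
    sq = [(y_hat_a(x, theta) - y_hat_b(x, theta)) ** 2 for x in xs]
    return math.sqrt(sum(sq) / len(sq))


def rmse(predict, xs, ys, theta):
    # Ordinary held-out error: compares predictions with the observed responses.
    sq = [(predict(x, theta) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(sq) / len(sq))


random.seed(0)
# Hypothetical held-out fold: y = 3 * <x, 1> + small noise.
xs = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
ys = [3.0 * sum(x) + random.gauss(0, 0.1) for x in xs]

for theta in (2.0, 10.0, 1000.0):
    print(theta, msd(xs, theta), rmse(y_hat_a, xs, ys, theta))
```

Running it, the difference measure falls as $\theta$ grows, while the held-out error of model A does not improve.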

I took a look at the link, but I didn’t see your $MSD_K$ measure there. Andrew Gelman is a well-respected statistician, so I doubt he’d endorse something like the above, which clearly fails as an estimator of generalization error. His paper and the link discuss leave-one-out (LOO) cross validation, which still requires a comparison with a test data point (i.e., one held out from training) as the benchmark. The $MSD_K$ is a purely “inward-looking” metric that won’t tell you anything about the expected test error (except perhaps that the two models may have similar errors…).

Response to OP comment

The formula presented in your comment requires a bit of context:

- It is a Bayesian measure of accuracy, in that *elpd* is the *expected log pointwise predictive density*. That is quite a mouthful, but basically it is the sum of expected values of the logarithm of the posterior predictive density evaluated at each data point, under some prior predictive density that is estimated using cross validation.
- The above measure (*elpd*) is calculated using leave-one-out cross-validation, where the predictive density is taken at the omitted point.
- What their formula (19) is doing is calculating the standard error of the difference in predictive accuracy (measured using *elpd*) between two models. The idea is that the difference in *elpd*s is asymptotically normal, so the standard error has inferential meaning: it can be used to test whether the underlying difference is zero, or whether Model A has a smaller prediction error than Model B.
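The idea of a standard error for a paired difference can be sketched as follows. This is only my reading of the general approach, not a transcription of their formula (19): I assume each model supplies one pointwise score per data point, and that the standard error of the summed difference is taken as $\sqrt{n\,\widehat{\mathrm{Var}}(d_i)}$, where $d_i$ is the pointwise difference and the variance is the usual sample variance. The function name and the numbers are made up for illustration.

```python
import math
import statistics


def paired_difference_se(pointwise_a, pointwise_b):
    """Return (summed difference, standard error) for two models' pointwise scores.

    Assumption: the scores are paired, i.e. entry i of both lists refers to
    the same held-out data point.
    """
    if len(pointwise_a) != len(pointwise_b):
        raise ValueError("both models need one score per data point")
    diffs = [a - b for a, b in zip(pointwise_a, pointwise_b)]
    n = len(diffs)
    total = sum(diffs)
    # Variance of a sum of n paired differences, estimated as n times the sample variance.
    se = math.sqrt(n * statistics.variance(diffs))
    return total, se


# Hypothetical pointwise log predictive densities for models A and B.
elpd_a = [-1.2, -0.8, -1.5, -0.9, -1.1, -1.3]
elpd_b = [-1.4, -0.9, -1.6, -1.2, -1.0, -1.5]

diff, se = paired_difference_se(elpd_a, elpd_b)
print(diff, se, diff / se)
```

Working with the pointwise differences, rather than with the two models' separate standard errors, is what lets the pairing reduce the variability: points that are hard for both models largely cancel out in the difference.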

So, there are a lot of moving parts to this measure: you need to have run an MCMC sampling algorithm to get points from the posterior parameter density. You then need to integrate it to get predictive densities. Then you need to take expected values of each of these (over many draws). It’s quite a process, but in the end it’s supposed to give a useful standard error.
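As a rough illustration of the "integrate over posterior draws" step, here is a minimal sketch with heavy assumptions: a normal likelihood, and randomly generated `(mu, sigma)` pairs standing in for real MCMC output. All names are hypothetical. It uses the same draws for every point, so it gives the in-sample log pointwise predictive density; for leave-one-out, the draws for point $i$ would have to come from a posterior fitted without that point (or be reweighted to approximate it).

```python
import math
import random


def normal_logpdf(y, mu, sigma):
    # Log density of a normal distribution; the likelihood is an assumption of this sketch.
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)


def log_mean_exp(values):
    # Numerically stable log of the average of exp(values).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def log_pointwise_predictive_density(y_i, draws):
    # Average the likelihood of y_i over the posterior draws, then take the log.
    return log_mean_exp([normal_logpdf(y_i, mu, sigma) for mu, sigma in draws])


random.seed(1)
# Stand-in for MCMC output: hypothetical posterior draws of (mu, sigma).
draws = [(random.gauss(0.0, 0.1), 1.0) for _ in range(1000)]
y = [0.3, -1.2, 0.8]

pointwise = [log_pointwise_predictive_density(y_i, draws) for y_i in y]
print(pointwise, sum(pointwise))
```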

**Note:** In the third full paragraph below equation (19), the authors state that more research is needed to determine if this approach performs well for model comparison, so it’s not well tested yet (highly experimental). Thus, you are basically trusting in the usefulness of this method until follow-up studies verify that it reliably identifies the better model (in terms of *elpd*).

**Attribution** *Source: Link, Question Author: DeltaIV, Answer Author: Community*