I conducted a computer-based assessment of different methods of fitting a particular type of model used in the palaeo sciences. I had a large-ish training set, so I set aside a test set by stratified random sampling. I fitted *m* different methods to the training set samples and, using the *m* resulting models, predicted the response for the test set samples and computed an RMSEP over the samples in the test set. This is a single run. I then repeated this process a large number of times, each time choosing a different training set by randomly sampling a new test set.
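A minimal sketch of this procedure in base R may make the setup concrete. Everything in it is an assumption for illustration: the simulated data, the `stratum` variable, and the three linear models standing in for the *m* methods are hypothetical and are not the palaeo models or data described above.

```r
# Hypothetical simulated data: a response y, two predictors, and a stratum label.
set.seed(42)
n <- 300
dat <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  stratum = factor(rep(c("a", "b", "c"), each = n / 3))
)
dat$y <- 1 + 2 * dat$x1 - dat$x2^2 + rnorm(n)

# Stand-ins for the m methods: three linear models of differing complexity.
methods <- list(
  A = y ~ x1,
  B = y ~ x1 + x2,
  C = y ~ x1 + x2 + I(x2^2)
)

# Root mean squared error of prediction.
rmsep <- function(obs, pred) sqrt(mean((obs - pred)^2))

# One run: draw a stratified test set, fit each method to the remaining
# samples, and compute RMSEP on the held-out samples.
one_run <- function(run, data, test_frac = 0.2) {
  idx_by_stratum <- split(seq_len(nrow(data)), data$stratum)
  test_idx <- unlist(lapply(idx_by_stratum, function(idx) {
    sample(idx, size = round(test_frac * length(idx)))
  }))
  train <- data[-test_idx, ]
  test <- data[test_idx, ]
  err <- vapply(methods, function(f) {
    fit <- lm(f, data = train)
    rmsep(test$y, predict(fit, newdata = test))
  }, numeric(1))
  data.frame(Run = run, method = names(methods), RMSEP = unname(err))
}

# Repeat the run many times; each run uses a freshly sampled test set.
n_runs <- 50
FOO <- do.call(rbind, lapply(seq_len(n_runs), one_run, data = dat))
FOO$Run <- factor(FOO$Run)
FOO$method <- factor(FOO$method)
head(FOO)
```

The result is a long-format data frame with one RMSEP value per method per run. Because every run resamples from the same `dat`, the test sets of different runs overlap.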

Having done this, I want to investigate whether any of the *m* methods has better or worse RMSEP performance. I would also like to perform pairwise multiple comparisons of the methods.

My approach has been to fit a linear mixed effects (LME) model, with a single random effect for Run. I used `lmer()` from the lme4 package to fit my model and functions from the multcomp package to perform the multiple comparisons. My model was essentially

`lmer(RMSEP ~ method + (1 | Run), data = FOO)`

where `method` is a factor indicating which method was used to generate the model predictions for the test set and `Run` is an indicator for each particular Run of my “experiment”.

My question is in regard to the residuals of the LME. Given the single random effect for Run, I am assuming that the RMSEP values for that run are correlated to some degree but are uncorrelated between runs, on the basis of the induced correlation the random effect affords.

Is this assumption of independence between runs valid? If not, is there a way to account for this in the LME model, or should I be looking to employ another type of statistical analysis to answer my question?
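For reference, the model and the multiple comparisons described above could be sketched as follows. The data frame here is simulated and purely hypothetical (three methods, 50 runs, made-up means and variances), and the use of Tukey-type all-pairs contrasts via `glht()` and `mcp()` is an assumption about which multcomp functions were meant. The sketch only reproduces the model as stated; it does not address whether runs are independent.

```r
library(lme4)      # lmer()
library(multcomp)  # glht(), mcp()

# Hypothetical stand-in for FOO: 50 runs x 3 methods, with a shared
# per-run shift that induces correlation among RMSEP values within a run.
set.seed(1)
n_runs <- 50
FOO <- expand.grid(method = c("A", "B", "C"), Run = seq_len(n_runs))
FOO$Run <- factor(FOO$Run)
run_shift <- rnorm(n_runs, sd = 0.3)
method_mean <- c(A = 2.0, B = 1.8, C = 1.5)
FOO$RMSEP <- unname(method_mean[as.character(FOO$method)]) +
  run_shift[as.integer(FOO$Run)] +
  rnorm(nrow(FOO), sd = 0.1)

# Fixed effect for method, random intercept for Run.
mod <- lmer(RMSEP ~ method + (1 | Run), data = FOO)

# All pairwise (Tukey-type) comparisons among the levels of `method`.
comps <- glht(mod, linfct = mcp(method = "Tukey"))
summary(comps)
```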

**Answer**

You are essentially doing some form of cross-validation here for each of your *m* methods and would then like to see which method performed better. The results between runs will definitely be dependent, since they are based on the same data and you have overlap between your train/test sets. The question is whether this should matter when you come to compare the methods.

Suppose you performed only one run and found that one method was better than the others. You would then ask yourself: is this simply due to the specific choice of test set? This is why you repeat your test for many different train/test sets. So, in order to determine that a method is better than the other methods, you run many times and in each run compare it to the other methods (you have different options here, such as looking at the error or the rank). Now, if you find that a method does better on most runs, the result is what it is. I am not sure it is helpful to attach a p-value to this. Or, if you do want to give a p-value, ask yourself: what is the background model here?
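One way to make the per-run comparison concrete is to summarise ranks and win proportions without attaching a p-value. The sketch below uses base R on a hypothetical matrix of RMSEP values (rows are runs, columns are methods); the numbers are simulated for illustration only.

```r
# Hypothetical RMSEP values: 50 runs (rows) x 3 methods (columns).
set.seed(2)
n_runs <- 50
run_shift <- rnorm(n_runs, sd = 0.3)  # shared per-run difficulty
rmsep <- cbind(
  A = 2.0 + run_shift + rnorm(n_runs, sd = 0.1),
  B = 1.8 + run_shift + rnorm(n_runs, sd = 0.1),
  C = 1.5 + run_shift + rnorm(n_runs, sd = 0.1)
)

# Rank the methods within each run (1 = lowest RMSEP in that run).
ranks <- t(apply(rmsep, 1, rank))
mean_rank <- colMeans(ranks)

# Proportion of runs in which each method had the lowest RMSEP.
best <- colnames(rmsep)[apply(rmsep, 1, which.min)]
prop_best <- table(factor(best, levels = colnames(rmsep))) / n_runs

# Pairwise win proportions: entry [i, j] is the share of runs in which
# method i had a lower RMSEP than method j.
m <- ncol(rmsep)
wins <- matrix(NA_real_, m, m,
               dimnames = list(colnames(rmsep), colnames(rmsep)))
for (i in seq_len(m)) {
  for (j in seq_len(m)) {
    if (i != j) wins[i, j] <- mean(rmsep[, i] < rmsep[, j])
  }
}

mean_rank
prop_best
wins
```

Comparing methods within a run removes whatever is common to that run's train/test split, which is the same role the random effect for Run plays in the mixed model.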

**Attribution** *Source: Link, Question Author: Gavin Simpson, Answer Author: Bitwise*