Name of “reshuffle trick” (randomly permute the dataset to estimate the bias of an estimator)

Do you know a reference or name for the following way to investigate whether a complex modelling technique $T$ is biased?

  1. Apply $T$ to the original data set and measure its performance $P$ (e.g. R-squared in a regression setting).
  2. Randomly permute the response variable to get a new data set. Apply $T$ and measure its performance $P'$. [If the observations are dependent, this step is more complicated.]

If $P'$ is substantially different from zero performance, we conclude that $T$ is biased.
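For orientation, here is a sketch under the textbook assumptions of Gaussian errors, an intercept, and $p$ predictors unrelated to the response. Even plain OLS without any selection step does not give zero in-sample R-squared under the null: the overall $F$-statistic is $F(p, n-p-1)$-distributed, which implies

$$R^2 \sim \mathrm{Beta}\left(\frac{p}{2}, \frac{n-p-1}{2}\right), \qquad \mathbb{E}[R^2] = \frac{p/2}{p/2 + (n-p-1)/2} = \frac{p}{n-1}.$$

With $n = 100$ and $p = 30$ as in the R example, this gives $\mathbb{E}[R^2] = 30/99 \approx 0.30$, so a naive in-sample R-squared is biased upward even before any selection step is added.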

Step 2 can be repeated if resources allow, which yields the permutation null distribution of the performance measure. In my application, however, I cannot do this because of limited computational resources.
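Where resources do allow it, repeating Step 2 is straightforward. The following is a minimal sketch in R that reuses the setup, `T` and `P` of the R example in this question (restated so the block runs on its own). The function name `permutation_null`, the seed, and `B = 20` are illustrative assumptions; in practice `B` would be much larger.

```r
# Sketch: repeat Step 2 B times to approximate the permutation null
# distribution of the performance measure P(T(data)).
# permutation_null() is a hypothetical helper; seed and B are illustrative.
set.seed(1)

n <- 100
p <- 30
X <- matrix(rnorm(n * p), nrow = n)
data <- data.frame(y = rnorm(n), X)   # columns y, X1, ..., X30

# Modelling technique and performance measure as in the example
T <- function(data) step(lm(y ~ ., data = data), trace = 0)
P <- function(fit) summary(fit)$r.squared

permutation_null <- function(data, T, P, B = 20) {
  vapply(seq_len(B), function(b) {
    shuffled <- data
    shuffled$y <- sample(shuffled$y)   # break any link between y and X
    P(T(shuffled))
  }, numeric(1))
}

observed  <- P(T(data))
null_dist <- permutation_null(data, T, P, B = 20)

# Rough Monte Carlo estimate of how often a permuted run does at least
# as well as the original data
p_hat <- mean(null_dist >= observed)
summary(null_dist)
```

If the null distribution sits well above zero, as it does for backward selection on pure noise, that is the bias the single-reshuffle check is meant to reveal.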

I vaguely remember that this “reshuffling” trick was used by someone to investigate the bias of leave-one-out cross-validation (in some setting). However, I don’t know whether he was in my situation, where the whole process could be run only once.

An example in R that shows the “power” of naive backward selection:

```r
# Generate random data set. Only random performance is expected.
n <- 100
p <- 30

y <- rnorm(n)
X <- rnorm(n*p)
dim(X) <- c(n, p)
data <- data.frame(y, X)

# Modelling technique: backward selection with OLS
T <- function(data) {
  step(lm(y ~ ., data = data), trace = 0)
}

# Performance: R-squared
P <- function(fit) {
  summary(fit)$r.squared
}

# Step 1: Compute performance on original data. Happily publish high R-squared...
P(T(data)) # 0.240405 in one run (no seed is set, so values vary)

# Step 2: Your mean colleague reshuffles the response and also gets an R-squared far from 0
data$y <- data$y[sample(n)]
P(T(data)) # 0.1925726 in one run
```

Conclusion from the example: the chosen modelling technique is extremely prone to overfitting, at least in this specific setting.

Some background

I once used this reshuffling trick to check whether I had implemented cross-validation of a tedious modelling process correctly. Under a random permutation, CV gave an R-squared of essentially 0 (as expected and desired).
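A sketch of such a check, assuming K-fold cross-validation and the backward-selection technique from the example above: the key point is that the whole technique, including the selection step, is refitted inside each training fold. The helper `cv_r_squared`, `K = 5`, and the seed are illustrative assumptions, not the procedure used in the question.

```r
# Sketch: K-fold cross-validated R-squared with the whole modelling
# technique refitted in each training fold. With a permuted response,
# a correct implementation should give a value around (or below) 0.
# cv_r_squared() is a hypothetical helper; K and the seed are illustrative.
set.seed(2)

n <- 100
p <- 30
X <- matrix(rnorm(n * p), nrow = n)
data <- data.frame(y = rnorm(n), X)

T <- function(data) step(lm(y ~ ., data = data), trace = 0)

cv_r_squared <- function(data, T, K = 5) {
  folds <- sample(rep(seq_len(K), length.out = nrow(data)))
  pred <- numeric(nrow(data))
  for (k in seq_len(K)) {
    test <- folds == k
    fit <- T(data[!test, ])          # selection sees training rows only
    pred[test] <- predict(fit, newdata = data[test, ])
  }
  # Out-of-fold R-squared; the full-sample mean of y is used as the
  # baseline, a common simplification
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

shuffled <- data
shuffled$y <- sample(shuffled$y)     # the reshuffling trick
r2 <- cv_r_squared(shuffled, T)
r2
```

Running the selection on the full data once and only cross-validating the final `lm` would leak information into the test folds; the permuted-response check would then typically show a clearly positive R-squared.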


To answer the question in the title: AFAIK this is called a permutation test. If this is indeed what you are looking for, though, it does not work as described in the question.

To be (somewhat) concise: a permutation test does indeed work by shuffling one of the ‘columns’ and performing the test or calculation of interest. The trick, however, is to do this many times, reshuffling the data each time. For small datasets it may even be possible to enumerate all possible permutations. For large datasets you usually run as many permutations as your computer can handle, as long as that number is large enough to approximate the distribution of the statistic of interest.
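For the small-dataset case, here is a minimal sketch of an exact permutation test for a difference in two group means, assuming two simulated (hypothetical) groups of five observations. Because only the split into groups matters, enumerating the `choose(10, 5) = 252` splits with `combn()` covers every permutation.

```r
# Sketch: exact permutation test for a difference in group means.
# Group sizes, means and the seed are illustrative assumptions.
set.seed(3)

a <- rnorm(5, mean = 1)    # hypothetical group A
b <- rnorm(5, mean = 0)    # hypothetical group B
pooled <- c(a, b)
n_a <- length(a)

observed <- mean(a) - mean(b)

# Each column of combn() holds the indices assigned to group A
splits <- combn(length(pooled), n_a)
perm_diffs <- apply(splits, 2, function(idx) {
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided exact p-value: share of splits at least as extreme as the
# observed one (small tolerance guards against floating-point ties)
p_exact <- mean(abs(perm_diffs) >= abs(observed) - 1e-12)
p_exact
```

The observed split is itself one of the 252 columns, so `p_exact` can never be exactly zero.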

Finally, you compare the statistic computed on the original (unshuffled) data with this permutation distribution, for example by checking whether the observed mean difference between two groups is larger than 95% of the permuted differences. Simply put, the proportion of the permutation distribution that is at least as extreme as the observed value is the ‘p-value’ for your hypothesis test.
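When enumerating all permutations is infeasible, the same comparison can be done with random permutations. A minimal sketch, assuming two simulated (hypothetical) groups of 40 and `B = 9999` permutations; the helper `mean_diff` is illustrative. Adding 1 to the numerator and denominator, a common convention, counts the observed labelling as one of the permutations and keeps the p-value above zero.

```r
# Sketch: Monte Carlo permutation test for a difference in group means.
# Data, group sizes, B and the seed are illustrative assumptions.
set.seed(4)

values <- c(rnorm(40, mean = 0.5), rnorm(40, mean = 0))   # hypothetical data
group  <- rep(c("A", "B"), each = 40)

mean_diff <- function(values, group) {
  mean(values[group == "A"]) - mean(values[group == "B"])
}

observed <- mean_diff(values, group)

# Shuffle the group labels B times and recompute the statistic
B <- 9999
perm_diffs <- replicate(B, mean_diff(values, sample(group)))

# Two-sided p-value: share of permuted statistics at least as extreme
p_value <- (1 + sum(abs(perm_diffs) >= abs(observed))) / (B + 1)
p_value
```

At the 5% level one would reject the null of no group difference if `p_value < 0.05`.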

If this is wildly different from the p-value obtained on the original sample, I wouldn’t say there is something wrong with the test or statistic of interest; rather, I would suspect that your sample contains certain data points that strongly influence the test result. This might be bias (selection bias due to including some unusual cases, measurement error in specific cases, etc.), or it might be incorrect use of the test (e.g. violated assumptions).

See for further details

Moreover, see @amoeba’s answer to this question if you want to know more about how to combine permutation tests with variable selection.

Source: Link, Question Author: Michael M, Answer Author: IWS
