# Is it at all defensible to stratify a data set by the size of the residual and do a two-sample comparison?

This is something I’m seeing done as a sort of ad-hoc method, and it seems very fishy to me, but perhaps I am missing something. I’ve seen this done in multiple regression, but let’s just keep it simple:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.$$

Now take the residuals from the fitted model,

$$e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i), \qquad i = 1, \dots, n,$$

and stratify the sample based on the size of the residuals. For example, say the first sample is the bottom 90% of the residuals and the second sample is the top 10%, then proceed to do two-sample comparisons – I’ve seen this done both on the predictor in the model, $x$, and on variables not in the model. The informal logic is that points whose values lie far above what you’d expect under the model (i.e. points with a large residual) may be different in some way, and that difference is investigated in this way.
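To make the procedure concrete, here is a minimal R sketch of it; the sample size, the model coefficients, and the naive t-test at the end are my own illustrative choices, not part of any method I have seen written down:

```r
# Minimal sketch of the ad-hoc procedure: fit a simple linear model,
# stratify on the size of the residuals (bottom 90% vs. top 10%),
# and run a naive two-sample comparison on x.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)          # illustrative data; any (x, y) would do

res <- abs(residuals(lm(y ~ x)))
top <- res > quantile(res, 0.9)    # TRUE for the top 10% of residual sizes

# Naive two-sample t-test on the predictor; whether its standard error
# is even valid here is exactly what is in question.
t.test(x[top], x[!top])
```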

My thoughts on the subject are:

• If you see a 2-sample difference on a predictor in the model, then there are effects of the predictor not accounted for by the model in its current state (i.e. non-linear effects).
• If you see a 2-sample difference on a variable not in the model, then maybe it should’ve been in the model in the first place.

One thing I’ve found empirically (through simulations) is that, if you compare the mean of a predictor in the model, $x$, and stratify in this way to produce the two sample means $\overline{x}_{1}$ and $\overline{x}_{2}$, those means are positively correlated with each other. This makes sense, since both samples depend on the common statistics $\overline{y}$, $\overline{x}$, $\hat{\sigma}_{x}$, $\hat{\sigma}_{y}$, and $\hat{\rho}_{xy}$. The correlation increases as you move the cutoff down (i.e. as you lower the % used to divide the sample). So at the very least, if you’re going to do a two-sample comparison, the standard error in the denominator of the $t$-statistic needs to be adjusted to account for that correlation (although I haven’t derived an explicit formula for the covariance).
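The correlation is straightforward to estimate by simulation; here is a short sketch (the model, the sample size, and the 90% cutoff are arbitrary illustrative choices):

```r
# Estimate the correlation between the two stratified sample means
# across many replications of the same data-generating process.
set.seed(17)
n <- 100
sim.means <- replicate(2000, {
  x <- rnorm(n)
  y <- x + rnorm(n)
  res <- abs(residuals(lm(y ~ x)))
  hi <- res > quantile(res, 0.9)   # top 10% by residual size
  c(mean(x[!hi]), mean(x[hi]))     # (bottom-90% mean, top-10% mean)
})
cor(sim.means[1, ], sim.means[2, ])
```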

Anyhow, my basic question is: Is there any rationale for doing this? If so, in what situations could this be a useful thing to do? Clearly I don’t think there is but there may be something I’m not thinking of in the right way.

Comparing the means is too weak: instead, compare the distributions.
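One ready-made way to compare the distributions rather than the means is a two-sample Kolmogorov–Smirnov test on the stratified $x$ values. This sketch is only illustrative; because the strata are constructed from the same data, the nominal p-value should not be taken at face value:

```r
# Two-sample KS test comparing the distributions of x between the
# low-residual and high-residual strata (illustrative data).
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
res <- abs(residuals(lm(y ~ x)))
hi <- res > quantile(res, 2/3)     # top third by residual size
ks.test(x[hi], x[!hi])
```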

There is also a question concerning whether it is more desirable to compare the sizes of the residuals (as stated) or to compare the residuals themselves. Therefore, I evaluate both.

To be specific about what is meant, here is some R code to compare $(x,y)$ data (given in parallel arrays x and y) by regressing $y$ on $x$, dividing the residuals into three groups by cutting them below quantile $q_0$ and above quantile $q_1 > q_0$, and (by means of a qq plot) comparing the distributions of the $x$ values associated with the lowest and highest of those groups.

test <- function(y, x, q0, q1, abs0=abs, ...) {
  # Sizes of the residuals from the simple regression of y on x
  y.res <- abs0(residuals(lm(y~x)))
  # Cut the residual sizes into three groups: below q0, q0 to q1, above q1
  y.groups <- cut(y.res, quantile(y.res, c(0,q0,q1,1)))
  x.groups <- split(x, y.groups)
  # QQ plot comparing the x values in the lowest and highest groups
  xy <- qqplot(x.groups[[1]], x.groups[[3]], plot.it=FALSE)
  lines(xy, xlab="Low residual", ylab="High residual", ...)
}


The fifth argument to this function, abs0, by default uses the sizes (absolute values) of the residuals to form the groups. Later we can replace that by a function that uses the residuals themselves.

Residuals are used to detect many things: outliers, possible correlations with exogenous variables, goodness of fit, and homoscedasticity. Outliers, by their nature, should be few and isolated and thus are not going to play a meaningful role here. To keep this analysis simple, let’s explore the last two: goodness of fit (that is, linearity of the $x$-$y$ relationship) and homoscedasticity (that is, constancy of the size of the residuals). We can do this through simulation:

simulate <- function(n, beta0=0, beta1=1, beta2=0, sd=1, q0=1/3, q1=2/3, abs0=abs,
                     n.trials=99, ...) {
  # Fixed design centered at zero; beta2 controls the amount of nonlinearity
  x <- 1:n - (n+1)/2
  y <- beta0 + beta1 * x + beta2 * x^2 + rnorm(n, sd=sd)
  plot(x, y, ylab="y", cex=0.8, pch=19, ...)
  # Plot the residuals, highlighting the low (blue) and high (red) groups
  plot(x, res <- residuals(lm(y ~ x)), cex=0.8, col="Gray", ylab="", main="Residuals")
  res.abs <- abs0(res)
  r0 <- quantile(res.abs, q0); r1 <- quantile(res.abs, q1)
  points(x[res.abs < r0], res[res.abs < r0], col="Blue")
  points(x[res.abs > r1], res[res.abs > r1], col="Red")
  # Set up an empty QQ plot, then overlay one qq curve per trial
  plot(x, x, main="QQ Plot of X",
       xlab="Low residual", ylab="High residual",
       type="n")
  abline(0, 1, col="Red", lwd=2)
  temp <- replicate(n.trials, test(beta0 + beta1 * x + beta2 * x^2 + rnorm(n, sd=sd),
                                   x, q0=q0, q1=q1, abs0=abs0, lwd=1.25, lty=3, col="Gray"))
  # Finally, draw the qq curve for the dataset plotted above in black
  test(y, x, q0=q0, q1=q1, abs0=abs0, lwd=2, col="Black")
}


This code accepts arguments determining the linear model: its coefficients in $y \sim \beta_0 + \beta_1 x + \beta_2 x^2$, the standard deviations of the error terms sd, the quantiles $q_0$ and $q_1$, the size function abs0, and the number of independent trials in the simulation, n.trials. The first argument n is the amount of data to simulate in each trial. The function produces a set of plots (of the $(x,y)$ data, of their residuals, and qq plots from multiple trials) to help us understand how the proposed tests work for a given model (as determined by n, the betas, and sd). Examples of these plots appear below.

Let us now use these tools to explore some realistic combinations of nonlinearity and heteroscedasticity, using the absolute values of the residuals:

n <- 100
beta0 <- 1
beta1 <- -1/n
sigma <- 1/n

size <- function(x) abs(x)
set.seed(17)
par(mfcol=c(3,4))
simulate(n, beta0, beta1, 0, sigma*sqrt(n), abs0=size, main="Linear Homoscedastic")
simulate(n, beta0, beta1, 0, 0.5*sigma*(n:1), abs0=size, main="Linear Heteroscedastic")
simulate(n, beta0, beta1, 1/n^2, sigma*sqrt(n), abs0=size, main="Quadratic Homoscedastic")
simulate(n, beta0, beta1, 1/n^2, 5*sigma*sqrt(1:n), abs0=size, main="Quadratic Heteroscedastic")


The output is a set of plots. The top row shows one simulated dataset, the second row shows a scatterplot of its residuals against $x$ (color-coded by quantile: red for large values, blue for small values, gray for any intermediate values not further used), and the third row shows the qq plots for all trials, with the qq plot for the one simulated dataset shown in black. An individual qq plot compares the $x$ values associated with high residuals to the $x$ values associated with low residuals; after many trials, a gray envelope of likely qq plots emerges. We are interested in how, and how strongly, these envelopes vary with departures from the basic linear model: strong variation implies good discrimination.

The differences between the first column and the last three make it clear that this method is able to detect heteroscedasticity, but that it might not be so effective at identifying a moderate nonlinearity; indeed, it could easily confuse nonlinearity with heteroscedasticity. This is because the form of heteroscedasticity simulated here (which is common) is one where the expected sizes of the residuals trend with $x$. That trend is easy to detect. Quadratic nonlinearity, on the other hand, creates large residuals at both ends and in the middle of the range of $x$ values, which is hard to distinguish just by looking at the distributions of the affected $x$ values.

Let’s do the same thing, using exactly the same data, but analyzing the residuals themselves. To do this, the previous block of code was rerun after making this modification:

size <- function(x) x


This variation does not detect heteroscedasticity well: see how similar the qq plots are in the first two columns. However, it does a good job of detecting nonlinearity, because the residuals separate the $x$'s into a middle portion and an outer portion, which will be fairly different. As the rightmost column shows, however, heteroscedasticity can mask nonlinearities.

Perhaps combining both these techniques would work. These simulations (and variations of them, which the interested reader can run at leisure) demonstrate that these techniques are not without merit.

In general, however, one is far better served by examining the residuals in standard ways. For automated work, formal tests have been developed to detect the kinds of things we look for in residual plots. For instance, the Breusch-Pagan test regresses the squared residuals (rather than their absolute values) against $x$. The tests proposed in this question can be understood in the same spirit. However, by binning the data into just two groups, and thereby neglecting most of the bivariate information afforded by the $(x, y-\hat{y})$ pairs, we can expect the proposed tests to be less powerful than regression-based tests like the Breusch-Pagan.
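For reference, the studentized (Koenker) form of the Breusch-Pagan test is simple enough to sketch in base R. This hand-rolled version is only for illustration; in practice one would reach for bptest in the lmtest package:

```r
# Koenker's studentized Breusch-Pagan test: regress the squared
# residuals on x and refer n * R^2 to a chi-squared(1) distribution.
bp.test <- function(y, x) {
  e2 <- residuals(lm(y ~ x))^2
  r2 <- summary(lm(e2 ~ x))$r.squared
  stat <- length(y) * r2
  c(statistic = stat,
    p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}

set.seed(17)
x <- 1:100
y.hom <- x + rnorm(100)             # constant error size
y.het <- x + rnorm(100, sd = x/20)  # error size grows with x
bp.test(y.hom, x)
bp.test(y.het, x)
```

With the heteroscedastic data the statistic is large and the p-value small; with the homoscedastic data it behaves like a chi-squared(1) draw.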