This Wikipedia link lists a number of techniques for detecting heteroscedasticity in OLS residuals. I would like to learn which hands-on technique is most effective for detecting the regions affected by heteroscedasticity.
For example, here the central region of the OLS ‘Residuals vs Fitted’ plot seems to have higher variance than the sides of the plot (I am not entirely sure in fact, but let’s assume it’s the case for the sake of the question). To confirm, looking at the point labels in the QQ plot, we can see that they match the labels in the center of the residuals plot.
But how can we quantify which region of the residuals has significantly higher variance?
This problem has an exploratory feel to it. John Tukey describes many procedures for exploring heteroscedasticity in his classic, Exploratory Data Analysis (Addison-Wesley 1977). Perhaps the most directly useful is a variant of his “wandering schematic plot.” This slices one variable (such as the predicted value) into bins and uses m-letter summaries (generalizations of boxplots) to show the location, spread, and shape of the other variable for each bin. The m-letter statistics are further smoothed in order to emphasize overall patterns rather than chance deviations.
A quick version can be cooked up by exploiting the boxplot procedure in R. We illustrate with simulated, strongly heteroscedastic data:
```r
set.seed(17)
n <- 500
x <- rgamma(n, shape=6, scale=1/2)
e <- rnorm(length(x), sd=abs(sin(x)))
y <- x + e
```
Let’s obtain the predicted values and residuals from the OLS regression:
```r
fit <- lm(y ~ x)
res <- residuals(fit)
pred <- predict(fit)
```
Here, then, is the wandering schematic plot, using equal-count bins for the predicted values. I use lowess for a quick-and-dirty smooth.
```r
n.bins <- 17
# include.lowest=TRUE keeps the minimum predicted value from
# falling outside the lowest quantile bin
bins <- cut(pred, quantile(pred, probs = seq(0, 1, 1/n.bins)),
            include.lowest=TRUE)
b <- boxplot(res ~ bins, boxwex=1/2, main="Residuals vs. Predicted",
             xlab="Predicted", ylab="Residual")
colors <- hsv(seq(2/6, 1, 1/6))
temp <- sapply(1:5, function(i)
  lines(lowess(1:n.bins, b$stats[i,], f=.25), col=colors[i], lwd=2))
```
The blue curve smooths the medians. Its horizontal tendency indicates the regression is generally a good fit. The other curves smooth the box ends (quartiles) and fences (which are typically extreme values). Their strong convergence and subsequent separation testify to the heteroscedasticity, and help us characterize and quantify it.
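To attach numbers to what the curves show, one simple follow-up (a sketch of my own, not part of Tukey's procedure) is to compute a robust spread statistic, such as the interquartile range, of the residuals within each bin and flag the bins whose spread stands out; the 1.5 cutoff below is an arbitrary illustrative threshold. The block recaps the simulation above so it runs on its own:

```r
# Recap of the simulation, fit, and binning from above
set.seed(17)
n <- 500
x <- rgamma(n, shape=6, scale=1/2)
e <- rnorm(length(x), sd=abs(sin(x)))
y <- x + e
fit <- lm(y ~ x)
res <- residuals(fit)
pred <- predict(fit)

n.bins <- 17
bins <- cut(pred, quantile(pred, probs = seq(0, 1, 1/n.bins)),
            include.lowest=TRUE)

# Robust spread (IQR) of the residuals within each bin
spread <- tapply(res, bins, IQR)
print(round(spread, 2))

# Bins whose spread exceeds 1.5 times the median spread (an
# arbitrary cutoff) are candidates for the high-variance region
flagged <- names(spread)[spread > 1.5 * median(spread)]
print(flagged)
```

Which bins are flagged depends, of course, on the cutoff; the point is only that the per-bin spreads give a concrete numerical answer to "where is the variance high?"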
(Notice the nonlinear scale on the horizontal axis, reflecting the distribution of predicted values. With a bit more work this axis could be linearized, which sometimes is useful.)
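For what it's worth, here is one way that linearization might be done (a sketch under my own assumptions, not from Tukey): place each bin at the median of its predicted values and draw the smoothed curves against that linear axis. Lacking the boxplot object, I substitute the 5th and 95th percentiles for the fences, which is an approximation:

```r
# Recap of the simulation, fit, and binning from above
set.seed(17)
n <- 500
x <- rgamma(n, shape=6, scale=1/2)
y <- x + rnorm(n, sd=abs(sin(x)))
fit <- lm(y ~ x)
res <- residuals(fit)
pred <- predict(fit)

n.bins <- 17
bins <- cut(pred, quantile(pred, probs = seq(0, 1, 1/n.bins)),
            include.lowest=TRUE)

# Median predicted value in each bin gives a linear x-position
centers <- tapply(pred, bins, median)

# Five summary curves per bin; the 5% and 95% quantiles stand in
# (approximately) for the whisker ends of the boxplots
stats <- sapply(split(res, bins), quantile,
                probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

plot(range(centers), range(stats), type="n",
     main="Residuals vs. Predicted (linearized axis)",
     xlab="Predicted", ylab="Residual")
colors <- hsv(seq(2/6, 1, 1/6))
for (i in 1:5)
  lines(lowess(centers, stats[i, ], f=.25), col=colors[i], lwd=2)
```

Because the bins are equal-count, they crowd together where predicted values are dense; the linear axis spreads them out and can make the location of the variance change easier to read off.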