Why is high positive kurtosis problematic for hypothesis tests?

I’ve heard (sorry cannot provide a link to a text, something I have been told) that a high positive kurtosis of residuals can be problematic for accurate hypothesis tests and confidence intervals (and therefore problems with statistical inference). Is this true and, if so, why? Would a high positive kurtosis of residuals not indicate that the majority of the residuals are near the residual mean of 0 and therefore less large residuals are present? (If you have an answer, please try to give an answer with not much indepth mathematics as I’m not highly mathematically inclined).

heard […] that a high positive kurtosis of residuals can be problematic for accurate hypothesis tests and confidence intervals (and therefore problems with statistical inference). Is this true and, if so, why?

For some kinds of hypothesis test, it’s true.

Would a high positive kurtosis of residuals not indicate that the majority of the residuals are near the residual mean of 0 and therefore less large residuals are present?

No.

It looks like you’re conflating the concept of variance with that of kurtosis. If the variance were smaller, then you would get both more small residuals and fewer large residuals. Instead, imagine we hold the standard deviation constant while we change the kurtosis (so we’re definitely talking about changes to kurtosis rather than to variance).

Compare different variances (but the same kurtosis):

with different kurtosis but the same variance:

(images from this post)

A high kurtosis is in many cases associated with more small deviations from the mean$$^\ddagger$$ (more small residuals than you’d find with a normal distribution), but to keep the standard deviation at the same value, we must also have more big residuals (because having more small residuals alone would make the typical distance from the mean smaller). To get more of both the big residuals and small residuals, you will have fewer “typical sized” residuals: those about one standard deviation away from the mean.

$$\ddagger$$ It depends on how you define “smallness”; you can’t simply add lots of large residuals and hold variance constant, you need something to compensate for it. But for some given measure of “small” you can find ways to increase the kurtosis without increasing that particular measure. (For example, higher kurtosis doesn’t automatically imply a higher peak as such.)

A higher kurtosis tends to go with more large residuals, even when you hold the variance constant.
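To put rough numbers on "more small, more large, fewer typical", here is a minimal sketch comparing a standard normal with a Laplace (double exponential) distribution rescaled to the same variance of 1. The Laplace is my own choice of illustration, not one of the distributions in the plots above; its kurtosis is 6 (excess 3). The band cut-offs of 0.5 and 3 standard deviations are also arbitrary assumptions.

```python
import math

def normal_within(a):
    """P(|X| < a) for a standard normal (variance 1)."""
    return math.erf(a / math.sqrt(2))

def laplace_within(a):
    """P(|X| < a) for a Laplace distribution rescaled to variance 1."""
    b = 1 / math.sqrt(2)  # Laplace variance is 2*b**2, so this b gives variance 1
    return 1 - math.exp(-a / b)

def band_probabilities(within):
    """Split probability into small (<0.5 sd), typical, and large (>3 sd) deviations."""
    small = within(0.5)
    large = 1 - within(3.0)
    typical = 1 - small - large
    return small, typical, large

normal_bands = band_probabilities(normal_within)
laplace_bands = band_probabilities(laplace_within)

for name, bands in [("normal", normal_bands), ("laplace", laplace_bands)]:
    print(name, ["%.4f" % p for p in bands])
```

With the same variance, the higher-kurtosis distribution puts more probability in both the "small" and the "large" bands and less in the "typical" band in between.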

[Further, in some cases, the concentration of small residuals may actually lead to more of a problem than the additional fraction of the largest residuals — depending on what things you’re looking at.]

Anyway, let’s look at an example. Consider a one-sample t-test and a sample size of 10.

If we reject the null hypothesis when the absolute value of the t-statistic is bigger than 2.262, then when the observations are independent, identically distributed from a normal distribution, and the hypothesized mean is the true population mean, we’ll reject the null hypothesis 5% of the time.

Consider a particular distribution with substantially higher kurtosis than the normal: 75% of our population have their values drawn from a normal distribution and the remaining 25% have their values drawn from a normal distribution with standard deviation 50 times as large.

If I calculated correctly, this corresponds to a kurtosis of 12 (an excess kurtosis of 9). The resulting distribution is much more peaked than the normal and has heavy tails. The density is compared with the normal density below — you can see the higher peak, but you can’t really see the heavier tail in the left image, so I also plotted the logarithm of the densities, which stretches out the lower part of the image and compresses the top, making it easier to see both the peak and the tails.
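If you want to check that kurtosis figure, here is a small sketch. It assumes both components have mean 0 and that the narrow component has standard deviation 1 (the unit only sets the scale, so it doesn't affect the kurtosis), and it uses the fact that a zero-mean normal with standard deviation s has fourth moment 3s⁴.

```python
# Mixture from the text: 75% N(0, 1) and 25% N(0, 50**2).
weights = [0.75, 0.25]
sds = [1.0, 50.0]

# For a zero-mean mixture, moments are weighted averages of component moments.
variance = sum(w * s**2 for w, s in zip(weights, sds))
fourth_moment = sum(w * 3 * s**4 for w, s in zip(weights, sds))

kurtosis = fourth_moment / variance**2
excess_kurtosis = kurtosis - 3

print(round(kurtosis, 2), round(excess_kurtosis, 2))
```

This comes out very close to 12 (and 9 for the excess kurtosis), in line with the figures quoted above.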

The actual significance level for this distribution if you carry out a “5%” one-sample t-test with $$n=10$$ is below 0.9%. This is pretty dramatic, and pulls down the power curve quite substantially.
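A rough way to check a rejection rate like this yourself is simulation. The sketch below is not how the figure above was obtained (I don't claim that); it just draws many samples of size 10 with the null hypothesis true and counts how often |t| exceeds 2.262. The function names, the seed, and the number of simulated samples are arbitrary choices of mine, and the estimate will wobble a little with those choices.

```python
import math
import random

CRITICAL_T = 2.262  # two-sided 5% critical value for 9 degrees of freedom (from the text)

def draw_normal(rng):
    return rng.gauss(0.0, 1.0)

def draw_mixture(rng):
    # 75% from N(0, 1), 25% from a normal with sd 50 times as large
    sd = 50.0 if rng.random() < 0.25 else 1.0
    return rng.gauss(0.0, sd)

def rejection_rate(draw, n=10, n_sims=20000, seed=1):
    """Estimate how often a one-sample t-test of H0: mu = 0 rejects when H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        sample = [draw(rng) for _ in range(n)]
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        t = mean / math.sqrt(var / n)
        if abs(t) > CRITICAL_T:
            rejections += 1
    return rejections / n_sims

normal_rate = rejection_rate(draw_normal)
mixture_rate = rejection_rate(draw_mixture)
print("normal:", normal_rate, "mixture:", mixture_rate)
```

The normal case should land near 5%, while the mixture should come out far lower.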

(You’ll also see a substantive effect on the coverage of confidence intervals.)

Note that a different distribution with the same kurtosis as that will have a different impact on the significance level.

So why does the rejection rate go down? It’s because the heavier tail leads to a few large outliers, which have a slightly larger impact on the standard deviation than they do on the mean; this impacts the t-statistic because it leads to more t-values between -1 and 1, in the process reducing the proportion of values in the critical region.

If you take a sample that looks pretty consistent with having come from a normal distribution whose mean is just far enough above the hypothesized mean that it’s significant, and then you take the observation furthest above the mean and pull it even further away (that is, make the mean even larger than under $$H_0$$), you actually make the t-statistic smaller.

Let me show you. Here’s a sample of size 10:

    1.13 1.68 2.02 2.30 2.56 2.80 3.06 3.34 3.68 4.23


Imagine we want to test it against $$H_0: \mu=2$$ (a one-sample t-test). It turns out that the sample mean here is 2.68 and the sample standard deviation is 0.9424. You get a t-statistic of 2.282, which is just in the rejection region for a 5% test (p-value of 0.0484).

Now make that largest value 50:

    1.13 1.68 2.02 2.30 2.56 2.80 3.06 3.34 3.68 50


Clearly we pull the mean up, so it should indicate a difference even more than it did before, right? Well, no, it doesn’t. The t-statistic goes down. It is now 1.106, and the p-value is quite large (close to 30%). What happened? Well, we did pull the mean up (to 7.257), but the standard deviation shot up to over 15.
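If you’d like to reproduce the two t-statistics, here is a short sketch using only Python’s standard library. The helper name `one_sample_t` is mine. Because it works from the rounded values printed above, the last digit may differ slightly from the figures quoted.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """One-sample t-statistic: (mean - mu0) / (sd / sqrt(n))."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error

original = [1.13, 1.68, 2.02, 2.30, 2.56, 2.80, 3.06, 3.34, 3.68, 4.23]
with_outlier = original[:-1] + [50.0]  # replace the largest value with 50

t_original = one_sample_t(original, 2.0)
t_outlier = one_sample_t(with_outlier, 2.0)

print("original:     mean %.3f  sd %.4f  t %.3f"
      % (statistics.mean(original), statistics.stdev(original), t_original))
print("with outlier: mean %.3f  sd %.4f  t %.3f"
      % (statistics.mean(with_outlier), statistics.stdev(with_outlier), t_outlier))
```

The first t-statistic is a bit above the 2.262 cut-off and the second is only about 1.1, even though the second sample mean is much further from 2.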

Standard deviations are a bit more sensitive to outliers than means are — when you put in an outlier, you tend to push the one-sample t-statistic toward 1 or -1.

If there’s a chance of several outliers, much the same happens, only they can sometimes be on opposite sides (in which case the standard deviation is even more inflated while the impact on the mean is reduced compared to one outlier), so the t-statistic tends to move closer to 0.
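As a quick illustration of the opposite-sides case, here is a sketch that takes the same sample, tested against a hypothesized mean of 2, and compares one outlier (largest value set to 50) with two outliers on opposite sides (largest set to 50 and smallest set to -50). The value -50 is a made-up choice of mine purely for illustration.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """One-sample t-statistic: (mean - mu0) / (sd / sqrt(n))."""
    n = len(sample)
    return (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))

middle = [1.68, 2.02, 2.30, 2.56, 2.80, 3.06, 3.34, 3.68]

one_outlier = [1.13] + middle + [50.0]    # outlier on the high side only
two_outliers = [-50.0] + middle + [50.0]  # hypothetical outliers on both sides

t_one = one_sample_t(one_outlier, 2.0)
t_two = one_sample_t(two_outliers, 2.0)

print("one outlier:  sd %.2f  t %.3f" % (statistics.stdev(one_outlier), t_one))
print("two outliers: sd %.2f  t %.3f" % (statistics.stdev(two_outliers), t_two))
```

With two opposing outliers the standard deviation is larger still while the mean barely moves from where the bulk of the data sits, so the t-statistic ends up very close to 0.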

Similar stuff goes on with a number of other common tests that assume normality — higher kurtosis tends to be associated with heavier tails, which means more outliers, which means that standard deviations get inflated relative to means and so differences you want to pick up tend to get “swamped” by the impact of the outliers on the test. That is, low power.