In comparison with a standard gaussian random variable, does a distribution with heavy tails have higher kurtosis?

Under a standard Gaussian distribution (mean 0 and variance 1), the kurtosis is 3. Is the kurtosis of a heavy-tailed distribution normally larger or smaller than this?


I. A direct answer to the OP

Answer: It depends on what you mean by “heavy tails.” By some definitions of “heavy tails,” the answer is “no,” as pointed out here and elsewhere.

Why do we care about heavy tails? Because we care about outliers. (Substitute the phrase “rare, extreme observation” if you have a problem with the word “outlier”; I will use the term “outlier” throughout for brevity.) Outliers are interesting from several points of view: In finance, outlier returns cause much more money to change hands than typical returns (see Taleb’s discussion of black swans). In hydrology, the outlier flood will cause enormous damage and needs to be planned for. In statistical process control, outliers indicate “out of control” conditions that warrant immediate investigation and rectification. In regression analysis, outliers have enormous effects on the least squares fit. In statistical inference, the degree to which distributions produce outliers has an enormous effect on standard t tests for mean values. Similarly, the degree to which a distribution produces outliers has an enormous effect on the accuracy of the usual estimate of the variance of that distribution.

So for various reasons, there is a great interest in outliers in data, and in the degree to which a distribution produces outliers. Notions of heavy-tailedness were therefore developed to characterize outlier-prone processes and data.

Unfortunately, the commonly-used definition of “heavy tails” involving exponential bounds and asymptotes is too limited in its characterization of outliers and outlier-prone data generating processes: It requires tails extending to infinity, so it rules out bounded distributions that produce outliers. Further, the standard definition does not even apply to a data set, since all empirical distributions are necessarily bounded.

Here is an alternative class of definitions of “heavy-tailedness,” which I will call “tail-leverage(m)” to avoid confusion with existing definitions of heavy-tailedness, that addresses this concern.

Definition: Assume absolute moments up to order $m > 2$ exist for random variables $X$ and $Y$. Let $U = |(X - \mu_X)/\sigma_X|^m$ and let $V = |(Y - \mu_Y)/\sigma_Y|^m$. If $E(V) > E(U)$, then $Y$ is said to have greater tail-leverage($m$) than $X$.

The mathematical rationale for the definition is as follows: Suppose $E(V) > E(U)$, and let $\mu_U = E(U)$. Draw the pdf (or pmf, in the discrete case, or in the case of an actual data set) of $V$, which is $p_V(v)$. Place a fulcrum at $\mu_U$ on the horizontal axis. Because of the well-known fact that a distribution balances at its mean, the distribution $p_V(v)$ “falls to the right” of the fulcrum at $\mu_U$. Now, what causes it to “fall to the right”? Is it the concentration of mass of $V$ below 1, corresponding to the observations of $Y$ that are within a standard deviation of the mean? Is it the shape of the distribution of $Y$ corresponding to observations that are within a standard deviation of the mean? No, these aspects are to the left of the fulcrum, not to the right. It is the extremes of the distribution (or data) of $Y$, in one or both tails, that produce high positive values of $V$, which cause the “falling to the right.”

BTW, the term “leverage” should now be clear, given the physical representation involving the fulcrum. It is worth noting that, given the characterization of the distribution “falling to the right,” the “tail leverage” measures can legitimately be called measures of “tail weight.” I chose not to do that because the “leverage” term is more precise.

Much has been made of the fact that kurtosis does not correspond directly to the standard definition of “heavy tails.” Of course it doesn’t. Neither does it correspond to any but one of the infinitely many definitions of “tail leverage” I just gave. If you restrict your attention to the case where m=4, then an answer to the OP’s question is as follows:

Greater tail leverage (using m=4 in the definition) does indeed imply greater kurtosis (and conversely). They are identical.
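As a minimal sketch of the definition applied to data, the snippet below computes tail-leverage(m) from the empirical distribution (divide-by-n moments) and checks that the m=4 case coincides with the uncorrected kurtosis. The function names and the two small data sets are illustrative assumptions, not part of the original answer.

```python
from statistics import fmean, pstdev

def tail_leverage(data, m):
    """E|Z|^m under the empirical distribution (population sd, no bias correction)."""
    mu = fmean(data)
    sigma = pstdev(data)
    return fmean(abs((x - mu) / sigma) ** m for x in data)

def kurtosis(data):
    """Uncorrected kurtosis: fourth central moment over squared variance."""
    mu = fmean(data)
    n = len(data)
    m2 = sum((x - mu) ** 2 for x in data) / n
    m4 = sum((x - mu) ** 4 for x in data) / n
    return m4 / m2 ** 2

# Hypothetical data: y is x with its two extremes pushed further out.
x = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2]
y = [-6, -1, -1, 0, 0, 0, 0, 1, 1, 6]

for name, data in (("x", x), ("y", y)):
    print(name, tail_leverage(data, 4), kurtosis(data), tail_leverage(data, 6))
```

Here y has greater tail-leverage(4) than x, and for each data set the m=4 value and the kurtosis agree up to floating-point rounding.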

Incidentally, the “leverage” definition applies equally to data as it does to distributions: When you apply the kurtosis formula to the empirical distribution, it gives you the estimate of kurtosis without all the so-called “bias corrections.” (This estimate has been compared to others and is reasonable, often better in terms of accuracy; see “Comparing Measures of Sample Skewness and Kurtosis,” D. N. Joanes and C. A. Gill, Journal of the Royal Statistical Society. Series D (The Statistician) Vol. 47, No. 1 (1998), pp. 183-189.)

My stated leverage definition also resolves many of the various comments and answers given in response to the OP: Some beta distributions can be more greatly tail-leveraged (even if “thin-tailed” by other measures) than the normal distribution. This implies a greater outlier potential of such distributions than the normal, as described above regarding leverage and the fulcrum, despite the normal distribution having infinite tails and the beta being bounded. Further, uniforms mixed with classical “heavy-tailed” distributions are still “heavy-tailed,” but can have less tail leverage than the normal distribution, provided the mixing probability on the “heavy tailed” distribution is sufficiently low so that the extremes are very uncommon, and assuming finite moments.
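The beta claim can be checked with the standard closed-form expression for the kurtosis of a Beta(a, b) distribution. The sketch below is only an illustration: the function name and the particular parameter choices are assumptions picked for the example.

```python
def beta_kurtosis(a, b):
    """Kurtosis (not excess) of Beta(a, b), from the standard closed form."""
    num = 6 * ((a - b) ** 2 * (a + b + 1) - a * b * (a + b + 2))
    den = a * b * (a + b + 2) * (a + b + 3)
    return 3 + num / den

# Beta(1, 1) is the uniform distribution; Beta(0.1, 10) is bounded but very skewed.
for a, b in ((1, 1), (2, 2), (0.1, 10)):
    print(f"Beta({a}, {b}): kurtosis = {beta_kurtosis(a, b):.3f}")
```

Beta(1, 1) and Beta(2, 2) come out below the normal value of 3, while Beta(0.1, 10) comes out far above it, despite being bounded on [0, 1].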

Tail leverage is simply a measure of the extremes (or outliers). It differs from the classic definition of heavy-tailedness, even though it is arguably a viable competitor. It is not perfect; a notable flaw is that it requires finite moments, so quantile-based versions would be useful as well. Such alternative definitions are needed because the classic definition of “heavy tails” is far too limited to characterize the universe of outlier-prone data-generating processes and their resulting data.

II. My paper in The American Statistician

My purpose in writing the paper “Kurtosis as Peakedness, 1905-2014: R.I.P.” was to help people answer the question, “What does higher (or lower) kurtosis tell me about my distribution (or data)?” I suspected that the common interpretations (still seen, by the way), “higher kurtosis implies more peaked, lower kurtosis implies more flat,” were wrong, but could not quite put my finger on the reason. I even wondered whether they might have an element of truth, given that Pearson said it and, even more compelling, that R.A. Fisher repeated it in all revisions of his famous book. However, I was not able to connect any math to the statement that higher (lower) kurtosis implied greater peakedness (flatness). All the inequalities went in the wrong direction.

Then I hit on the main theorem of my paper. Contrary to what has been stated or implied here and elsewhere, my article was not an “opinion” piece; rather, it was a discussion of three mathematical theorems. Yes, The American Statistician (TAS) does often require mathematical proofs. I would not have been able to publish the paper without them. The following three theorems were proven in my paper, although only the second was listed formally as a “Theorem.”

Main Theorem: Let $Z_X = (X - \mu_X)/\sigma_X$ and let $\kappa(X) = E(Z_X^4)$ denote the kurtosis of $X$. Then for any distribution (discrete, continuous or mixed, which includes actual data via their discrete empirical distribution), $E\{Z_X^4 I(|Z_X| > 1)\} \le \kappa(X) \le E\{Z_X^4 I(|Z_X| > 1)\} + 1$.

This is a rather trivial theorem to prove but has major consequences: It states that the shape of the distribution within a standard deviation of the mean (which ordinarily would be where the “peak” is thought to be located) contributes very little to the kurtosis. Instead, the theorem implies that for all data and distributions, kurtosis must lie within $\pm 0.5$ of $E\{Z_X^4 I(|Z_X| > 1)\} + 0.5$.
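Because the theorem covers empirical distributions, it can be checked directly on data. The following sketch computes the kurtosis and the part of it contributed by observations more than one standard deviation from the mean. The three data sets are made-up examples chosen only to vary the shape.

```python
from statistics import fmean, pstdev

def kurtosis_and_tail_part(data):
    """Return (kurtosis, E{Z^4 I(|Z| > 1)}) for the empirical distribution of data."""
    mu, sigma = fmean(data), pstdev(data)
    z = [(x - mu) / sigma for x in data]
    kurt = fmean(v ** 4 for v in z)
    tail = fmean(v ** 4 if abs(v) > 1 else 0.0 for v in z)
    return kurt, tail

# Hypothetical data sets: flat, spiked at the mean, and right-skewed.
samples = {
    "flat": list(range(-5, 6)),
    "spiked": [0] * 20 + [-3, 3],
    "skewed": [0, 0, 0, 1, 1, 2, 3, 5, 8, 13, 40],
}

for name, data in samples.items():
    kurt, tail = kurtosis_and_tail_part(data)
    print(f"{name}: tail part = {tail:.3f}, kurtosis = {kurt:.3f}, gap = {kurt - tail:.3f}")
```

In every case the gap between the kurtosis and the tail part lies between 0 and 1, as the theorem requires.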

A very nice visual image of this theorem by user “kjetil b Halvorsen” is given at; see my comment that follows as well.


The bound is sharpened in the Appendix of my TAS paper:

Refined Theorem: Assume $X$ is continuous and that the density of $Z_X^2$ is decreasing on $[0,1]$. Then the “+1” of the main theorem can be sharpened to “+0.5”.

This simply amplifies the point of the main theorem that kurtosis is mostly determined by the tails.

More recently, @sextus-empiricus was able to reduce the “+0.5” bound to “+1/3”.

A third theorem proven in my TAS paper states that large kurtosis is mostly determined by (potential) data that are b standard deviations away from the mean, for arbitrary b.

Theorem 3: Consider a sequence of random variables $X_i$, $i = 1, 2, \dots$, for which $\kappa(X_i) \to \infty$. Then $E\{Z_i^4 I(|Z_i| > b)\}/\kappa(X_i) \to 1$, for each $b > 0$.

The third theorem states that high kurtosis is mostly determined by the most extreme outliers; i.e., those observations that are b or more standard deviations from the mean.
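To make the limit concrete, here is a small sketch using an assumed four-point family (not taken from the paper): mass 1 - p split equally between -1 and 1, and mass p split equally between -K and K, with p = 1/K^3. The kurtosis grows without bound as K increases, and the share of it coming from points more than b = 2 standard deviations out can be computed exactly.

```python
def kurtosis_and_tail_share(points, probs, b):
    """Exact kurtosis of a discrete distribution, and the fraction of it
    contributed by points more than b standard deviations from the mean."""
    mu = sum(p * x for x, p in zip(points, probs))
    sd = sum(p * (x - mu) ** 2 for x, p in zip(points, probs)) ** 0.5
    z = [(x - mu) / sd for x in points]
    kurt = sum(p * v ** 4 for v, p in zip(z, probs))
    tail = sum(p * v ** 4 for v, p in zip(z, probs) if abs(v) > b)
    return kurt, tail / kurt

results = []
for K in (10, 100, 1000):
    p = K ** -3  # outlier probability shrinks as the outliers move out
    points = [-K, -1, 1, K]
    probs = [p / 2, (1 - p) / 2, (1 - p) / 2, p / 2]
    kurt, share = kurtosis_and_tail_share(points, probs, b=2)
    results.append((kurt, share))
    print(f"K = {K}: kurtosis = {kurt:.2f}, tail share = {share:.4f}")
```

As K grows, the kurtosis increases and the tail share moves toward 1, which is the behaviour the theorem describes.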

These are mathematical theorems, so there can be no argument with them. Supposed “counterexamples” given in this thread and in other online sources are not counterexamples; after all, a theorem is a theorem, not an opinion.

So what of one suggested “counterexample,” where spiking the data with many values at the mean (which thereby increases “peakedness”) causes greater kurtosis? Actually, that example just makes the point of my theorems: When spiking the data in this way, the variance is reduced, thus the observations in the tails are more extreme, in terms of number of standard deviations from the mean. And it is observations with large standard deviation from the mean, according to the theorems in my TAS paper, that cause high kurtosis. It’s not the peakedness. Or to put it another way, the reason that the spike increases kurtosis is not because of the spike itself, it is because the spike causes a reduction in the standard deviation, which makes the tails more standard deviations from the mean (i.e., more extreme), which in turn increases the kurtosis.
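The mechanism can be seen on a toy data set. In this sketch (the data values are made up for illustration), adding many observations at the mean leaves the extreme values unchanged but shrinks the standard deviation, so the same extremes sit more standard deviations out and the kurtosis rises.

```python
from statistics import fmean, pstdev

def summary(data):
    """Return (sd, largest |z|, kurtosis) for the empirical distribution of data."""
    mu, sigma = fmean(data), pstdev(data)
    z = [(x - mu) / sigma for x in data]
    return sigma, max(abs(v) for v in z), fmean(v ** 4 for v in z)

base = [-3, -2, -1, -1, 0, 0, 1, 1, 2, 3]
spiked = base + [0] * 40  # spike the data with 40 extra values at the mean

sd_base, maxz_base, kurt_base = summary(base)
sd_spiked, maxz_spiked, kurt_spiked = summary(spiked)

print(f"base:   sd = {sd_base:.3f}, max |z| = {maxz_base:.3f}, kurtosis = {kurt_base:.3f}")
print(f"spiked: sd = {sd_spiked:.3f}, max |z| = {maxz_spiked:.3f}, kurtosis = {kurt_spiked:.3f}")
```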

It simply cannot be stated that higher kurtosis implies greater peakedness, because you can have a distribution that is perfectly flat over an arbitrarily high percentage of the data (pick 99.99% for concreteness) with infinite kurtosis. (Just mix a uniform with a Cauchy suitably; there are some minor but trivial and unimportant technical details regarding how to make the peak absolutely flat.) By the same construction, high kurtosis can be associated with any shape whatsoever for 99.99% of the central distribution – U-shaped, flat, triangular, multi-modal, etc.
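A finite-moment stand-in for this construction can be computed exactly. Note the substitution: instead of a Cauchy component, the sketch below mixes Uniform(-1, 1) with point masses at ±K, so the kurtosis is large but finite (it approaches 1/eps as K grows) rather than infinite. The function name and parameter values are assumptions for illustration.

```python
def flat_center_kurtosis(K, eps=1e-4):
    """Kurtosis of a mixture: Uniform(-1, 1) with probability 1 - eps,
    and point masses at -K and +K with probability eps/2 each.
    The mean is 0 by symmetry, so raw moments equal central moments."""
    var = (1 - eps) / 3 + eps * K ** 2  # Uniform(-1, 1) has variance 1/3
    m4 = (1 - eps) / 5 + eps * K ** 4   # and fourth moment 1/5
    return m4 / var ** 2

# The density is perfectly flat over 99.99% of the probability in every case.
for K in (10, 100, 1000):
    print(f"K = {K}: kurtosis = {flat_center_kurtosis(K):.1f}")
```

The central 99.99% of the distribution is identical and flat in all three cases, yet the kurtosis ranges from roughly 10 to nearly 10,000 depending only on where the remaining 0.01% sits.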

There is also a suggestion in this thread that the center of the distribution is important, because throwing out the central data of the Cauchy example in my TAS paper makes the data have low kurtosis. But this is also due to outliers and extremes: In throwing out the central portion, one increases the variance so that the extremes are no longer extreme (in terms of Z values), hence the kurtosis is low.

Any supposed “counterexample” actually obeys my theorems. Theorems have no counterexamples; otherwise, they would not be theorems.

A more interesting exercise than “spiking” or “deleting the middle” is this: Take the distribution of a random variable X (discrete or continuous, so it includes the case of actual data), and replace the mass/density within one standard deviation of the mean arbitrarily, but keep the mean and standard deviation of the resulting distribution the same as that of X.

Q: How much change can you make to the kurtosis statistic over all such possible replacements?

A: The difference between the maximum and minimum kurtosis values over all such replacements is 0.25.

The above question and its answer comprise yet another theorem. Anyone want to publish it? I have its proof written down (it’s quite elegant, as well as constructive, identifying the max and min distributions explicitly), but I lack the incentive to submit it as I am now retired. I have also calculated the actual max differences for various distributions of X; for example, if X is normal, then the difference between the largest and smallest kurtosis over all replacements of the central portion is 0.141. Hardly a large effect of the center on the kurtosis statistic!
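The figures 0.25 and 0.141 can be reproduced with a short calculation. What follows is a sketch under stated assumptions and is not claimed to be the author's proof: assume X is symmetric, let q be the probability within one standard deviation of the mean, and let c be the central part of the variance, E{Z² I(|Z| ≤ 1)}. Keeping the tails, mean and standard deviation fixed pins down q and c. The central contribution to kurtosis, E{Z⁴ I(|Z| ≤ 1)}, is then at least c²/q (Jensen's inequality) and at most c (since z⁴ ≤ z² on [-1, 1]).

```python
from math import erf, exp, pi, sqrt

def center_gap(q, c):
    """Max minus min of E{Z^4 I(|Z| <= 1)} given central mass q and
    central second moment c (symmetric case; an assumption of this sketch)."""
    lo = c * c / q  # all central mass placed at |z| = sqrt(c / q)
    hi = c          # central mass placed only at z = 0 and |z| = 1
    return hi - lo

# Standard normal: q = P(|Z| <= 1), c = q - 2 * phi(1) by integration by parts.
q = erf(1 / sqrt(2))
phi1 = exp(-0.5) / sqrt(2 * pi)
c = q - 2 * phi1
gap_normal = center_gap(q, c)

# The gap c - c^2/q peaks at c = q/2 with value q/4, so it approaches 0.25 as q -> 1.
gap_bound = max(center_gap(1.0, k / 1000) for k in range(1001))

print(f"normal: q = {q:.4f}, c = {c:.4f}, gap = {gap_normal:.4f}")
print(f"limiting bound = {gap_bound:.2f}")
```

Under these assumptions the normal case gives a gap of about 0.141, and the gap can never exceed 0.25, matching the values quoted above.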

By contrast, if you keep the center fixed but replace the tails, keeping the mean and standard deviation constant, you can make the kurtosis infinitely large. Thus, the effect on kurtosis of manipulating the center while keeping the tails constant is at most 0.25, whereas the effect on kurtosis of manipulating the tails while keeping the center constant is unbounded.

So, while yes, I agree that spiking a distribution at the mean does increase the kurtosis, I do not find this helpful to answer the question, “What does higher kurtosis tell me about my distribution?” There is a difference between “A implies B” and “B implies A.” Just because all bears are mammals does not imply that all mammals are bears. Just because spiking a distribution increases kurtosis does not imply that increasing kurtosis implies a spike; see the uniform/Cauchy example alluded to above in my answer.

It is precisely this faulty logic that caused Pearson to make the peakedness/flatness interpretations in the first place. He saw a family of distributions for which the peakedness/flatness interpretations held, and wrongly generalized. In other words, he observed that a bear is a mammal, and then wrongly inferred that a mammal is a bear. Fisher followed suit forever, and here we are.

A case in point: People see this picture of “standard symmetric PDFs” (on Wikipedia) and think it generalizes to the “flatness/peakedness” conclusions.

[Image: Wikipedia plot of standard symmetric PDFs]

Yes, in that family of distributions, the flat distribution has the lower kurtosis and the peaked one has the higher kurtosis. But it is an error to conclude from that picture that high kurtosis implies peaked and low kurtosis implies flat. There are other examples of low kurtosis (less than the normal distribution) distributions that are infinitely peaked, and there are examples of infinite kurtosis distributions that are perfectly flat over an arbitrarily large proportion of the observable data.

The bear/mammal conundrum also arises in the Finucan conditions, which state (oversimplified) that if tail probability and peak probability increase (losing some mass in between to maintain the standard deviation), then kurtosis increases. This is all fine and good, but you cannot turn the logic around and say that increasing kurtosis implies increasing tail and peak mass (and reducing what is in between). That is precisely the fatal flaw with the sometimes-given interpretation that kurtosis measures the “movement of mass simultaneously to the tails and peak but away from the shoulders.” Again, not all mammals are bears. A good counterexample to that interpretation is given here in “counterexample #1,” which shows a family of distributions in which the kurtosis increases to infinity, while the mass inside the center stays constant. (There is also a counterexample #2 that has the mass in the center increasing to 1.0 yet the kurtosis decreases to its minimum, so the often-made assertion that kurtosis measures “concentration of mass in the center” is wrong as well.) Many people think that higher kurtosis implies “more probability in the tails.” This is not true; counterexample #1 shows that you can have higher kurtosis with less tail probability when the tails extend.

So what does kurtosis measure? It precisely measures tail leverage (which can be called tail weight as well) as amplified through fourth powers, as I stated above with my definition of tail-leverage(m).

I would just like to reiterate that my TAS article was not an opinion piece. It was instead a discussion of mathematical theorems and their consequences. There is much additional supportive material in the current post that has come to my attention since writing the TAS article, and I hope readers find it to be helpful for understanding kurtosis.

Source : Link , Question Author : user321627 , Answer Author : BigBendRegion
