What is the basis for the Box and Whisker Plot definition of an outlier?

The standard definition of an outlier for a box-and-whisker plot is any point outside the interval \left[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}\right], where \mathrm{IQR} = Q_3 - Q_1, Q_1 is the first quartile, and Q_3 is the third quartile of the data.
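As a rough illustration of that rule, the fences can be computed directly in R. This is only a sketch: `boxplot_fences` is a hypothetical helper name, the data vector is made up, and `quantile()` with its default settings is assumed for the quartiles. R's own `boxplot()` uses hinges from `fivenum()`, which can differ slightly from these quartiles in small samples.

```r
# Hypothetical helper: lower and upper outlier fences of a numeric vector x.
# k = 1.5 is the conventional multiplier of the interquartile range.
boxplot_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Made-up data with one value far from the rest
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
fences <- boxplot_fences(x)

# Points outside the fences are the ones a boxplot would draw individually
outliers <- x[x < fences["lower"] | x > fences["upper"]]
print(fences)
print(outliers)
```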

What is the basis for this definition? With a large number of points, even a perfectly normal distribution returns outliers.

For example, suppose you start with the sequence:

xseq <- seq(1 - 0.5/4000, 0.5/4000, by = -0.00025)

This sequence gives 4000 evenly spaced probabilities strictly between 0 and 1, i.e. the percentile ranks of 4000 data points.

Testing normality for the qnorm of this series results in:

shapiro.test(qnorm(xseq))

    Shapiro-Wilk normality test

data:  qnorm(xseq)
W = 0.99999, p-value = 1

library(nortest)  # provides ad.test()
ad.test(qnorm(xseq))

    Anderson-Darling normality test

data:  qnorm(xseq)
A = 0.00044273, p-value = 1

The results are exactly as expected: the quantiles of a normal distribution pass both normality tests. Plotting qqnorm(qnorm(xseq)) gives (as expected) a straight line:

qqnorm plot of data

If a boxplot of the same data is created, boxplot(qnorm(xseq)) produces the result:

boxplot of the data

The boxplot, unlike shapiro.test, ad.test, or qqnorm, identifies several points as outliers when the sample size is sufficiently large (as in this example).
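To see how many points are flagged without reading them off the plot, one option is `boxplot.stats()`, which returns the values drawn beyond the whiskers in its `out` component. The sketch below rebuilds the same sequence; the variable names `z` and `flagged` are only for illustration.

```r
# Same 4000 probabilities as above, converted to normal quantiles
xseq <- seq(1 - 0.5/4000, 0.5/4000, by = -0.00025)
z <- qnorm(xseq)

# Values beyond the whiskers (coef = 1.5 is the default multiplier)
flagged <- boxplot.stats(z)$out

length(flagged)               # number of points drawn as outliers
length(flagged) / length(z)   # share of the sample that is flagged
```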

Answer

Boxplots

Here is a relevant section from Hoaglin, Mosteller and Tukey (2000): Understanding Robust and Exploratory Data Analysis. Wiley. Chapter 3, “Boxplots and Batch Comparison”, written by John D. Emerson and Judith Strenio (from page 62):

[…] Our definition of outliers as data values that are smaller than
F_{L}-\frac{3}{2}d_{F} or larger than F_{U}+\frac{3}{2}d_{F} is
somewhat arbitrary, but experience with many data sets indicates that
this definition serves well in identifying values that may require
special attention.[…]

F_{L} and F_{U} denote the first and third quartile, whereas d_{F} is the interquartile range (i.e. F_{U}-F_{L}).

They go on to show the application to a Gaussian population (page 63):

Consider the standard Gaussian distribution, with mean 0 and variance 1.
We look for population values of this distribution that are analogous
to the sample values used in the boxplot. For a symmetric
distribution, the median equals the mean, so the population median of
the standard Gaussian distribution is 0. The population fourths are
-0.6745 and 0.6745, so the population fourth-spread is 1.349, or
about \frac{4}{3}. Thus \frac{3}{2} times the fourth-spread is
2.0235 (about 2). The population outlier cutoffs are \pm 2.698
(about 2\frac{2}{3}), and they contain 99.3\% of the distribution.
[…]

So

[they] show that if the cutoffs are applied to a Gaussian
distribution, then 0.7\% of the population is outside the outlier
cutoffs; this figure provides a standard of comparison for judging the
placement of the outlier cutoffs […].
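The figures quoted above can be checked with a few lines of R. This is just a worked version of the arithmetic in the quotation, using `qnorm()` and `pnorm()`; the variable names are my own.

```r
q3     <- qnorm(0.75)            # upper population fourth, about 0.6745
spread <- 2 * q3                 # fourth-spread (IQR), about 1.349
cutoff <- q3 + 1.5 * spread      # upper outlier cutoff, about 2.698

inside  <- pnorm(cutoff) - pnorm(-cutoff)  # share within the cutoffs, about 0.993
outside <- 1 - inside                      # share beyond the cutoffs, about 0.007

round(c(q3 = q3, spread = spread, cutoff = cutoff,
        inside = inside, outside = outside), 4)
```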

Further, they write

[…] Thus we can judge whether our data seem heavier-tailed than Gaussian
by how many points fall beyond the outlier cutoffs. […]
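As a sketch of that comparison (not a reproduction of anything in the book), the same population calculation can be repeated for a heavier-tailed distribution. Here `frac_out` is a hypothetical helper, and Student's t with 3 degrees of freedom is chosen purely as an example of heavy tails; the helper assumes a distribution symmetric about 0.

```r
# Hypothetical helper: population fraction beyond the 1.5 * IQR cutoffs
# for a distribution symmetric about 0, given its quantile function (qf)
# and distribution function (pf). Extra arguments are passed to both.
frac_out <- function(qf, pf, ...) {
  q3 <- qf(0.75, ...)               # upper quartile; lower quartile is -q3
  cutoff <- q3 + 1.5 * (2 * q3)     # Q3 + 1.5 * IQR
  2 * pf(-cutoff, ...)              # both tails, by symmetry
}

frac_normal <- frac_out(qnorm, pnorm)       # Gaussian reference value
frac_t3     <- frac_out(qt, pt, df = 3)     # heavier-tailed example

c(normal = frac_normal, t3 = frac_t3)
```

A sample whose share of flagged points is well above the Gaussian reference value is behaving more like the second case than the first.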

They provide a table with the expected proportion of values that fall outside the outlier cutoffs (labelled “Total % Out”):

Table 3-2

So these cutoffs were never intended to be a strict rule about which data points are outliers. As you noted, even a perfect Normal distribution is expected to exhibit “outliers” in a boxplot.


Outliers

As far as I know, there is no universally accepted definition of an outlier. I like the definition by Hawkins (1980):

An outlier is an observation which deviates so much from the other
observations as to arouse suspicions that it was generated by a
different mechanism.

Ideally, you should only treat data points as outliers once you understand why they don’t belong to the rest of the data. A simple rule is not sufficient. A good treatment of outliers can be found in Aggarwal (2013).

References

Aggarwal CC (2013): Outlier Analysis. Springer.
Hawkins D (1980): Identification of Outliers. Chapman and Hall.
Hoaglin DC, Mosteller F, Tukey JW (2000): Understanding Robust and Exploratory Data Analysis. Wiley.

Attribution
Source: Link, Question Author: Tavrock, Answer Author: Nick Cox
