# What is the basis for the Box and Whisker Plot definition of an outlier?

The standard definition of an outlier for a box-and-whisker plot is any point outside the interval $\left[Q1-1.5\,IQR,\ Q3+1.5\,IQR\right]$, where $IQR = Q3-Q1$, $Q1$ is the first quartile, and $Q3$ is the third quartile of the data.
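
To make the rule concrete, here is a minimal sketch in R. The helper name `tukey_fences` and the small data vector are made up for illustration and are not part of the original example. Note that `quantile()` and the hinges used internally by `boxplot()` can differ slightly for some sample sizes.

```r
# Hypothetical helper (illustration only): lower and upper fences
# at k * IQR beyond the first and third quartiles.
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  iqr <- q[2] - q[1]                              # interquartile range
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

x <- c(2, 3, 4, 5, 6, 7, 8, 9, 30)  # made-up data with one extreme value
f <- tukey_fences(x)
flagged <- x[x < f["lower"] | x > f["upper"]]  # points the rule calls outliers
f
flagged
```

For this toy vector the quartiles are 4 and 8, so the fences sit at -2 and 14 and only the value 30 is flagged.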

What is the basis for this definition? With a large number of points, even a perfectly normal distribution returns outliers.

    xseq <- seq(1 - 0.5/4000, 0.5/4000, by = -0.00025)


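This sequence holds 4000 evenly spaced probabilities (percentile ranks) strictly between 0 and 1, so `qnorm(xseq)` gives the corresponding quantiles of a standard normal distribution.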

Testing `qnorm(xseq)` for normality gives:

    shapiro.test(qnorm(xseq))

    Shapiro-Wilk normality test

    data:  qnorm(xseq)
    W = 0.99999, p-value = 1

    ad.test(qnorm(xseq))  # ad.test comes from the nortest package

    Anderson-Darling normality test

    data:  qnorm(xseq)
    A = 0.00044273, p-value = 1


The results are as expected: neither test finds any departure from normality. A normal Q-Q plot, `qqnorm(qnorm(xseq))`, likewise shows a straight line.

A boxplot of the same data, `boxplot(qnorm(xseq))`, tells a different story. Unlike `shapiro.test`, `ad.test`, or `qqnorm`, the boxplot flags several points as outliers when the sample size is sufficiently large, as in this example.
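
The flagged points can be counted without drawing the plot. This is a sketch that assumes the same `xseq` as above and uses `boxplot.stats()`, the base R function that `boxplot()` relies on to decide which points lie beyond the whiskers.

```r
# Same evenly spaced probabilities as in the question.
xseq <- seq(1 - 0.5/4000, 0.5/4000, by = -0.00025)
z <- qnorm(xseq)                 # idealised standard normal "sample"

out <- boxplot.stats(z)$out      # points beyond the 1.5 * IQR whiskers
n_out <- length(out)
share_out <- n_out / length(z)   # fraction of the sample that is flagged

n_out
share_out
range(abs(out))                  # how far out the flagged points lie
```

The flagged share should come out at well under 1% of the points, all of them in the extreme tails.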

## Boxplots

Here is a relevant section from Hoaglin, Mosteller and Tukey (2000): Understanding Robust and Exploratory Data Analysis. Wiley. Chapter 3, “Boxplots and Batch Comparison”, written by John D. Emerson and Judith Strenio (from page 62):

> […] Our definition of outliers as data values that are smaller than
> $F_{L}-\frac{3}{2}d_{F}$ or larger than $F_{U}+\frac{3}{2}d_{F}$ is
> somewhat arbitrary, but experience with many data sets indicates that
> this definition serves well in identifying values that may require
> special attention. […]

$F_{L}$ and $F_{U}$ denote the first and third quartiles, whereas $d_{F}$ is the interquartile range (i.e. $F_{U}-F_{L}$).

They go on and show the application to a Gaussian population (page 63):

> Consider the standard Gaussian distribution, with mean $0$ and variance
> $1$. We look for population values of this distribution that are analogous to the sample values used in the boxplot. For a symmetric
> distribution, the median equals the mean, so the population median of
> the standard Gaussian distribution is $0$. The population fourths are
> $-0.6745$ and $0.6745$, so the population fourth-spread is $1.349$, or
> about $\frac{4}{3}$. Thus $\frac{3}{2}$ times the fourth-spread is
> $2.0235$ (about $2$). The population outlier cutoffs are $\pm 2.698$
> (about $2\frac{2}{3}$), and they contain $99.3\%$ of the distribution.
> […]
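
The numbers in this passage are easy to check in R. The variable names below are my own; only `qnorm()` and `pnorm()` from base R are used.

```r
# Population quantities for the standard Gaussian distribution.
q3 <- qnorm(0.75)                 # upper fourth, about 0.6745
spread <- 2 * q3                  # fourth-spread (IQR), about 1.349
step <- 1.5 * spread              # 3/2 times the fourth-spread, about 2.0235
cutoff <- q3 + step               # upper outlier cutoff, about 2.698
coverage <- pnorm(cutoff) - pnorm(-cutoff)  # mass inside the cutoffs, about 0.993

round(c(q3 = q3, spread = spread, step = step,
        cutoff = cutoff, coverage = coverage), 4)
```

So roughly 0.7% of a Gaussian population lies outside the cutoffs, split evenly between the two tails.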

So

> [they] show that if the cutoffs are applied to a Gaussian
> distribution, then $0.7\%$ of the population is outside the outlier
> cutoffs; this figure provides a standard of comparison for judging the
> placement of the outlier cutoffs […].

Further, they write

> […] Thus we can judge whether our data seem heavier-tailed than Gaussian
> by how many points fall beyond the outlier cutoffs. […]
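
As an illustration of that comparison, the sketch below computes the population share outside the cutoffs for a Gaussian and for a heavier-tailed distribution. The Student t distribution with 3 degrees of freedom is my own choice of example, not one taken from the book, and `share_outside` is a hypothetical helper.

```r
# Hypothetical helper: population share beyond the 1.5 * IQR cutoffs,
# given a quantile function (qfun) and a distribution function (pfun).
share_outside <- function(qfun, pfun, ...) {
  q1 <- qfun(0.25, ...)
  q3 <- qfun(0.75, ...)
  iqr <- q3 - q1
  lower <- pfun(q1 - 1.5 * iqr, ...)                      # mass below lower cutoff
  upper <- pfun(q3 + 1.5 * iqr, ..., lower.tail = FALSE)  # mass above upper cutoff
  lower + upper
}

share_norm <- share_outside(qnorm, pnorm)     # Gaussian reference, about 0.007
share_t3 <- share_outside(qt, pt, df = 3)     # heavier tails (assumed example)

c(normal = share_norm, t3 = share_t3)
```

A data set that puts several percent of its points beyond the whiskers therefore looks more like the heavy-tailed case than the Gaussian one.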

They also provide a table with the expected proportion of values that fall outside the outlier cutoffs (labelled “Total % Out”).

So these cutoffs were never intended to be a strict rule about which data points are outliers. As you noted, even a perfect normal distribution is expected to exhibit “outliers” in a boxplot.
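
A small simulation makes the same point for finite samples. This is a rough sketch with arbitrary choices (seed, sample sizes, number of replications), and it is not an attempt to reproduce the table from the book.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: average share of points that boxplot.stats() flags
# in Gaussian samples of size n, over `reps` simulated samples.
flagged_share <- function(n, reps = 1000) {
  mean(replicate(reps, length(boxplot.stats(rnorm(n))$out) / n))
}

sizes <- c(10, 50, 200, 1000)        # sample sizes chosen for illustration
shares <- sapply(sizes, flagged_share)
names(shares) <- sizes
shares
```

For large samples the flagged share should settle near the population value of about 0.7%. Small samples can behave differently because the quartiles themselves are estimated from only a few points.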

## Outliers

As far as I know, there is no universally accepted definition of outlier. I like the definition by Hawkins (1980):

> An outlier is an observation which deviates so much from the other
> observations as to arouse suspicions that it was generated by a
> different mechanism.

Ideally, you should only treat data points as outliers once you understand why they don’t belong to the rest of the data. A simple rule is not sufficient. A good treatment of outliers can be found in Aggarwal (2013).

## References

Aggarwal CC (2013): Outlier Analysis. Springer.
Hawkins D (1980): Identification of Outliers. Chapman and Hall.
Hoaglin DC, Mosteller F, Tukey JW (eds.) (2000): Understanding Robust and Exploratory Data Analysis. Wiley.