What is the basis for the Box and Whisker Plot definition of an outlier?

The standard definition of an outlier for a Box and Whisker plot is any point outside the interval \left[Q1-1.5\,IQR,\ Q3+1.5\,IQR\right], where IQR = Q3-Q1, Q1 is the first quartile, and Q3 is the third quartile of the data.

What is the basis for this definition? With a large number of points, even a perfectly normal distribution returns outliers.

For example, suppose you start with the sequence:

xseq <- seq(1 - 0.5/4000, 0.5/4000, by = -0.00025)

This sequence gives 4000 evenly spaced probabilities strictly between 0 and 1, i.e. a percentile ranking for 4000 data points.

Applying qnorm to this sequence and testing the result for normality gives:


    Shapiro-Wilk normality test

data:  qnorm(xseq)
W = 0.99999, p-value = 1


    Anderson-Darling normality test

data:  qnorm(xseq)
A = 0.00044273, p-value = 1

The results are exactly as expected: data built from normal quantiles pass the normality tests. Plotting qqnorm(qnorm(xseq)) gives (as expected) a straight line:

[Figure: qqnorm plot of the data]

If a boxplot of the same data is created, boxplot(qnorm(xseq)) produces the result:

[Figure: boxplot of the data]

The boxplot, unlike shapiro.test, ad.test, or qqnorm, identifies several points as outliers when the sample size is sufficiently large (as in this example).
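The number of flagged points can be checked directly rather than read off the plot. The following is a minimal sketch, assuming base R only; it uses boxplot.stats(), which applies the same 1.5 * IQR rule that boxplot() uses by default. The variable names (x, bs, n_out, frac_out, whiskers) are my own.

```r
# Rebuild the data: normal quantiles at 4000 evenly spaced probabilities.
xseq <- seq(1 - 0.5 / 4000, 0.5 / 4000, by = -0.00025)
x <- qnorm(xseq)

# boxplot.stats() returns the five plotted statistics ($stats) and the
# points lying beyond 1.5 * IQR from the hinges ($out).
bs <- boxplot.stats(x, coef = 1.5)

n_out <- length(bs$out)          # number of points drawn individually
frac_out <- n_out / length(x)    # share of the sample that is flagged
whiskers <- bs$stats[c(1, 5)]    # most extreme points still inside the fences

print(n_out)
print(frac_out)
print(whiskers)
```

Note that boxplot.stats() works with hinges rather than with the output of quantile(), so for small samples the fences can differ slightly from a hand calculation based on Q1 and Q3.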



Here is a relevant section from Hoaglin, Mosteller and Tukey (2000): Understanding Robust and Exploratory Data Analysis. Wiley. Chapter 3, “Boxplots and Batch Comparison”, written by John D. Emerson and Judith Strenio (from page 62):

[…] Our definition of outliers as data values that are smaller than
F_{L}-\frac{3}{2}d_{F} or larger than F_{U}+\frac{3}{2}d_{F} is
somewhat arbitrary, but experience with many data sets indicates that
this definition serves well in identifying values that may require
special attention.[…]

F_{L} and F_{U} denote the first and third quartiles, whereas d_{F} is the interquartile range (i.e. F_{U}-F_{L}).

They go on to show the application to a Gaussian population (page 63):

Consider the standard Gaussian distribution, with mean 0 and variance 1.
We look for population values of this distribution that are
analogous to the sample values used in the boxplot. For a symmetric
distribution, the median equals the mean, so the population median of
the standard Gaussian distribution is 0. The population fourths are
-0.6745 and 0.6745, so the population fourth-spread is 1.349, or
about \frac{4}{3}. Thus \frac{3}{2} times the fourth-spread is
2.0235 (about 2). The population outlier cutoffs are \pm 2.698
(about 2\frac{2}{3}), and they contain 99.3\% of the distribution.
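The figures in this passage are easy to reproduce. Here is a short sketch in base R using qnorm() and pnorm(); the variable names are mine, not the authors'.

```r
# Population quartiles ("fourths") of the standard Gaussian.
q1 <- qnorm(0.25)                     # about -0.6745
q3 <- qnorm(0.75)                     # about  0.6745
fourth_spread <- q3 - q1              # about  1.349

# Outlier cutoffs: the fourths extended by 1.5 fourth-spreads.
lower <- q1 - 1.5 * fourth_spread     # about -2.698
upper <- q3 + 1.5 * fourth_spread     # about  2.698

# Probability mass inside and outside the cutoffs.
inside <- pnorm(upper) - pnorm(lower) # about 0.993
outside <- 1 - inside                 # about 0.007

print(c(lower = lower, upper = upper, inside = inside, outside = outside))
```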


[they] show that if the cutoffs are applied to a Gaussian
distribution, then 0.7\% of the population is outside the outlier
cutoffs; this figure provides a standard of comparison for judging the
placement of the outlier cutoffs […].

Further, they write:

[…] Thus we can judge whether our data seem heavier-tailed than Gaussian
by how many points fall beyond the outlier cutoffs. […]

They provide a table with the expected proportion of values that fall outside the outlier cutoffs (labelled “Total % Out”):

[Table 3-2]
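The idea of using the Gaussian figure as a standard of comparison can be illustrated at the population level. The sketch below is my own and is not taken from the book: frac_beyond is a hypothetical helper, and the t distribution with 3 degrees of freedom is just an arbitrary heavier-tailed example. It also ignores sample size, which the table in the book takes into account.

```r
# Population fraction beyond the 1.5 * IQR cutoffs, given a quantile
# function (qfun) and a distribution function (pfun) of the same family.
# Extra arguments (e.g. df) are passed on to both functions.
frac_beyond <- function(qfun, pfun, k = 1.5, ...) {
  q1 <- qfun(0.25, ...)
  q3 <- qfun(0.75, ...)
  iqr <- q3 - q1
  pfun(q1 - k * iqr, ...) + (1 - pfun(q3 + k * iqr, ...))
}

frac_norm <- frac_beyond(qnorm, pnorm)     # Gaussian reference value
frac_t3 <- frac_beyond(qt, pt, df = 3)     # heavier-tailed comparison

print(c(normal = frac_norm, t3 = frac_t3))
```

A heavier-tailed distribution puts a larger share of its mass beyond the cutoffs than the Gaussian does, which is the comparison the authors have in mind.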

So these cutoffs were never intended to be a strict rule about which data points are outliers and which are not. As you noted, even a perfect Normal distribution is expected to exhibit “outliers” in a boxplot.


As far as I know, there is no universally accepted definition of an outlier. I like the definition by Hawkins (1980):

An outlier is an observation which deviates so much from the other
observations as to arouse suspicions that it was generated by a
different mechanism.

Ideally, you should only treat data points as outliers once you understand why they don’t belong to the rest of the data. A simple rule is not sufficient. A good treatment of outliers can be found in Aggarwal (2013).


Aggarwal CC (2013): Outlier Analysis. Springer.
Hawkins D (1980): Identification of Outliers. Chapman and Hall.
Hoaglin, Mosteller and Tukey (2000): Understanding Robust and Exploratory Data Analysis. Wiley.

Source : Link , Question Author : Tavrock , Answer Author : Nick Cox
