The standard definition of an outlier for a box-and-whisker plot is a point outside the range $[Q1 - 1.5\,IQR,\ Q3 + 1.5\,IQR]$, where $IQR = Q3 - Q1$, $Q1$ is the first quartile, and $Q3$ is the third quartile of the data.
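A minimal sketch of this rule in R, for concreteness. The helper name `tukey_fences` and the data are made up for illustration, and the quartiles use `quantile()`'s default definition; `boxplot()` itself uses hinges, which can differ slightly in small samples.

```r
# Hypothetical helper: lower and upper fences of the 1.5 * IQR rule
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3 (default type 7)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)   # made-up data with one extreme value
f <- tukey_fences(x)

# Points outside the fences are the ones a boxplot would draw individually
flagged <- x[x < f["lower"] | x > f["upper"]]
print(f)
print(flagged)
```

Here only the value 30 falls outside the fences.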

What is the basis for this definition? With a large number of points, even a perfectly normal distribution returns outliers.

For example, suppose you start with the sequence:

`xseq<-seq(1-.5^1/4000,.5^1/4000, by = -.00025)`

This sequence creates a percentile ranking of 4000 points of data.

Testing the `qnorm` of this series for normality (`ad.test` comes from the `nortest` package) results in:

`shapiro.test(qnorm(xseq))`

`Shapiro-Wilk normality test`

`data: qnorm(xseq)`

`W = 0.99999, p-value = 1`

`ad.test(qnorm(xseq))`

`Anderson-Darling normality test`

`data: qnorm(xseq)`

`A = 0.00044273, p-value = 1`

The results are exactly as expected: neither test finds any departure from normality. Calling

`qqnorm(qnorm(xseq))`

produces (as expected) a straight line of points.

A boxplot of the same data,

`boxplot(qnorm(xseq))`

behaves differently. The boxplot, unlike `shapiro.test`, `ad.test`, or `qqnorm`, identifies several points as outliers when the sample size is sufficiently large (as in this example).
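To put a number on "several", here is a rough sketch that counts the points flagged by the boxplot rule. It rebuilds the same percentile grid in increasing order (the names `n`, `p`, and `z` are mine) and relies on `boxplot.stats()`, the function `boxplot()` uses to compute its statistics.

```r
n <- 4000
p <- (seq_len(n) - 0.5) / n   # same percentile grid as xseq, in increasing order
z <- qnorm(p)                 # the "perfectly normal" sample

# Points beyond the whiskers (boxplot.stats uses coef = 1.5 by default)
out <- boxplot.stats(z)$out

n_flagged <- length(out)
frac_flagged <- n_flagged / n
print(n_flagged)
print(frac_flagged)
```

The flagged fraction should come out close to 0.7%, so the count grows roughly in proportion to the sample size.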

**Answer**

## Boxplots

Here is a relevant section from Hoaglin, Mosteller and Tukey (2000): Understanding Robust and Exploratory Data Analysis. Wiley. Chapter 3, “Boxplots and Batch Comparison”, written by John D. Emerson and Judith Strenio (from page 62):

> […] Our definition of outliers as data values that are smaller than $F_{L}-\frac{3}{2}d_{F}$ or larger than $F_{U}+\frac{3}{2}d_{F}$ is somewhat arbitrary, but experience with many data sets indicates that this definition serves well in identifying values that may require special attention. […]

$F_{L}$ and $F_{U}$ denote the first and third quartiles, whereas $d_{F}$ is the interquartile range (i.e. $F_{U}-F_{L}$).

They go on to show the application to a Gaussian population (page 63):

> Consider the standard Gaussian distribution, with mean 0 and variance 1. We look for population values of this distribution that are analogous to the sample values used in the boxplot. For a symmetric distribution, the median equals the mean, so the population median of the standard Gaussian distribution is 0. The population fourths are -0.6745 and 0.6745, so the population fourth-spread is 1.349, or about $\frac{4}{3}$. Thus $\frac{3}{2}$ times the fourth-spread is 2.0235 (about 2). The population outlier cutoffs are $\pm 2.698$ (about $2\frac{2}{3}$), and they contain $99.3\%$ of the distribution.

[…]

So

> [they] show that if the cutoffs are applied to a Gaussian distribution, then $0.7\%$ of the population is outside the outlier cutoffs; this figure provides a standard of comparison for judging the placement of the outlier cutoffs […].
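The quoted Gaussian figures are easy to check in R. This is just a sketch of the arithmetic using `qnorm` and `pnorm`; the variable names are mine, not the book's.

```r
q3     <- qnorm(0.75)       # upper population fourth, about 0.6745
spread <- 2 * q3            # fourth-spread (IQR), about 1.349
step   <- 1.5 * spread      # 3/2 times the fourth-spread, about 2.0235
cutoff <- q3 + step         # upper outlier cutoff, about 2.698

# Probability mass between the two cutoffs, about 0.993
inside <- pnorm(cutoff) - pnorm(-cutoff)

print(round(c(q3 = q3, spread = spread, step = step,
              cutoff = cutoff, inside = inside,
              outside = 1 - inside), 4))
```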

Further, they write:

> […] Thus we can judge whether our data seem heavier-tailed than Gaussian by how many points fall beyond the outlier cutoffs. […]
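To illustrate that comparison, the sketch below computes the population fraction beyond the cutoffs for a Gaussian and for two heavier-tailed distributions. The helper `frac_outside` is hypothetical, and the choice of a t distribution with 5 degrees of freedom and a Cauchy is mine, purely as examples.

```r
# Hypothetical helper: population fraction outside the 1.5 * IQR cutoffs,
# given a quantile function qf and a distribution function pf
frac_outside <- function(qf, pf, k = 1.5) {
  q1 <- qf(0.25)
  q3 <- qf(0.75)
  iqr <- q3 - q1
  pf(q1 - k * iqr) + (1 - pf(q3 + k * iqr))
}

fracs <- c(
  gaussian = frac_outside(qnorm, pnorm),
  t_5df    = frac_outside(function(p) qt(p, df = 5),
                          function(x) pt(x, df = 5)),
  cauchy   = frac_outside(qcauchy, pcauchy)
)

print(round(100 * fracs, 2))   # percent of the population beyond the cutoffs
```

The heavier the tails, the larger the share of the population that lands beyond the cutoffs, which is the sense in which the Gaussian 0.7% serves as a baseline.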

They also provide a table with the expected proportion of values that fall outside the outlier cutoffs (labelled “Total % Out”).

So these cutoffs were never intended to be a strict rule about which data points are outliers and which are not. As you noted, even a perfect Normal distribution is expected to exhibit “outliers” in a boxplot.
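A small simulation makes the same point for finite samples. This is only a sketch: the sample sizes, the number of replications, and the seed are arbitrary choices of mine, and the output is a set of simulation estimates, not the values from the book's table.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

# Draw Gaussian samples of size n and count the points boxplot.stats() flags
sim_flagged <- function(n, reps = 2000) {
  n_out <- replicate(reps, length(boxplot.stats(rnorm(n))$out))
  c(n = n,
    mean_pct_out = 100 * mean(n_out) / n,          # average percent flagged
    pct_samples_with_any = 100 * mean(n_out > 0))  # percent of samples with any flagged point
}

res <- t(sapply(c(10, 50, 100, 1000), sim_flagged))
print(round(res, 2))
```

With purely Gaussian data, the share of samples showing at least one flagged point rises with the sample size, so a few “outliers” in a large boxplot are not by themselves evidence of anything unusual.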

## Outliers

As far as I know, there is no universally accepted definition of outlier. I like the definition by Hawkins (1980):

> An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.

Ideally, you should only treat data points as outliers once you understand *why* they don’t belong to the rest of the data. A simple rule is not sufficient. A good treatment of outliers can be found in Aggarwal (2013).

## References

Aggarwal CC (2013): Outlier Analysis. Springer.

Hawkins D (1980): Identification of Outliers. Chapman and Hall.

Hoaglin DC, Mosteller F, Tukey JW (2000): Understanding Robust and Exploratory Data Analysis. Wiley.

**Attribution**

*Source: Link, Question Author: Tavrock, Answer Author: Nick Cox*