Problems with Outlier Detection

In a blog post Andrew Gelman writes:

Stepwise regression is one of these things, like outlier detection and
pie charts, which appear to be popular among non-statisticians but are
considered by statisticians to be a bit of a joke.

I understand the reference to pie charts, but why is outlier detection looked down upon by statisticians according to Gelman? Is it just that it might cause people to over-prune their data?

Answer

@Jerome Baum’s comment is spot on. To bring the Gelman quote here:

Outlier detection can be a good thing. The problem is that
non-statisticians seem to like to latch on to the word “outlier”
without trying to think at all about the process that creates the
outlier, also some textbooks have rules that look stupid to
statisticians such as myself, rules such as labeling something as an
outlier if it is more than some number of sd’s from the median, or
whatever. The concept of an outlier is useful but I think it requires
context—if you label something as an outlier, you want to try to get
some sense of why you think that.

To add a little bit more, how about we first define “outlier”? Try to do so rigorously, without referring to anything visual like “looks like it’s far away from other points”. It’s actually quite hard.

I’d say that an outlier is a point that is highly unlikely given a model of how points are generated. In most situations, people don’t actually have a model of how the points are generated, or if they do, it is so oversimplified as to be wrong much of the time. So, as Andrew says, people will do things like assume that the points come from a Gaussian distribution, so that any point more than a certain number of SDs from the mean is an outlier. Mathematically convenient, not so principled.
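To make the "unlikely given a model" idea concrete, here is a minimal sketch in Python. The reference values and the new point are made up for illustration, and the two candidate models (Gaussian and lognormal) are my own assumptions, not anything from Gelman's post. The point is only that the same observation can look wildly improbable under one model and merely unusual under another.

```python
import math
import statistics

def gaussian_tail_prob(x, data):
    """Two-sided probability of a value at least as far from the mean as x,
    under a Gaussian fitted to `data` (sample mean and sample SD)."""
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    z = abs(x - mu) / sigma
    return math.erfc(z / math.sqrt(2))

def lognormal_tail_prob(x, data):
    """Same calculation on the log scale, i.e. assuming log(values) are Gaussian."""
    logs = [math.log(v) for v in data]
    return gaussian_tail_prob(math.log(x), logs)

# Hypothetical, right-skewed reference data (e.g. waiting times) and a new point.
reference = [1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 9.0]
new_point = 15.0

p_gauss = gaussian_tail_prob(new_point, reference)
p_lognorm = lognormal_tail_prob(new_point, reference)

print(f"Gaussian model:  p = {p_gauss:.2e}")
print(f"Lognormal model: p = {p_lognorm:.2e}")
```

Under the Gaussian assumption the new point sits well beyond 3 SDs and would be flagged by the usual textbook rule; under the lognormal assumption it is within 3 SDs on the log scale. Neither answer is "right" without some argument for why one model describes the process that generated the data.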

And we haven’t even gotten into what people do with outliers once they are identified. Most people want to throw these inconvenient points away, for example. In many cases, it’s the outliers that lead to breakthroughs and discoveries, not the non-outliers!

There’s a lot of ad-hoc’ery in outlier detection, as practiced by non-statisticians, and Andrew is uncomfortable with that.

Attribution
Source: Link, Question Author: 114, Answer Author: Wayne
