# Is there a simple way of detecting outliers?

I am wondering if there is a simple way of detecting outliers.

For one of my projects, I looked at the correlation between the number of times respondents participate in physical activity in a week and the number of times they eat outside the home (fast food) in a week. I drew a scatterplot and simply removed the data points that were extreme. (The scatterplot showed a negative correlation.)

This was a judgement call, based on the scatterplot, where these data points were clearly extreme. I did not do any statistical tests.

I am just wondering if this is a sound way of dealing with outliers.

I have data from 350 people so loss of (say) 20 data points is not a worry to me.

There is no simple sound way to remove outliers. Outliers can be of two kinds:

1) Data entry errors. These are often the easiest to spot and always the easiest to deal with. If you can find the right data, correct it; if not, delete it.

2) Legitimate data that is unusual. This is much trickier. For bivariate data like yours, the outlier could be univariate or bivariate.

a) Univariate. First, “unusual” depends on the distribution and the sample size. You give us the sample size of 350, but what is the distribution? It clearly isn’t normal, since each variable is a count that takes relatively small integer values. What is unusual under a Poisson would not be under a negative binomial. I’d kind of suspect a zero-inflated negative binomial distribution.
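To make the Poisson versus negative binomial point concrete, here is a minimal sketch that compares the upper-tail probability of the same count under both distributions. The mean of 2, the dispersion (`size`) of 1, and the threshold of 8 are made-up values for illustration, not estimates from your data.

```python
import math

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def negbin_pmf(k, mean, size):
    """P(X = k) for a negative binomial parameterised by mean and dispersion `size`."""
    p = size / (size + mean)
    log_pmf = (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
               + size * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def upper_tail(pmf, k, **params):
    """P(X >= k), computed as one minus the probability of values below k."""
    return 1.0 - sum(pmf(i, **params) for i in range(k))

# Hypothetical values: both distributions share the same mean.
mean = 2.0
size = 1.0       # small size = heavy overdispersion (assumed, not estimated)
threshold = 8    # "8 or more times a week"

poisson_tail = upper_tail(poisson_pmf, threshold, mean=mean)
negbin_tail = upper_tail(negbin_pmf, threshold, mean=mean, size=size)

print(f"P(X >= {threshold}) under Poisson:           {poisson_tail:.5f}")
print(f"P(X >= {threshold}) under negative binomial: {negbin_tail:.5f}")
```

With these assumed parameters, a value of 8 is far less surprising under the negative binomial than under the Poisson, even though both have the same mean, so whether it counts as "unusual" depends on which model you believe.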

But even when you have the distribution, the (possible) outliers will affect the parameter estimates. You can look at “leave one out” distributions, where you check whether data point q would be an outlier relative to the data with q removed. Even then, though, what if there are multiple outliers? They can mask one another.
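A leave-one-out check could be sketched as follows. It assumes, purely for illustration, that the counts are Poisson (which, as noted, may well be the wrong model here) and uses made-up data; the function names are hypothetical.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)

def leave_one_out_tail_probs(counts):
    """For each point q, fit the Poisson mean on all other points,
    then ask how surprising q is under that fit."""
    total = sum(counts)
    n = len(counts)
    probs = []
    for q in counts:
        lam_without_q = (total - q) / (n - 1)  # mean of the remaining points
        probs.append(poisson_upper_tail(q, lam_without_q))
    return probs

# Made-up weekly counts; the last value is the suspicious one.
counts = [2, 3, 1, 0, 2, 4, 3, 2, 1, 14]
probs = leave_one_out_tail_probs(counts)

for q, p in zip(counts, probs):
    print(f"count={q:2d}  P(X >= count | others) = {p:.2e}")
```

Note that this only removes one point at a time. If two or more extreme values are present, each one still inflates the mean used to judge the others, which is exactly the multiple-outlier problem.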

b) Bivariate. This is where neither variable’s value is unusual in itself, but together they are odd. There is a possibly apocryphal report that the census once said there were 20,000 12-year-old widows in the USA. 12-year-olds aren’t unusual, widows aren’t either, but 12-year-old widows are.
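One common way to look for points that are odd only in combination is the Mahalanobis distance, which measures how far a point is from the centre of the data after accounting for the correlation between the two variables. This is a technique I am swapping in for illustration, and it has its own caveats: it assumes a roughly elliptical cloud, and the mean and covariance it uses are themselves affected by outliers. The data below are made up.

```python
def mahalanobis_squared(xs, ys):
    """Squared Mahalanobis distance of each (x, y) pair from the sample mean,
    using the ordinary sample covariance matrix of the two variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    det = var_x * var_y - cov_xy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        distances.append((var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / det)
    return distances

# Made-up data: x and y move together, except for the last pair (2, 7).
# Neither 2 nor 7 is unusual on its own axis.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 2]
ys = [1, 2, 3, 4, 5, 6, 7, 8, 7]

d2 = mahalanobis_squared(xs, ys)
most_unusual = max(range(len(d2)), key=lambda i: d2[i])
print("most unusual pair:", (xs[most_unusual], ys[most_unusual]))
```

The pair (2, 7) stands out even though both of its values sit comfortably inside the range of each variable taken separately.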

Given all this, it might be simpler to keep all the data and report a robust measure of association, such as a rank correlation.
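As one example of a more robust measure, here is a minimal sketch of Spearman's rank correlation (Pearson's correlation computed on ranks, with ties given average ranks), compared against plain Pearson on made-up data containing one extreme point. The variable names and values are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Ordinary product-moment correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up data: a clean negative relationship plus one extreme respondent.
activity = [0, 1, 2, 3, 4, 5, 6, 7, 20]
fast_food = [7, 6, 5, 4, 3, 2, 1, 0, 20]

r_pearson = pearson(activity, fast_food)
r_spearman = spearman(activity, fast_food)
print(f"Pearson:  {r_pearson:+.2f}")
print(f"Spearman: {r_spearman:+.2f}")
```

In this toy example the single extreme point flips the sign of Pearson's correlation, while Spearman's stays negative. It is still pulled toward zero, so "robust" means less sensitive, not immune.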