Is the mean sensitive to the presence of outliers? I initially thought it wasn’t, because a small number of observations shouldn’t have much impact, but was told that since those observations have very different values from the rest, they have a considerable impact. Thoughts?
Consider what would happen if you wanted to take the mean of some numbers, but you dragged one of them off toward infinity. Sure, at first it wouldn’t have a huge impact on the mean, but the farther you drag it off, the more your mean changes.
Every number has a (proportionally) small contribution to the mean, but they do all contribute. So if one number is really different than the others, it can still have a big influence.
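Here's a quick sketch of that dragging experiment, if it helps to see numbers. The data and the helper name `mean_with_outlier` are made up for illustration; the only point is that the mean keeps moving as one value gets more extreme.

```python
from statistics import mean

def mean_with_outlier(values, outlier):
    """Mean of `values` after replacing the last entry with `outlier`."""
    return mean(values[:-1] + [outlier])

# Made-up data: five ordinary points. We drag the last one farther and farther away.
data = [1.0, 2.0, 3.0, 4.0, 5.0]

for outlier in (5.0, 50.0, 500.0, 5000.0):
    print(f"last value = {outlier:>7}: mean = {mean_with_outlier(data, outlier)}")
```

With N points, moving a single point by some amount d moves the mean by d/N, so the shift has no upper limit as long as d can keep growing.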
This idea of dragging values off toward infinity and seeing how the estimator behaves is formalized by the breakdown point: roughly, the proportion of the data you have to make arbitrarily large before the estimator becomes arbitrarily large too.
The mean has a breakdown point of 0, because it only takes 1 bad data point to make the whole estimator bad. (Strictly, 0 is the asymptotic breakdown point; the finite-sample breakdown point is 1/N.)
On the other hand, the median has breakdown point 0.5 because it doesn’t care about how strange data gets, as long as the middle point doesn’t change. You can take just under half of the data and make it arbitrarily large and the median shrugs it off.
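To make the contrast concrete, here's a small sketch that corrupts k out of 11 made-up points and compares the two estimators. The `contaminate` helper and the value `1e12` standing in for "arbitrarily large" are my own assumptions, not anything standard.

```python
from statistics import mean, median

def contaminate(values, k, bad=1e12):
    """Return a sorted copy of `values` with the k largest entries replaced by `bad`."""
    ordered = sorted(values)
    return ordered[: len(ordered) - k] + [bad] * k

# Made-up data: the integers 1..11, so the clean median is 6.
data = list(range(1, 12))

for k in (0, 1, 5, 6):
    noisy = contaminate(data, k)
    print(f"k = {k}: mean = {mean(noisy):.3g}, median = {median(noisy):.3g}")
```

A single bad point is already enough to wreck the mean, while the median stays at 6 until more than half of the points (6 of 11 here) are corrupted.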
You can even construct an estimator with whatever breakdown point you want (between 0 and 0.5) by ‘trimming’ the mean by that percentage: throwing away that fraction of the data from each end before computing the mean.
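A minimal sketch of a trimmed mean, assuming the usual convention of cutting the same fraction from both the low end and the high end. The function name and the data are made up; if you'd rather not roll your own, SciPy ships `scipy.stats.trim_mean`, which follows the same idea.

```python
from statistics import mean

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the points from each end of the sorted data."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # number of points cut from each end
    return mean(ordered[k : len(ordered) - k])

# Made-up data: nine ordinary points plus one wild outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1e12]

print(trimmed_mean(data, 0.0))  # plain mean, dominated by the outlier
print(trimmed_mean(data, 0.1))  # drops 1 point from each end, outlier gone
```

Trimming 10% from each end survives one bad point in ten but not two on the same side, which is the sense in which the trimming fraction sets the breakdown point.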
So, what does this mean for actually doing work? Is the mean just a terrible idea? Well, like everything else in life, it depends. If you desperately need to protect yourself against outliers, yeah, the mean probably isn’t for you. But the median pays a price of losing a lot of potentially helpful information to get that high breakdown point.
If you’re interested in reading more about it, here’s a set of lecture notes that really helped me when I was learning about robust statistics.