I have a set of numbers which are assumed to be coming from a Poisson distribution. The set has some outliers also and because of that, maximum likelihood estimates are badly affected. I heard that robust estimation procedures can help in such a situation. Can any one explain how to do this? I am not a statistics student.
I found that the
glmrobfunction in R can be used for this. (I am quite new to R). But I could not figure out, how to use that despite reading the manual pages. In particular I am unable to understand how to get a
forumulawhich is the first argument to the glmrob function.
@cardinal has telegraphed an answer in comments. Let’s flesh it out. His point is that although general linear models (such as implemented by
lm and, in this case,
glmRob) appear intended to evaluate relationships among variables, they can be powerful tools for studying a single variable, too. The trick relies on the fact that regressing data against a constant is just another way of estimating its average value (“location”).
As an example, generate some Poisson-distributed data:
set.seed(17) x <- rpois(10, lambda=2)
In this case,
R will produce the vector (1,5,2,3,2,2,1,1,3,1) of values for
x from a Poisson distribution of mean 2. Estimate its location with
library(robust) glmrob(x ~ 1, family=poisson())
The response tells us the intercept is estimated at 0.7268. Of course, anyone using a statistical method needs to know how it works: when you use generalized linear models with the Poisson family, the standard “link” function is the logarithm. This means the intercept is the logarithm of the estimated location. So we compute
The result, 2.0685, is comfortably close to 2: the procedure seems to work. To see what it is doing, plot the data:
plot(x, ylim=c(0, max(x))) abline(exp(0.7268), 0, col="red")
The fitted line is purely horizontal and therefore estimates the middle of the vertical values: our data. That’s all that’s going on.
To check robustness, let’s create a bad outlier by tacking a few zeros onto the first value of
x <- 100
This time, for greater flexibility in post-processing, we will save the output of
m <- glmrob(x ~ 1, family=poisson())
To obtain the estimated average we can request
The value this time equals 2.496: a little off, but not too far off, given that the average value of
x (obtained as
mean(x)) is 12. That is the sense in which this procedure is “robust.” More information can be obtained via
Its output shows us, among other things, that the weight associated with the outlying value of 100 in
x is just 0.02179, almost 0, pinpointing the suspected outlier.