I have been given a data set that contains the number of awards earned by students at one high school where predictors of the number of awards earned include the type of program in which the student was enrolled and the score on their final exam in maths.
I was wondering if anyone could tell me why a linear regression model may be unsuitable in this instance and why it would be better to use a Poisson regression?
Three points about Poisson vs Normal regression, all concerning model specification:
Effect of changes in predictors
With a continuous predictor like math test score, Poisson regression (with the usual log link) implies that a unit change in the predictor leads to a percentage change in the expected number of awards, e.g. 10 more points on the math test is associated with 25 percent more awards. The absolute change therefore depends on the number of awards the student is already predicted to have. In contrast, Normal regression associates 10 more points with a fixed amount, say 3 more awards, under all circumstances. You should be happy with that assumption before using the model that makes it. (fwiw I think the multiplicative assumption is very reasonable, modulo the next point.)
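To make the contrast concrete, here is a minimal sketch in Python. The coefficients are hypothetical, picked only so the numbers reproduce the "25 percent" and "3 more awards" examples above; they are not estimates from any data.

```python
import math

# Hypothetical coefficients, chosen only to match the examples in the text.
POISSON_INTERCEPT = -2.0
POISSON_SLOPE = math.log(1.25) / 10  # +10 points multiplies the mean by 1.25
NORMAL_INTERCEPT = -10.0
NORMAL_SLOPE = 0.3                   # +10 points adds 3 awards


def poisson_mean(score):
    """Expected awards under a log-link Poisson model."""
    return math.exp(POISSON_INTERCEPT + POISSON_SLOPE * score)


def normal_mean(score):
    """Expected awards under a Normal (linear) model."""
    return NORMAL_INTERCEPT + NORMAL_SLOPE * score


for score in (40, 60, 80):
    p0, p1 = poisson_mean(score), poisson_mean(score + 10)
    n0, n1 = normal_mean(score), normal_mean(score + 10)
    print(
        f"{score}->{score + 10}: "
        f"Poisson x{p1 / p0:.2f} (+{p1 - p0:.2f} awards), "
        f"Normal +{n1 - n0:.2f} awards"
    )
```

The Poisson ratio is the same at every score, while its absolute gain grows with the baseline; the Normal gain is the same everywhere.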
Dealing with students with no awards
Unless there are a great many awards spread over lots of students, your award counts will mostly be rather low. In fact I would predict zero-inflation, i.e. most students don’t get any award, so lots of zeros, while some good students get quite a few awards. This messes with the assumptions of the Poisson model and is at least as bad for the Normal model.
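A rough way to check for this, sketched in Python: compare the share of zeros you observe with the share a plain Poisson with the same mean would produce, exp(-mean), and compare the variance with the mean (a Poisson has them equal). The counts below are made up purely for illustration; substitute your own.

```python
import math
from statistics import mean, variance

# Made-up award counts for 20 students (illustration only, not real data).
counts = [0] * 14 + [1] * 3 + [2, 4, 6]

sample_mean = mean(counts)
sample_var = variance(counts)

# Share of zeros in the data vs. the share a Poisson with this mean implies.
observed_zero_share = counts.count(0) / len(counts)
poisson_zero_share = math.exp(-sample_mean)

print(f"mean = {sample_mean:.2f}, variance = {sample_var:.2f}")
print(f"zeros observed = {observed_zero_share:.2f}, "
      f"zeros implied by Poisson = {poisson_zero_share:.2f}")
```

Many more zeros than the Poisson implies, together with a variance well above the mean, is the pattern described above. This is only a crude marginal check: it ignores the predictors, which may themselves explain some of the zeros.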
If you have a decent amount of data, a ‘zero-inflated’ or ‘hurdle’ model would then be natural. This is two models tied together: one to predict whether the student gets any awards, and another to predict how many she gets if she gets any at all (usually some form of Poisson model). I would expect all the action to be in the first model.
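The two-part structure just described matches most directly what is usually called a hurdle model. A minimal sketch of its probability function in Python, assuming the second part is a zero-truncated Poisson and leaving out the predictors; `p_any` and `lam` are hypothetical parameter names, and the values used are arbitrary.

```python
import math


def hurdle_pmf(k, p_any, lam):
    """P(awards = k) under a Poisson hurdle model.

    p_any: probability that the student gets at least one award (part one).
    lam:   rate of the Poisson that is truncated at zero (part two).
    """
    if k == 0:
        return 1.0 - p_any
    poisson_k = math.exp(-lam) * lam ** k / math.factorial(k)
    # Rescale so the positive counts share exactly the probability p_any.
    return p_any * poisson_k / (1.0 - math.exp(-lam))


def hurdle_mean(p_any, lam):
    """Expected number of awards under the same model."""
    return p_any * lam / (1.0 - math.exp(-lam))


# Arbitrary illustrative values, not estimates.
p_any, lam = 0.3, 1.5
probs = [hurdle_pmf(k, p_any, lam) for k in range(60)]
print(f"P(0 awards) = {probs[0]:.2f}, total probability = {sum(probs):.6f}")
print(f"expected awards = {hurdle_mean(p_any, lam):.3f}")
```

In a real analysis both `p_any` and `lam` would depend on program type and math score (for instance through a logistic and a log link respectively), and a count-model library would do the fitting.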
Finally, a small point about awards. If awards are exclusive, i.e. if one student gets the award then no other students can get the award, then your outcomes are coupled; one count for student a pushes down the possible count of every other one. Whether this is worth worrying about depends on the awards structure and the size of the student population. I’d ignore it at a first pass.
In conclusion, Poisson comfortably dominates Normal except for very large counts, but check the assumptions of the Poisson before leaning on it too heavily for inference, and be prepared to move to a mildly more complex model class if necessary.