I’m trying to fit a regression to explain the number of homicides in each district of a city.
Although I know that my data follows a Poisson distribution, I tried to fit an OLS like this:
Then, I also tried (of course!) a Poisson regression. The problem is that I have better results in the OLS regression: the pseudo-R2 is higher (0.71 vs 0.57) and the RMSE as well (3.8 vs 8.88. Standardized to have the same unit).
Why? Is it normal? What’s wrong on using the OLS no matter what the distribution of the data is?
Following the suggestions of kjetil b halvorsen and others, I fitted the data through two models: OLS and Negative Binomial GLM (NB). I started with all the features I have, then I recursively removed one by one the features which were not significant.
with weights = area.
summary(w <- lm(sqrt(num/area) ~ RNR_nres_non_daily + RNR_nres_daily + hType_mix_std + area_filtr + num_community_places+ num_intersect + pop_rat_num + employed + emp_rat_pop + nden_daily + nden_non_daily+ bld_rat_area + bor_rat_area + mdist_highways+ mdist_parks, data=p, weights=area)) error2 <- p$num - (predict(w, newdata=p[,-1:-2], type="response")**2)*p$area rmse(error2)  80.64783
The NB predicts the number of crime, with the district’s area as offset.
summary(m3 <- glm.nb(num ~ LUM5_single + RNR_nres + mdist_daily + mdist_non_daily+ hType_mix_std + ratio_daily_nondaily_area + area_filtr + num_community_places + employed + nden_daily + nden_non_daily+ bld_rat_area + bor_rat_area + mdist_smallparks + mdist_highways+ mdist_parks + offset(log(area)), data=p, maxit = 1000)) error <- p$num - predict(m3, newdata=p[,-1:-2], type="response") rmse(error)  121.8714
So the RMSE is lower in the OLS but it seems that the residuals are not so Normal….
I suspect that part of the problem may lie in your choice of performance metric. If you measure test performance using RMSE then training the model to minimise the MSE matches the test criterion, giving a hint as to what is considered important. You may find that if you measure the test performance using the negative log-likelihood of the test set using a Poisson likelihood that the Poisson model works better (as might be expected). This may be a minor issue compared with the other issues raised, but it might be a useful sanity check.