This seems so elementary, but I always get stuck at this point…
Most of the data I deal with are non-normal, and most of my analyses are based on a GLM structure. For my current analysis, I have a response variable that is “walking speed” (meters/minute). It’s easy for me to see that I cannot use OLS, but I am then very unsure which family (Gamma, Weibull, etc.) is appropriate!
I use Stata and look at diagnostics such as residual plots (e.g. residuals vs. fitted values), checks for heteroscedasticity, etc.
I am aware that count data can take the form of a rate (e.g. incidence rates) and have used gamma (the analog to overdispersed discrete negative binomial models), but I would just like a “smoking gun” to say YES, YOU HAVE THE RIGHT FAMILY. Is looking at the standardized residuals versus the fitted values the only, or the best, way to do this? I would also like to use a mixed model to account for some hierarchy in the data, but first I need to sort out which family best describes my response variable.
Any help appreciated. Stata language especially appreciated!
I have some tips:
(1) How residuals ought to compare to fits isn’t always all that obvious, so it’s good to be familiar with diagnostics for particular models. In logistic regression models, for example, the Hosmer-Lemeshow statistic is used to assess goodness of fit; leverage values tend to be small where the estimated odds are very large, very small or about even; & so on.
(2) Sometimes one family of models can be seen as a special case of another, so you can use a hypothesis test on a parameter to help you choose. Exponential vs Weibull, for example.
(3) Akaike’s Information Criterion is useful in choosing between different models, which includes choosing between different families.
(4) Theoretical/empirical knowledge about what you’re modelling narrows the field of plausible models.
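To make tips (2) and (3) concrete, here is a minimal sketch in plain Python (not Stata) that compares an exponential fit with a Weibull fit. The exponential is the Weibull with shape 1, so a likelihood-ratio test on the shape parameter applies, and the AIC of each fit can be compared directly. The data are simulated stand-ins for walking speeds, and the function names `exponential_loglik` and `weibull_mle` are hypothetical helpers written for this illustration, not part of any library.

```python
import math
import random


def exponential_loglik(x):
    """Maximised log-likelihood of an exponential model (MLE of the mean is the sample mean)."""
    n = len(x)
    return -n * math.log(sum(x) / n) - n


def weibull_mle(x, lo=1e-3, hi=200.0, tol=1e-10):
    """Return (shape, scale, loglik) at the Weibull maximum-likelihood estimate."""
    n = len(x)
    m = max(x)
    z = [xi / m for xi in x]              # rescale so z**k cannot overflow
    logz = [math.log(zi) for zi in z]
    mean_logz = sum(logz) / n

    def score(k):
        # Profile score equation for the shape: increasing in k, zero at the MLE.
        w = [zi ** k for zi in z]
        return sum(wi * li for wi, li in zip(w, logz)) / sum(w) - 1.0 / k - mean_logz

    # Bisection on the shape parameter.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    k = 0.5 * (lo + hi)

    # Given the shape, the scale has a closed form: scale**k = mean(x**k).
    scale = m * (sum(zi ** k for zi in z) / n) ** (1.0 / k)
    loglik = (n * math.log(k) - n * k * math.log(scale)
              + (k - 1) * sum(math.log(xi) for xi in x) - n)
    return k, scale, loglik


# Simulated data (an assumption for illustration, not real walking speeds).
random.seed(1)
speeds = [random.weibullvariate(80.0, 3.0) for _ in range(200)]

ll_exp = exponential_loglik(speeds)
shape, scale, ll_wei = weibull_mle(speeds)

# Likelihood-ratio test of H0: shape = 1 (exponential) against the Weibull.
lr_stat = max(2.0 * (ll_wei - ll_exp), 0.0)
p_value = math.erfc(math.sqrt(lr_stat / 2.0))   # upper tail of chi-square with 1 df

# AIC = 2 * (number of parameters) - 2 * log-likelihood; lower is better.
aic_exp = 2 * 1 - 2 * ll_exp
aic_wei = 2 * 2 - 2 * ll_wei

print(f"Weibull shape = {shape:.3f}, scale = {scale:.3f}")
print(f"LR statistic = {lr_stat:.2f}, p-value = {p_value:.3g}")
print(f"AIC exponential = {aic_exp:.1f}, AIC Weibull = {aic_wei:.1f}")
```

The same pattern applies to other pairs of candidate families: fit each by maximum likelihood to the same response, then compare AIC values, and use a likelihood-ratio test only where one family is nested in the other.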
But there’s no automatic way of finding the ‘right’ family; real-life data can come from distributions as complicated as you like, & the complexity of models that are worth trying to fit increases with the amount of data you have. This is part & parcel of Box’s dictum that no models are true but some are useful.
Re @gung’s comment: it appears the commonly used Hosmer-Lemeshow test is (a) surprisingly sensitive to the choice of bins, & (b) generally less powerful than some other tests against some relevant classes of alternative hypothesis. That doesn’t detract from point (1): it’s also good to be up-to-date.
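As a rough illustration of point (a), the sketch below fits a one-covariate logistic regression to simulated data and recomputes the Hosmer-Lemeshow statistic with several group counts. Everything here is an assumption made for the example: the data are simulated from a correctly specified model, `fit_logistic` and `hosmer_lemeshow` are hypothetical helpers, and the p-value uses a closed form for the chi-square tail that holds only for an even number of degrees of freedom, which is why only even group counts are tried.

```python
import math
import random


def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of P(y=1) = 1 / (1 + exp(-(a + b*x))); returns (a, b)."""
    a = b = 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Score vector.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix entries.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: parameters += inverse(information) * score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def hosmer_lemeshow(p, y, groups):
    """Return (statistic, p_value) using `groups` near-equal bins of sorted fitted probabilities."""
    df = groups - 2
    if df < 2 or df % 2 != 0:
        raise ValueError("this sketch only handles an even number of groups >= 4")
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for j in range(groups):
        chunk = pairs[j * n // groups:(j + 1) * n // groups]
        n_j = len(chunk)
        expected = sum(pi for pi, _ in chunk)
        observed = sum(yi for _, yi in chunk)
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / n_j))
    # Chi-square upper tail with even df: exp(-s/2) * sum_{i < df/2} (s/2)^i / i!
    half = stat / 2.0
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    return stat, p_value


# Simulated data from a correctly specified logistic model (illustrative only).
random.seed(2)
x = [random.gauss(0.0, 1.0) for _ in range(500)]
y = [1 if random.random() < 1.0 / (1.0 + math.exp(-(-0.5 + 1.0 * xi))) else 0 for xi in x]

a_hat, b_hat = fit_logistic(x, y)
p_hat = [1.0 / (1.0 + math.exp(-(a_hat + b_hat * xi))) for xi in x]

results = {g: hosmer_lemeshow(p_hat, y, g) for g in (6, 8, 10, 12)}
for g, (stat, p_value) in results.items():
    print(f"groups = {g:2d}: statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

The model and data are identical across the rows of output; only the number of bins changes, so any spread in the p-values reflects the binning choice alone.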