Mixed model vs. Pooling Standard Errors for Multi-site Studies – Why is a Mixed Model So Much More Efficient?

I’ve got a data set consisting of a series of “broken stick” monthly case counts from a handful of sites. I’m trying to get a single summary estimate using two different techniques:

Technique 1: Fit a “broken stick” Poisson GLM with a 0/1 indicator variable, using time and time^2 terms to control for trends over time. The indicator variable’s estimate and SE are then pooled across sites, either with a pretty straightforward method-of-moments technique or with the tlnise package in R to get a “Bayesian” estimate. This is similar to what Peng and Dominici do with air pollution data, but with fewer sites (~a dozen).
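For concreteness, here is a minimal sketch (in Python, with made-up per-site numbers) of one common method-of-moments pooling scheme, the DerSimonian–Laird estimator. The question’s actual pooling method may differ in detail; every value below is illustrative, not taken from the study:

```python
import numpy as np

# Hypothetical stage-one output: per-site indicator estimates and SEs
# (all numbers are made up for illustration).
beta = np.array([0.10, 0.80, -0.20, 0.60, 0.30, 0.90])
se = np.array([0.20, 0.25, 0.30, 0.22, 0.28, 0.35])
v = se**2

# Fixed-effect (inverse-variance) pooled mean and Cochran's Q
w = 1.0 / v
beta_fe = np.sum(w * beta) / np.sum(w)
Q = np.sum(w * (beta - beta_fe) ** 2)
k = len(beta)

# DerSimonian-Laird method-of-moments estimate of between-site variance tau^2
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

# Random-effects pooled estimate and its standard error
w_re = 1.0 / (v + tau2)
beta_pooled = np.sum(w_re * beta) / np.sum(w_re)
se_pooled = np.sqrt(1.0 / np.sum(w_re))
print(beta_pooled, se_pooled)
```

Note that the pooled SE can only see each site through its point estimate and SE; any between-site heterogeneity inflates it via tau^2.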

Technique 2: Abandon some of the site-specific control for trends in time and use a linear mixed model. Specifically:

# glmer() is the lme4 function for a Poisson GLMM; 'site' is the assumed
# grouping factor for the random intercept and trend terms
glmer(cases ~ indicator + (1 + month + I(month^2) | site) + offset(log(p)),
      family = poisson, data = data)


My question involves the standard errors that come out of these estimates. Technique 1, which actually uses weekly rather than monthly data and thus should have more precision, gives a standard error on the estimate of ~0.206 for the method-of-moments approach and ~0.306 for tlnise.

The lmer method gives a standard error of ~0.09. The effect estimates are reasonably close, so it doesn’t seem that the two approaches are zeroing in on different summary estimates so much as that the mixed model is vastly more efficient.

Is that something that’s reasonable to expect? If so, why are mixed models so much more efficient? Is this a general phenomenon, or a specific result of this model?

Answer

I know this is an old question, but it’s relatively popular and has a simple answer, so hopefully it’ll be helpful to others in the future. For a more in-depth treatment, see Christoph Lippert’s course on linear mixed models, which examines them in the context of genome-wide association studies, here. In particular, see Lecture 5.

The reason the mixed model works so much better is that it’s designed to account for exactly what you’re trying to control for: population structure. The “populations” in your study are the different sites, which may, for example, run slightly different but internally consistent implementations of the same protocol. And if the subjects of your study are people, people pooled from different sites are less likely to be related than people from the same site, so blood relatedness may play a role as well.

As opposed to the standard maximum-likelihood linear model, where we have $\mathcal{N}(Y|X\beta,\sigma^2 I)$, linear mixed models add an additional matrix called the kernel matrix $K$, which encodes the similarity between individuals, and fit the “random effects” so that similar individuals will have similar random effects. This gives rise to the model $\mathcal{N}(Y|X\beta + Zu,\sigma^2 I + \sigma_g^2 K)$.
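To make the efficiency point concrete, here is a small numerical sketch (in Python with NumPy; the sizes, the block-structured kernel, and the variance components are all made-up assumptions). Once the variance components are known, the mixed model’s fixed-effect estimate is generalized least squares under the covariance $\sigma^2 I + \sigma_g^2 K$, with standard errors coming from $(X^\top V^{-1} X)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 4 "sites" of 25 subjects each; K is a block
# similarity kernel that is 1 for pairs from the same site, 0 otherwise.
n_sites, per_site = 4, 25
n = n_sites * per_site
site = np.repeat(np.arange(n_sites), per_site)
K = (site[:, None] == site[None, :]).astype(float)

sigma2, sigma2_g = 1.0, 2.0              # assumed variance components
V = sigma2 * np.eye(n) + sigma2_g * K    # sigma^2 I + sigma_g^2 K

X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + covariate
beta_true = np.array([0.5, 1.0])

# Simulate Y ~ N(X beta, V)
Y = X @ beta_true + np.linalg.cholesky(V) @ rng.normal(size=n)

# With V known, the mixed-model fixed-effect estimate is GLS:
Vinv = np.linalg.inv(V)
cov_beta = np.linalg.inv(X.T @ Vinv @ X)  # standard errors come from this
beta_hat = cov_beta @ X.T @ Vinv @ Y
se = np.sqrt(np.diag(cov_beta))
print(beta_hat, se)
```

The site-induced correlation enters $V$ directly, so the summary estimate and its standard error are computed jointly from all the data in one stage, rather than by pooling per-site estimates afterwards.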

Since the linear mixed model explicitly controls for the population structure you are trying to account for, it’s no surprise that it outperformed the other regression techniques.

Attribution
Source: Link, Question Author: Fomite, Answer Author: Michael K