I am trying to use `lme4::glmer()` to fit a binomial generalized linear mixed model (GLMM) with a dependent variable that is not binary, but a continuous variable between zero and one. One can think of this variable as a probability; in fact, it is a probability as reported by human subjects (in an experiment that I am helping to analyze). I.e. it is not a "discrete" fraction, but a continuous variable.

The `glmer()` call does not work as expected (see below). Why? What can I do?
Later edit: my answer below is more general than the original version of this question, so I modified the question to be more general too.
Apparently it is possible to use logistic regression not only for a binary DV but also for a continuous DV between zero and one. Indeed, when I run
glm(reportedProbability ~ a + b + c, myData, family="binomial")
I get a warning message
Warning message: In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
but a very reasonable fit (all factors are categorical, so I can easily check whether model predictions are close to the across-subjects-means, and they are).
However, what I actually want to use is
glmer(reportedProbability ~ a + b + c + (1 | subject), myData, family="binomial")
It gives me the identical warning, returns a model, but this model is clearly very much off: the estimates of the fixed effects are very far from the `glm()` ones and from the across-subject means. (And I need to include `glmerControl(optimizer="bobyqa")` in the `glmer` call, otherwise it does not converge at all.)
It makes sense to start with a simpler case of no random effects.
There are four ways to deal with a continuous zero-to-one response variable that behaves like a fraction or a probability (this is our most canonical/upvoted/viewed thread on this topic, but unfortunately not all four options are discussed there):
1. If it is a fraction $p=m/n$ of two integers and all $n$s are known, then one can use standard logistic regression, aka binomial GLM. One way to code it in R (assuming that `n` is a vector of length $N$ holding the denominator for each data point):

glm(p ~ a+b+c, myData, family="binomial", weights=n)
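For concreteness, here is a minimal simulated sketch of this approach (the variable names `a` and `n` are illustrative, not taken from the actual `myData`):

```r
# Hypothetical simulated data: p is a fraction m/n with known denominators n
set.seed(1)
N <- 500
n <- sample(10:50, N, replace = TRUE)    # known number of trials per data point
a <- rnorm(N)
p_true <- plogis(-0.5 + 1.0 * a)         # true success probability
m <- rbinom(N, size = n, prob = p_true)  # integer number of successes
myData <- data.frame(p = m / n, a = a, n = n)

# Proportion response with weights = n; equivalent to
# glm(cbind(m, n - m) ~ a, family = "binomial")
fit <- glm(p ~ a, myData, family = "binomial", weights = n)
coef(fit)  # close to the true values (-0.5, 1.0)
```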
2. If $p$ is not a fraction of two integers, then one can use beta regression. This will only work if the observed $p$ is never equal to $0$ or $1$. If it is, then more complicated zero/one-inflated beta models are possible, but this becomes more involved (see this thread).
betareg(p ~ a+b+c, myData)
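As a sketch, with data simulated for illustration (this assumes the betareg package is installed; names are hypothetical):

```r
library(betareg)

# Hypothetical data: p drawn from a beta distribution, strictly inside (0, 1)
set.seed(1)
a <- rnorm(200)
mu <- plogis(0.5 + 1.0 * a)               # mean, modeled on the logit scale
phi <- 20                                 # precision parameter
p <- rbeta(200, mu * phi, (1 - mu) * phi)
myData <- data.frame(p = p, a = a)

fit <- betareg(p ~ a, myData)
coef(fit)  # mean-model coefficients near (0.5, 1.0), precision (phi) near 20
```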
3. Logit-transform the response and use linear regression. This is usually not advised.
lm(log(p/(1-p)) ~ a+b+c, myData)
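A sketch of what this looks like on simulated data (note that the transform fails if `p` contains exact $0$s or $1$s):

```r
# Hypothetical data generated directly on the logit scale
set.seed(1)
a <- rnorm(200)
p <- plogis(0.3 + 0.8 * a + rnorm(200, sd = 0.4))  # strictly inside (0, 1)
myData <- data.frame(p = p, a = a)

# log(p/(1-p)) is the same as qlogis(p)
fit <- lm(log(p / (1 - p)) ~ a, myData)
coef(fit)  # roughly recovers (0.3, 0.8) on the logit scale
```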
4. Fit a binomial model but then compute standard errors taking over-dispersion into account. The standard errors can be computed in various ways, e.g. (a) scaled by the estimated overdispersion, or (b) via the sandwich estimator ("robust" standard errors).

Approaches (a) and (b) are not identical (see this comment, and sections 3.4.1 and 3.4.2 in this book, and this SO post and also this one and this one), but tend to give similar results. Option (a) is implemented in the quasibinomial family:

glm(p ~ a+b+c, myData, family="quasibinomial")
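A small sketch on simulated data, showing that quasibinomial changes only the standard errors, not the point estimates (data and names are hypothetical):

```r
# Continuous proportions: family = "binomial" warns but fits;
# quasibinomial gives identical coefficients with rescaled standard errors
set.seed(1)
a <- rnorm(300)
p <- plogis(0.2 + 1.0 * a + rnorm(300, sd = 0.5))
myData <- data.frame(p = p, a = a)

f1 <- suppressWarnings(glm(p ~ a, myData, family = "binomial"))
f2 <- glm(p ~ a, myData, family = "quasibinomial")

all.equal(coef(f1), coef(f2))  # TRUE: same point estimates
summary(f2)$dispersion         # estimated dispersion used to scale the SEs
```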
The same four ways are available with random effects.

1. Using a binomial mixed model:

glmer(p ~ a+b+c + (1|subject), myData, family="binomial", weights=n)

According to the second link above, it might be a good idea to model overdispersion, see there (and also #4 below).
2. Using a beta mixed model:

glmmadmb(p ~ a+b+c + (1|subject), myData, family="beta")

or

glmmTMB(p ~ a+b+c + (1|subject), myData, family=list(family="beta",link="logit"))

If there are exact zeros or ones in the response data, then one can use a zero/one-inflated beta model in `glmmTMB`.
3. Using a logit transform of the response:
lmer(log(p/(1-p)) ~ a+b+c + (1|subject), myData)
4. Accounting for overdispersion in the binomial model. This uses a different trick: adding a random effect for each data point:

myData$rowid = as.factor(1:nrow(myData))
glmer(p ~ a+b+c + (1|subject) + (1|rowid), myData, family="binomial",
      glmerControl(optimizer="bobyqa"))

For some reason this does not work properly, as `glmer()` complains about non-integer `p` and yields nonsense estimates. A solution that I came up with is to use a fake constant `weights=k` and make sure that `p*k` is always an integer. This requires rounding `p`, but by selecting `k` large enough it should not matter much. The results do not seem to depend on the value of `k`:

k = 100
glmer(round(p*k)/k ~ a+b+c + (1|subject) + (1|rowid), myData, family="binomial",
      weights=rep(k, nrow(myData)), glmerControl(optimizer="bobyqa"))
Later update (Jan 2018): This might be an invalid approach. See discussion here. I have to investigate this more.
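One can sanity-check the rounding part of this trick without the random-effects machinery, using plain `glm()` on simulated data (names hypothetical); the estimates barely move between `k = 100` and `k = 1000`:

```r
# Hypothetical continuous proportions
set.seed(1)
a <- rnorm(300)
p <- plogis(0.2 + 1.0 * a + rnorm(300, sd = 0.3))
myData <- data.frame(p = p, a = a)

# round(p*k)/k times weight k equals round(p*k), an integer,
# so glm() emits no non-integer-successes warning
fit_k <- function(k) {
  coef(glm(round(p * k) / k ~ a, myData, family = "binomial",
           weights = rep(k, nrow(myData))))
}
fit_k(100)
fit_k(1000)  # nearly identical: results do not depend much on k
```

(Scaling all weights by a common constant leaves the point estimates unchanged; only the rounding resolution differs between the two fits.)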
In my specific case option #1 is not available.
Option #2 is very slow and has issues with converging: `glmmadmb` takes five to ten minutes to run (and still complains that it did not converge!), whereas `lmer` works in a split second and `glmer` takes a couple of seconds. Update: I tried `glmmTMB`, as suggested in the comments by @BenBolker, and it works almost as fast as `glmer`, without any convergence issues. So this is what I will be using.
Options #3 and #4 yield very similar estimates and very similar Wald confidence intervals (obtained with `confint`). I am not a big fan of #3, though, because it is kind of cheating. And #4 feels somewhat hacky.
Huge thanks to @Aaron who pointed me towards #3 and #4 in his comment.