How to fit a mixed model with response variable between 0 and 1?

I am trying to use lme4::glmer() to fit a binomial generalized linear mixed model (GLMM) with a dependent variable that is not binary but a continuous variable between zero and one. One can think of this variable as a probability; in fact, it is a probability as reported by human subjects (in an experiment that I am helping to analyze). That is, it's not a "discrete" fraction, but a continuous variable.

My glmer() call doesn’t work as expected (see below). Why? What can I do?

Later edit: my answer below is more general than the original version of this question, so I modified the question to be more general too.


More details

Apparently it is possible to use logistic regression not only for a binary DV but also for a continuous DV between zero and one. Indeed, when I run

glm(reportedProbability ~ a + b + c, myData, family="binomial")

I get a warning message

Warning message:
In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!

but a very reasonable fit (all factors are categorical, so I can easily check whether the model predictions are close to the across-subjects means, and they are; a sketch of this check is below).
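
Here is a minimal sketch of that check, assuming a, b, and c are factors in myData (the variable names are just the ones from the call above):

fit <- glm(reportedProbability ~ a + b + c, myData, family = "binomial")
# compare the observed mean in each factor cell with the fitted probability
cellMeans <- aggregate(reportedProbability ~ a + b + c, myData, mean)
cellMeans$predicted <- predict(fit, newdata = cellMeans, type = "response")
plot(cellMeans$reportedProbability, cellMeans$predicted)
abline(0, 1)  # points should lie close to the diagonal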

However, what I actually want to use is

glmer(reportedProbability ~ a + b + c + (1 | subject), myData, family="binomial")

It gives me the identical warning and returns a model, but this model is clearly way off: the estimates of the fixed effects are very far from the glm() ones and from the across-subjects means. (I also need to include glmerControl(optimizer="bobyqa") in the glmer call, otherwise it does not converge at all; the full call is shown below.)
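
For reference, here is the full call with the optimizer setting (control is lme4's argument for passing a glmerControl object):

glmer(reportedProbability ~ a + b + c + (1 | subject), myData,
      family = "binomial", control = glmerControl(optimizer = "bobyqa"))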

Answer

It makes sense to start with the simpler case of no random effects.

There are four ways to deal with a continuous zero-to-one response variable that behaves like a fraction or a probability (this is our most canonical/upvoted/viewed thread on this topic, but unfortunately not all four options are discussed there):

  1. If it is a fraction $p=m/n$ of two integers and all $n$s are known, then one can use standard logistic regression, aka binomial GLM. One way to code it in R (assuming that n is a vector holding the value of $n$ for each data point):

    glm(p ~ a+b+c, myData, family="binomial", weights=n)
    
  2. If $p$ is not a fraction of two integers, then one can use beta regression. This will only work if the observed $p$ is never equal to $0$ or $1$. If it is, then more complicated zero/one-inflated beta models are possible, but this becomes more involved (see this thread).

    library(betareg)
    betareg(p ~ a+b+c, myData)
    
  3. Logit-transform the response and use linear regression. This is usually not advised.

    lm(log(p/(1-p)) ~ a+b+c, myData)  # requires 0 < p < 1 everywhere
    
  4. Fit a binomial model and then compute standard errors that take overdispersion into account. The standard errors can be computed in various ways:

    • (a) scaled standard errors via the overdispersion estimate (one, two). This is called a “quasi-binomial” GLM.

    • (b) robust standard errors via the sandwich estimator (one, two, three, four). This is called “fractional logit” in econometrics.

    Approaches (a) and (b) are not identical (see this comment, and sections 3.4.1 and 3.4.2 in this book, and this SO post, and also this one and this one), but they tend to give similar results. Option (a) is implemented in glm as follows, and a sketch of option (b) follows after it:

    glm(p ~ a+b+c, myData, family="quasibinomial")
    
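    Option (b) is not built into glm itself. Here is a minimal sketch using the sandwich and lmtest packages (one common way to get robust standard errors in R; other routes exist):

    library(sandwich)
    library(lmtest)
    # fractional logit: ordinary binomial fit, then robust (sandwich) SEs
    fit <- glm(p ~ a+b+c, myData, family="binomial")
    coeftest(fit, vcov. = vcovHC(fit, type="HC0"))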

The same four ways are available with random effects.

  1. Using the weights argument (one, two):

    glmer(p ~ a+b+c + (1|subject), myData, family="binomial", weights=n)
    

    According to the second link above, it might be a good idea to model overdispersion; see there (and also #4 below).

  2. Using a beta mixed model:

    library(glmmADMB)  # glmmADMB package, not on CRAN
    glmmadmb(p ~ a+b+c + (1|subject), myData, family="beta")
    

    or

    library(glmmTMB)
    glmmTMB(p ~ a+b+c + (1|subject), myData,
            family=beta_family(link="logit"))  # beta regression, logit link
    

    If there are exact zeros or ones in the response data, then one can use a zero/one-inflated beta model in glmmTMB, as sketched below.
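
    A minimal sketch of the zero-inflated case, using glmmTMB's ziformula argument (exact ones would still need additional handling, e.g. modeling them separately):

    # ziformula adds a logistic sub-model for Pr(p == 0)
    glmmTMB(p ~ a+b+c + (1|subject), myData,
            family=beta_family(link="logit"), ziformula=~1)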

  3. Using a logit transform of the response:

    lmer(log(p/(1-p)) ~ a+b+c + (1|subject), myData)
    
  4. Accounting for overdispersion in the binomial model. This uses a different trick: adding an observation-level random effect, i.e. a random effect for each data point:

    myData$rowid = as.factor(1:nrow(myData))
    glmer(p ~ a+b+c + (1|subject) + (1|rowid), myData, family="binomial",
          control=glmerControl(optimizer="bobyqa"))
    

    For some reason this does not work properly: glmer() complains about non-integer p and yields nonsense estimates. A solution that I came up with is to use a fake constant weights=k and to make sure that p*k is always an integer. This requires rounding p, but by selecting a k that is large enough it should not matter much. The results do not seem to depend on the value of k.

    k = 100
    glmer(round(p*k)/k ~ a+b+c + (1|subject) + (1|rowid), myData,
          family="binomial", weights=rep(k, nrow(myData)),
          control=glmerControl(optimizer="bobyqa"))
    

    Later update (Jan 2018): This might be an invalid approach. See discussion here. I have to investigate this more.


In my specific case option #1 is not available.

Option #2 is very slow and has convergence issues: glmmadmb takes five to ten minutes to run (and still complains that it did not converge!), whereas lmer runs in a split second and glmer takes a couple of seconds. Update: I tried glmmTMB as suggested in the comments by @BenBolker, and it works almost as fast as glmer, without any convergence issues. So this is what I will be using.

Options #3 and #4 yield very similar estimates and very similar Wald confidence intervals (obtained with confint; see the sketch below). I am not a big fan of #3, though, because it is kind of cheating. And #4 feels somewhat hacky.
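
For reference, a minimal sketch of how such Wald intervals can be obtained from a fitted merMod object, using option #3 as an example:

    # Wald confidence intervals for the fixed effects (lme4's confint.merMod)
    fitLogit = lmer(log(p/(1-p)) ~ a+b+c + (1|subject), myData)
    confint(fitLogit, method="Wald")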

Huge thanks to @Aaron who pointed me towards #3 and #4 in his comment.

Attribution
Source: Link, Question Author: amoeba, Answer Author: amoeba
