Input format for response in binomial glm in R

In R, there are three methods to format the input data for a logistic regression using the glm function:

  1. Data can be in a “binary” format for each observation (e.g., y = 0 or 1 for each observation);
  2. Data can be in the “Wilkinson-Rogers” format (e.g., y = cbind(success, failure)) with each row representing one treatment; or
  3. Data can be in a weighted format for each observation (e.g., y = 0.3, weights = 10).

All three approach produces the same coefficient estimates, but differ in the degrees of freedom and resulting deviance values and AIC scores. The last two methods have fewer observations (and therefore degrees of freedom) because they use each treatment for the number of observations whereas the first uses each observation for the number of observations.

My question: Are there numerical or statistical advantages to using one input format over another? The only advantage I see is not having to reformat one’s data in R to use with the model.

I have looked at the glm documentation, searched on the web, and this site and found one tangentially related post, but no guidance on this topic.

Here is a simulated example that demonstrates this behavior:

# Write function to help simulate data
drc4 <- function(x, b =1.0, c = 0, d = 1, e = 0){
    (d - c)/ (1 + exp(-b * (log(x)  - log(e))))
}
# simulate long form of dataset
nReps = 20
dfLong <- data.frame(dose = rep(seq(0, 10, by = 2), each = nReps))
dfLong$mortality <-rbinom(n = dim(dfLong)[1], size = 1,
                              prob = drc4(dfLong$dose, b = 2, e = 5))

# aggregate to create short form of dataset
dfShort <- aggregate(dfLong$mortality, by = list(dfLong$dose), 
                     FUN = sum)
colnames(dfShort) <- c("dose", "mortality")
dfShort$survival <- nReps - dfShort$mortality 
dfShort$nReps <- nReps
dfShort$mortalityP <- dfShort$mortality / dfShort$nReps

fitShort <- glm( cbind(mortality, survival) ~ dose, 
                 data = dfShort, 
                 family = "binomial")
summary(fitShort)

fitShortP <- glm( mortalityP ~ dose, data = dfShort, 
                  weights = nReps,     
                  family = "binomial")
summary(fitShortP)

fitLong <- glm( mortality ~ dose, data = dfLong, 
                family = "binomial")
summary(fitLong)

Answer

There’s no statistical reason to prefer one to the other, besides conceptual clarity. Although the reported deviance values are different, these differences are completely due to the saturated model. So any comparison using relative deviance between models is unaffected, since the saturated model log-likelihood cancels.

I think it’s useful to go through the explicit deviance calculation.

The deviance of a model is 2*(LL(Saturated Model) – LL(Model)). Suppose you have i different cells, where ni is the number of observations in cell i, pi is the model prediction for all observations in cell i, and yij is the observed value (0 or 1) for the j-th observation in cell i.

Long Form

The log likelihood of the (proposed or null) model is
ij(log(pi)yij+log(1pi)(1yij))

and the log likelihood of the saturated model is ij(log(yij)yij+log(1yij)(1yij)). This is equal to 0, because yij is either 0 or 1. Note log(0) is undefined, but for convenience please read 0log(0) as shorthand for lim, which is 0.

Short form (weighted)

Note that a binomial distribution can’t actually take non-integer values, but we can nonetheless calculate a “log likelihood” by using the fraction of observed successes in each cell as the response, and weighting each summand in the log-likelihood calculation by the number of observations in that cell.

\sum_in_i \left(\log(p_i)\sum_jy_{ij}/n_i + \log(1 – p_i)(1 – \sum_j(y_{ij}/n_i)\right)

This is exactly equal to the model deviance we calculated above, which you can see by pulling in the sum over j in the long form equation as far as possible.

Meanwhile the saturated deviance is different. Since we no longer have 0-1 responses, even with one parameter per observation we can’t get exactly 0. Instead the saturated model log-likelihood is

\sum_i n_i\left(\log(\sum_jy_{ij}/n_i)\sum_jy_{ij}/n_i + \log(1 – \sum_jy_{ij}/n_i)(1-\sum_jy_{ij}/n_i)\right).

In your example, you can verify that twice this amount is the difference between the reported null and residual deviance values for both models.

ni = dfShort$nReps
yavg = dfShort$mortalityP
sum.terms <-ni*(log(yavg)*yavg + log(1 - yavg)*(1 - yavg))
# Need to handle NaN when yavg is exactly 0
sum.terms[1] <- log(1 - yavg[1])*(1 - yavg[1])

2*sum(sum.terms)
fitShortP$deviance - fitLong$deviance

Attribution
Source : Link , Question Author : Richard Erickson , Answer Author : gung – Reinstate Monica

Leave a Comment