In using the cbind() function in R for a logistic regression on a 2×22 \times 2 table, what is the explicit functional form of the regression equation?

Suppose I have a 2×2 table that looks like:

            Disease       No Disease
Treatment         55                67
Control           42                34

I would like to do a logistic regression in R on this table. I understand that the standard way is to use the glm function with a cbind function in the response. In other words, the code looks like:

glm(formula = cbind(c(55,67),c(42,34)) ~ as.factor(c(1, 0)), family = binomial())

I am wondering why R requires us to use the cbind function and why simply using proportions is not sufficient. Is there a way to write this out explicitly as a formula? What would it look in the form of:

log(p1p)=β0+β1X

where X=1 if we have treatment and X=0 for control?

Right now it seems like I am regressing on a matrix for the dependent value.

Answer

First I show how you can specify a formula using aggregated data with proportions and weights. Then I show how you could specify a formula after dis-aggregating your data to individual observations.

Documentation inglm indicates that:

“For a binomial GLM prior weights are used to give the number of
trials when the response is the proportion of successes”

I create new columns total and proportion_disease in df for the ‘number of trials’ and ‘proportion of successes’ respectively.

library(dplyr)
df <- tibble(treatment_status = c("treatment", "no_treatment"),
       disease = c(55, 42),
       no_disease = c(67,34)) %>% 
  mutate(total = no_disease + disease,
         proportion_disease = disease / total) 

model_weighted <- glm(proportion_disease ~ treatment_status, data = df, family = binomial("logit"), weights = total)

The above weighted approach takes in aggregated data and will give the same solution as the cbind method but allows you to specify a formula. (Below is equivalent to Original Poster’s method but cbind(c(55,42), c(67,34)) rather than cbind(c(55,67),c(42,34)) so that ‘Disease’ rather than ‘Treatment’ is the response variable.)

model_cbinded <- glm(cbind(disease, no_disease) ~ treatment_status, data = df, family = binomial("logit"))  

You could also just dis-aggregate your data into individual observations and pass these into glm (allowing you to specify a formula as well).

df_expanded <- tibble(disease_status = c(1, 1, 0, 0), 
                      treatment_status = rep(c("treatment", "control"), 2)) %>%
                        .[c(rep(1, 55), rep(2, 42), rep(3, 67), rep(4, 34)), ]

model_expanded <- glm(disease_status ~ treatment_status, data = df_expanded, family = binomial("logit"))

Let’s compare these now by passing each model into summary. model_weighted and model_cbinded both produce the exact same results. model_expanded produces the same coefficients and standard errors, though outputs different degrees of freedom, deviance, AIC, etc. (corresponding with the number of rows/observations).

    > lapply(list(model_weighted, model_cbinded, model_expanded), summary)
[[1]]

Call:
glm(formula = proportion_disease ~ treatment_status, family = binomial("logit"), 
    data = df, weights = total)

Deviance Residuals: 
[1]  0  0

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.9451e+00  on 1  degrees of freedom
Residual deviance: 1.0658e-14  on 0  degrees of freedom
AIC: 14.028

Number of Fisher Scoring iterations: 2


[[2]]

Call:
glm(formula = cbind(disease, no_disease) ~ treatment_status, 
    family = binomial("logit"), data = df)

Deviance Residuals: 
[1]  0  0

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.9451e+00  on 1  degrees of freedom
Residual deviance: 1.0658e-14  on 0  degrees of freedom
AIC: 14.028

Number of Fisher Scoring iterations: 2


[[3]]

Call:
glm(formula = disease_status ~ treatment_status, family = binomial("logit"), 
    data = df_expanded)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.268  -1.095  -1.095   1.262   1.262  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 274.41  on 197  degrees of freedom
Residual deviance: 272.46  on 196  degrees of freedom
AIC: 276.46

Number of Fisher Scoring iterations: 3

(See R bloggers for conversation on weights parameter in glm in the regression context.)

Attribution
Source : Link , Question Author : user321627 , Answer Author : Bryan Shalloway

Leave a Comment