# In using the cbind() function in R for a logistic regression on a 2×22 \times 2 table, what is the explicit functional form of the regression equation?

Suppose I have a $$2×22 \times 2$$ table that looks like:

            Disease       No Disease
Treatment         55                67
Control           42                34


I would like to do a logistic regression in R on this table. I understand that the standard way is to use the glm function with a cbind function in the response. In other words, the code looks like:

glm(formula = cbind(c(55,67),c(42,34)) ~ as.factor(c(1, 0)), family = binomial())


I am wondering why R requires us to use the cbind function and why simply using proportions is not sufficient. Is there a way to write this out explicitly as a formula? What would it look in the form of:

$$log(p1−p)=β0+β1X log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1X$$

where $$X=1X = 1$$ if we have treatment and $$X=0X=0$$ for control?

Right now it seems like I am regressing on a matrix for the dependent value.

First I show how you can specify a formula using aggregated data with proportions and weights. Then I show how you could specify a formula after dis-aggregating your data to individual observations.

Documentation inglm indicates that:

“For a binomial GLM prior weights are used to give the number of
trials when the response is the proportion of successes”

I create new columns total and proportion_disease in df for the ‘number of trials’ and ‘proportion of successes’ respectively.

library(dplyr)
df <- tibble(treatment_status = c("treatment", "no_treatment"),
disease = c(55, 42),
no_disease = c(67,34)) %>%
mutate(total = no_disease + disease,
proportion_disease = disease / total)

model_weighted <- glm(proportion_disease ~ treatment_status, data = df, family = binomial("logit"), weights = total)


The above weighted approach takes in aggregated data and will give the same solution as the cbind method but allows you to specify a formula. (Below is equivalent to Original Poster’s method but cbind(c(55,42), c(67,34)) rather than cbind(c(55,67),c(42,34)) so that ‘Disease’ rather than ‘Treatment’ is the response variable.)

model_cbinded <- glm(cbind(disease, no_disease) ~ treatment_status, data = df, family = binomial("logit"))


You could also just dis-aggregate your data into individual observations and pass these into glm (allowing you to specify a formula as well).

df_expanded <- tibble(disease_status = c(1, 1, 0, 0),
treatment_status = rep(c("treatment", "control"), 2)) %>%
.[c(rep(1, 55), rep(2, 42), rep(3, 67), rep(4, 34)), ]

model_expanded <- glm(disease_status ~ treatment_status, data = df_expanded, family = binomial("logit"))


Let’s compare these now by passing each model into summary. model_weighted and model_cbinded both produce the exact same results. model_expanded produces the same coefficients and standard errors, though outputs different degrees of freedom, deviance, AIC, etc. (corresponding with the number of rows/observations).

    > lapply(list(model_weighted, model_cbinded, model_expanded), summary)
[[1]]

Call:
glm(formula = proportion_disease ~ treatment_status, family = binomial("logit"),
data = df, weights = total)

Deviance Residuals:
[1]  0  0

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1.9451e+00  on 1  degrees of freedom
Residual deviance: 1.0658e-14  on 0  degrees of freedom
AIC: 14.028

Number of Fisher Scoring iterations: 2

[[2]]

Call:
glm(formula = cbind(disease, no_disease) ~ treatment_status,
family = binomial("logit"), data = df)

Deviance Residuals:
[1]  0  0

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1.9451e+00  on 1  degrees of freedom
Residual deviance: 1.0658e-14  on 0  degrees of freedom
AIC: 14.028

Number of Fisher Scoring iterations: 2

[[3]]

Call:
glm(formula = disease_status ~ treatment_status, family = binomial("logit"),
data = df_expanded)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.268  -1.095  -1.095   1.262   1.262

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 274.41  on 197  degrees of freedom
Residual deviance: 272.46  on 196  degrees of freedom
AIC: 276.46

Number of Fisher Scoring iterations: 3


(See R bloggers for conversation on weights parameter in glm in the regression context.)