Suppose I have a 2×2 table that looks like:

`Disease No Disease Treatment 55 67 Control 42 34`

I would like to do a logistic regression in R on this table. I understand that the standard way is to use the

`glm`

function with a`cbind`

function in the response. In other words, the code looks like:`glm(formula = cbind(c(55,67),c(42,34)) ~ as.factor(c(1, 0)), family = binomial())`

I am wondering why

`R`

requires us to use the`cbind`

function and why simply using proportions is not sufficient. Is there a way to write this out explicitly as a formula? What would it look in the form of:log(p1−p)=β0+β1X

where X=1 if we have treatment and X=0 for control?

Right now it seems like I am regressing on a matrix for the dependent value.

**Answer**

First I show how you can specify a formula using aggregated data with proportions and weights. Then I show how you could specify a formula after dis-aggregating your data to individual observations.

Documentation in`glm`

indicates that:

“For a binomial GLM prior weights are used to give the number of

trials when the response is the proportion of successes”

I create new columns `total`

and `proportion_disease`

in `df`

for the ‘number of trials’ and ‘proportion of successes’ respectively.

```
library(dplyr)
df <- tibble(treatment_status = c("treatment", "no_treatment"),
disease = c(55, 42),
no_disease = c(67,34)) %>%
mutate(total = no_disease + disease,
proportion_disease = disease / total)
model_weighted <- glm(proportion_disease ~ treatment_status, data = df, family = binomial("logit"), weights = total)
```

The above weighted approach takes in aggregated data and will give the same solution as the `cbind`

method but allows you to specify a formula. (Below is equivalent to Original Poster’s method but `cbind(c(55,42), c(67,34))`

rather than `cbind(c(55,67),c(42,34))`

so that ‘Disease’ rather than ‘Treatment’ is the response variable.)

```
model_cbinded <- glm(cbind(disease, no_disease) ~ treatment_status, data = df, family = binomial("logit"))
```

You could also just dis-aggregate your data into individual observations and pass these into `glm`

(allowing you to specify a formula as well).

```
df_expanded <- tibble(disease_status = c(1, 1, 0, 0),
treatment_status = rep(c("treatment", "control"), 2)) %>%
.[c(rep(1, 55), rep(2, 42), rep(3, 67), rep(4, 34)), ]
model_expanded <- glm(disease_status ~ treatment_status, data = df_expanded, family = binomial("logit"))
```

Let’s compare these now by passing each model into `summary`

. **model_weighted** and **model_cbinded** both produce the exact same results. **model_expanded** produces the same coefficients and standard errors, though outputs different degrees of freedom, deviance, AIC, etc. (corresponding with the number of rows/observations).

```
> lapply(list(model_weighted, model_cbinded, model_expanded), summary)
[[1]]
Call:
glm(formula = proportion_disease ~ treatment_status, family = binomial("logit"),
data = df, weights = total)
Deviance Residuals:
[1] 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2113 0.2307 0.916 0.360
treatment_statustreatment -0.4087 0.2938 -1.391 0.164
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.9451e+00 on 1 degrees of freedom
Residual deviance: 1.0658e-14 on 0 degrees of freedom
AIC: 14.028
Number of Fisher Scoring iterations: 2
[[2]]
Call:
glm(formula = cbind(disease, no_disease) ~ treatment_status,
family = binomial("logit"), data = df)
Deviance Residuals:
[1] 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2113 0.2307 0.916 0.360
treatment_statustreatment -0.4087 0.2938 -1.391 0.164
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.9451e+00 on 1 degrees of freedom
Residual deviance: 1.0658e-14 on 0 degrees of freedom
AIC: 14.028
Number of Fisher Scoring iterations: 2
[[3]]
Call:
glm(formula = disease_status ~ treatment_status, family = binomial("logit"),
data = df_expanded)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.268 -1.095 -1.095 1.262 1.262
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2113 0.2307 0.916 0.360
treatment_statustreatment -0.4087 0.2938 -1.391 0.164
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 274.41 on 197 degrees of freedom
Residual deviance: 272.46 on 196 degrees of freedom
AIC: 276.46
Number of Fisher Scoring iterations: 3
```

(See R bloggers for conversation on `weights`

parameter in `glm`

in the regression context.)

**Attribution***Source : Link , Question Author : user321627 , Answer Author : Bryan Shalloway*