Suppose I have a 2×2 table that looks like:
Disease No Disease Treatment 55 67 Control 42 34
I would like to do a logistic regression in R on this table. I understand that the standard way is to use the
glm
function with acbind
function in the response. In other words, the code looks like:glm(formula = cbind(c(55,67),c(42,34)) ~ as.factor(c(1, 0)), family = binomial())
I am wondering why
R
requires us to use thecbind
function and why simply using proportions is not sufficient. Is there a way to write this out explicitly as a formula? What would it look in the form of:log(p1−p)=β0+β1X
where X=1 if we have treatment and X=0 for control?
Right now it seems like I am regressing on a matrix for the dependent value.
Answer
First I show how you can specify a formula using aggregated data with proportions and weights. Then I show how you could specify a formula after dis-aggregating your data to individual observations.
Documentation inglm
indicates that:
“For a binomial GLM prior weights are used to give the number of
trials when the response is the proportion of successes”
I create new columns total
and proportion_disease
in df
for the ‘number of trials’ and ‘proportion of successes’ respectively.
library(dplyr)
df <- tibble(treatment_status = c("treatment", "no_treatment"),
disease = c(55, 42),
no_disease = c(67,34)) %>%
mutate(total = no_disease + disease,
proportion_disease = disease / total)
model_weighted <- glm(proportion_disease ~ treatment_status, data = df, family = binomial("logit"), weights = total)
The above weighted approach takes in aggregated data and will give the same solution as the cbind
method but allows you to specify a formula. (Below is equivalent to Original Poster’s method but cbind(c(55,42), c(67,34))
rather than cbind(c(55,67),c(42,34))
so that ‘Disease’ rather than ‘Treatment’ is the response variable.)
model_cbinded <- glm(cbind(disease, no_disease) ~ treatment_status, data = df, family = binomial("logit"))
You could also just dis-aggregate your data into individual observations and pass these into glm
(allowing you to specify a formula as well).
df_expanded <- tibble(disease_status = c(1, 1, 0, 0),
treatment_status = rep(c("treatment", "control"), 2)) %>%
.[c(rep(1, 55), rep(2, 42), rep(3, 67), rep(4, 34)), ]
model_expanded <- glm(disease_status ~ treatment_status, data = df_expanded, family = binomial("logit"))
Let’s compare these now by passing each model into summary
. model_weighted and model_cbinded both produce the exact same results. model_expanded produces the same coefficients and standard errors, though outputs different degrees of freedom, deviance, AIC, etc. (corresponding with the number of rows/observations).
> lapply(list(model_weighted, model_cbinded, model_expanded), summary)
[[1]]
Call:
glm(formula = proportion_disease ~ treatment_status, family = binomial("logit"),
data = df, weights = total)
Deviance Residuals:
[1] 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2113 0.2307 0.916 0.360
treatment_statustreatment -0.4087 0.2938 -1.391 0.164
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.9451e+00 on 1 degrees of freedom
Residual deviance: 1.0658e-14 on 0 degrees of freedom
AIC: 14.028
Number of Fisher Scoring iterations: 2
[[2]]
Call:
glm(formula = cbind(disease, no_disease) ~ treatment_status,
family = binomial("logit"), data = df)
Deviance Residuals:
[1] 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2113 0.2307 0.916 0.360
treatment_statustreatment -0.4087 0.2938 -1.391 0.164
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.9451e+00 on 1 degrees of freedom
Residual deviance: 1.0658e-14 on 0 degrees of freedom
AIC: 14.028
Number of Fisher Scoring iterations: 2
[[3]]
Call:
glm(formula = disease_status ~ treatment_status, family = binomial("logit"),
data = df_expanded)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.268 -1.095 -1.095 1.262 1.262
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2113 0.2307 0.916 0.360
treatment_statustreatment -0.4087 0.2938 -1.391 0.164
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 274.41 on 197 degrees of freedom
Residual deviance: 272.46 on 196 degrees of freedom
AIC: 276.46
Number of Fisher Scoring iterations: 3
(See R bloggers for conversation on weights
parameter in glm
in the regression context.)
Attribution
Source : Link , Question Author : user321627 , Answer Author : Bryan Shalloway