# Does it make sense to use Logistic regression with binary outcome and predictor?

I have a binary outcome variable {0,1} and a binary predictor variable {0,1}. My thought is that it doesn't make sense to do logistic regression unless I include other variables and calculate the odds ratio.

With one binary predictor, wouldn't calculating a probability suffice instead of an odds ratio?

In this case you can collapse your data to the $2 \times 2$ contingency table

|         | $y = 0$  | $y = 1$  |
|---------|----------|----------|
| $x = 0$ | $S_{00}$ | $S_{01}$ |
| $x = 1$ | $S_{10}$ | $S_{11}$ |

where $S_{ij}$ is the number of observations with $x = i$ and $y = j$ for $i, j \in \{0,1\}$. Suppose there are $n$ observations overall.

If we fit the model $p_i = g^{-1}(x_i^T \beta) = g^{-1}(\beta_0 + \beta_1 1_{x_i = 1})$ (where $g$ is our link function) we’ll find that $\hat \beta_0$ is the logit of the proportion of successes when $x_i = 0$ and $\hat \beta_0 + \hat \beta_1$ is the logit of the proportion of successes when $x_i = 1$. In other words,
$$\hat p_0 := g^{-1}(\hat \beta_0) = \frac{S_{01}}{S_{00} + S_{01}}$$

and

$$\hat p_1 := g^{-1}(\hat \beta_0 + \hat \beta_1) = \frac{S_{11}}{S_{10} + S_{11}}.$$
Let’s check this in R.

```r
n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)

tbl <- table(x = x, y = y)

mod <- glm(y ~ x, family = binomial())

# all the same at 0.5757576
binomial()$linkinv(mod$coef[1])
mean(y[x == 0])
tbl[1, 2] / sum(tbl[1, ])

# all the same at 0.5714286
binomial()$linkinv(mod$coef[1] + mod$coef[2])
mean(y[x == 1])
tbl[2, 2] / sum(tbl[2, ])
```


So the logistic regression coefficients are exact transformations of the proportions in the table.

The upshot is that we certainly can analyze this dataset with a logistic regression if we have data coming from a series of Bernoulli random variables, but it turns out to be no different than directly analyzing the resulting contingency table.
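To see that equivalence directly, here is a sketch that fits the same logistic regression to the aggregated table using `glm`'s two-column successes/failures response interface (the object names `mod_obs`, `mod_agg`, and `agg` are just local choices; the simulated data matches the code above):

```r
# same simulated data as above
n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)
tbl <- table(x = x, y = y)

# observation-level fit: one Bernoulli row per observation
mod_obs <- glm(y ~ x, family = binomial())

# table-level fit: one binomial row per unique x, with success/failure counts
agg <- data.frame(x = c(0, 1),
                  succ = tbl[, 2],   # counts of y = 1
                  fail = tbl[, 1])   # counts of y = 0
mod_agg <- glm(cbind(succ, fail) ~ x, family = binomial(), data = agg)

# identical coefficients, up to numerical tolerance
all.equal(coef(mod_obs), coef(mod_agg))
```

Both fits maximize the same likelihood, so the coefficients agree to numerical precision.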

I want to comment on why this works from a theoretical perspective. When we’re fitting a logistic regression, we are using the model that $Y_i | x_i \stackrel{\perp}{\sim} \text{Bern}(p_i)$. We then decide to model the mean as a transformation of a linear predictor in $x_i$, or in symbols $p_i = g^{-1}\left( \beta_0 + \beta_1 x_i\right)$. In our case we only have two unique values of $x_i$, and therefore there are only two unique values of $p_i$, say $p_0$ and $p_1$. Because of our independence assumption we have
$$S_{01} = \sum_{i \,:\, x_i = 0} Y_i \sim \text{Bin}(n_0, p_0)$$

and

$$S_{11} = \sum_{i \,:\, x_i = 1} Y_i \sim \text{Bin}(n_1, p_1),$$

where $n_0$ and $n_1$ count the observations with $x_i = 0$ and $x_i = 1$ respectively.
Note how we’re using the fact that the $x_i$, and in turn $n_0$ and $n_1$, are nonrandom: if this were not the case, these sums would not necessarily be binomial.

This means that the maximum likelihood estimates under the binomial model are the sample proportions

$$\hat p_0 = \frac{S_{01}}{n_0} \quad \text{and} \quad \hat p_1 = \frac{S_{11}}{n_1},$$

which are exactly the quantities the logistic regression recovers.
The key insight here: our Bernoulli RVs are $Y_i | x_i = j \sim \text{Bern}(p_j)$ while our binomial RVs are $S_{j1} \sim \text{Bin}(n_j, p_j)$, but both have the same probability of success. That’s the reason why these contingency table proportions are estimating the same thing as an observation-level logistic regression. It’s not just some coincidence with the table: it’s a direct consequence of the distributional assumptions we have made.
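As a footnote on the original question's mention of odds ratios: with a single binary predictor, $e^{\hat\beta_1}$ is exactly the cross-product odds ratio from the table, so here too the regression adds nothing beyond the table. A quick sketch, reusing the simulated data from above (`or_table` and `or_model` are just local names):

```r
# same simulated data as above
n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)
tbl <- table(x = x, y = y)

# cross-product odds ratio from the 2x2 table: (S11/S10) / (S01/S00)
or_table <- (tbl[2, 2] / tbl[2, 1]) / (tbl[1, 2] / tbl[1, 1])

# exp(beta_1) from the logistic fit
mod <- glm(y ~ x, family = binomial())
or_model <- unname(exp(coef(mod)[2]))

# agree up to numerical tolerance
all.equal(or_table, or_model)
```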