Does it make sense to use Logistic regression with binary outcome and predictor?

I have a binary outcome variable {0,1} and a predictor variable {0,1}. My thoughts are that it doesn’t make sense to do logistic unless I include other variables and calculate the odds ratio.

With one binary predictor, wouldn’t calculation of probability suffice vs odds ratio?

Answer

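In this case you can collapse your data to

$$
\begin{array}{c|cc}
X \backslash Y & 0 & 1 \\ \hline
0 & S_{00} & S_{01} \\
1 & S_{10} & S_{11}
\end{array}
$$

where $S_{ij}$ is the number of instances with $x = i$ and $y = j$, for $i, j \in \{0, 1\}$. Suppose there are $n$ observations overall.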

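If we fit the model $p_i = g^{-1}(x_i^T \beta) = g^{-1}(\beta_0 + \beta_1 \mathbf{1}_{x_i = 1})$ (where $g$ is our link function) we’ll find that $\hat\beta_0$ is the logit of the proportion of successes when $x_i = 0$ and $\hat\beta_0 + \hat\beta_1$ is the logit of the proportion of successes when $x_i = 1$. In other words,

$$
\hat\beta_0 = g\left(\frac{S_{01}}{S_{00} + S_{01}}\right)
$$

and

$$
\hat\beta_0 + \hat\beta_1 = g\left(\frac{S_{11}}{S_{10} + S_{11}}\right).
$$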

Let’s check this in R.

```r
n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)

tbl <- table(x=x,y=y)

mod <- glm(y ~ x, family=binomial())

# all the same at 0.5757576
binomial()$linkinv( mod$coef[1])
mean(y[x == 0])
tbl[1,2] / sum(tbl[1,])

# all the same at 0.5714286
binomial()$linkinv( mod$coef[1] + mod$coef[2])
mean(y[x == 1])
tbl[2,2] / sum(tbl[2,])
```

So the logistic regression coefficients are exactly transformations of proportions coming from the table.
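
Since the question mentions the odds ratio, here is a small sketch of the same idea for the slope. Subtracting the two logits gives $\hat\beta_1 = \log\frac{S_{11}/S_{10}}{S_{01}/S_{00}}$, the log of the sample odds ratio from the table. The snippet reuses the simulated data from the check above; `or_tbl` is just an illustrative name, and the calculation assumes no cell of the table is zero.

```r
# Same simulated data as in the check above
n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)

tbl <- table(x = x, y = y)
mod <- glm(y ~ x, family = binomial())

# Sample odds ratio from the 2x2 table: (S11 / S10) / (S01 / S00)
# (assumes every cell count is positive)
or_tbl <- (tbl[2, 2] / tbl[2, 1]) / (tbl[1, 2] / tbl[1, 1])

# The fitted slope and the log odds ratio from the table should agree
unname(coef(mod)["x"])
log(or_tbl)
```

So with a single binary predictor, the odds ratio reported by the regression is the one you would compute by hand from the four counts.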

The upshot is that we certainly can analyze this dataset with a logistic regression if we have data coming from a series of Bernoulli random variables, but it turns out to be no different than directly analyzing the resulting contingency table.
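
One way to see this equivalence in practice is to fit the model to the collapsed table rather than to the individual observations. The sketch below is illustrative only: the data frame `grouped` and its column names are made up for this example, and the two-column `cbind(successes, failures)` response is the standard way to pass binomial counts to `glm`.

```r
# Same simulated data as in the check above
n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)
tbl <- table(x = x, y = y)

# Observation-level (Bernoulli) fit
mod_obs <- glm(y ~ x, family = binomial())

# Collapsed (binomial) fit: one row per value of x
grouped <- data.frame(
  x         = c(0, 1),
  successes = c(tbl[1, 2], tbl[2, 2]),  # S01, S11
  failures  = c(tbl[1, 1], tbl[2, 1])   # S00, S10
)
mod_grp <- glm(cbind(successes, failures) ~ x,
               family = binomial(), data = grouped)

# Estimates and standard errors should agree up to numerical tolerance
coef(summary(mod_obs))[, 1:2]
coef(summary(mod_grp))[, 1:2]
```

The residual deviance and degrees of freedom will differ between the two fits, because each is measured against a different saturated model, but the coefficient estimates and their standard errors are the same.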


I want to comment on why this works from a theoretical perspective. When we’re fitting a logistic regression, we are using the model that $Y_i \mid x_i \overset{\perp}{\sim} \text{Bern}(p_i)$. We then decide to model the mean as a transformation of a linear predictor in $x_i$, or in symbols $p_i = g^{-1}(\beta_0 + \beta_1 x_i)$. In our case we only have two unique values of $x_i$, and therefore there are only two unique values of $p_i$, say $p_0$ and $p_1$. Because of our independence assumption we have

$$
\sum_{i : x_i = 0} Y_i = S_{01} \sim \text{Bin}(n_0, p_0)
$$

and

$$
\sum_{i : x_i = 1} Y_i = S_{11} \sim \text{Bin}(n_1, p_1).
$$
Note how we’re using the fact that the $x_i$, and in turn $n_0$ and $n_1$, are nonrandom: if this were not the case then these sums would not necessarily be binomial.

This means that
$$
S_{01}/n_0 = \frac{S_{01}}{S_{00} + S_{01}} \to_p p_0 \quad \text{and} \quad S_{11}/n_1 = \frac{S_{11}}{S_{10} + S_{11}} \to_p p_1.
$$
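
The convergence can be illustrated with a quick simulation. This is only a sketch: the helper `sim_props` and the values 0.3 and 0.7 for the two success probabilities are hypothetical choices for the example, and the design is held fixed (alternating 0 and 1) so that the two group sizes are nonrandom.

```r
# Hypothetical helper: simulate n Bernoulli outcomes under a fixed design
# and return the two table proportions S01/n0 and S11/n1
sim_props <- function(n, p0 = 0.3, p1 = 0.7) {
  x <- rep(c(0, 1), length.out = n)          # fixed design: n0, n1 nonrandom
  y <- rbinom(n, 1, ifelse(x == 1, p1, p0))  # Y_i | x_i = j ~ Bern(p_j)
  c(n = n, prop0 = mean(y[x == 0]), prop1 = mean(y[x == 1]))
}

set.seed(42)
res <- t(sapply(c(1e2, 1e4, 1e6), sim_props))
res  # one row per sample size; prop0 and prop1 settle near p0 and p1
```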

The key insight here: our Bernoulli RVs are $Y_i \mid x_i = j \sim \text{Bern}(p_j)$ while our binomial RVs are $S_{j1} \sim \text{Bin}(n_j, p_j)$, but both have the same probability of success. That’s the reason why these contingency table proportions are estimating the same thing as an observation-level logistic regression. It’s not just some coincidence with the table: it’s a direct consequence of the distributional assumptions we have made.
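
To spell out why the fitted coefficients match the table proportions exactly, and not only in the limit, it may help to write the observation-level likelihood in terms of the four counts. This is a sketch under the assumptions already stated (independent Bernoulli outcomes, fixed $x_i$), with $n_j = S_{j0} + S_{j1}$:

$$
L(p_0, p_1) = \prod_{i=1}^{n} p_i^{Y_i} (1 - p_i)^{1 - Y_i}
= p_0^{S_{01}} (1 - p_0)^{S_{00}} \; p_1^{S_{11}} (1 - p_1)^{S_{10}}.
$$

Up to the constant factor $\binom{n_0}{S_{01}} \binom{n_1}{S_{11}}$, this is the product of the two binomial likelihoods, so both views give the same maximizer:

$$
\hat p_0 = \frac{S_{01}}{n_0}, \qquad \hat p_1 = \frac{S_{11}}{n_1}.
$$

Because $(\beta_0, \beta_1) \mapsto (p_0, p_1) = \left(g^{-1}(\beta_0),\, g^{-1}(\beta_0 + \beta_1)\right)$ is one-to-one, the invariance of maximum likelihood estimates gives $\hat\beta_0 = g(\hat p_0)$ and $\hat\beta_0 + \hat\beta_1 = g(\hat p_1)$, provided both proportions lie strictly between 0 and 1.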

Attribution
Source : Link , Question Author : kms , Answer Author : jld
