Unable to get correct coefficients for logistic regression in simulated dataset

I devised this toy example:

```
library(sigmoid)
N <- 10000
age <- runif(N, min=20, max=90)
e <- rnorm(N, 0, 5)
hi <- logistic(-100+2*age+e)
hid <- ifelse(hi>=0.5, T, F)
hid <- as.factor(hid)
df <- data.frame(age=age, hid=hid)
lr <- glm(hid~age, data=df, family=binomial(link="logit"))
s <- summary(lr)
print(s)
```

The variable hid contains 4304 FALSE and 5696 TRUE.

I would have expected the logistic regression to recover the coefficients I used to simulate the data (-100 and 2).

Instead, I get an intercept of -39.46 and a slope of 0.79, both with p-values $\approx 0$.

What am I doing wrong?

Answer

If you’re trying to generate data from logistic regression’s assumed data generating mechanism, your code does not do that.

Logistic regression’s data generating mechanism looks like

$$ \eta = X\beta$$
$$ p = \dfrac{1}{1+e^{-\eta}}$$
$$ y \sim \operatorname{Binomial}(n, p) $$

What your code does instead is simulate a linear model in log-odds space with an additive error term and then deterministically threshold the resulting probability at 0.5. That is not the same mechanism: in logistic regression the randomness comes from the binomial likelihood, not from an additive error on the linear predictor. In this example each observation is a single trial ($n = 1$), so the outcome is one Bernoulli draw with success probability $p$. To simulate data so that glm will estimate the parameters you specified, do the following:

```
library(sigmoid)
N <- 10000
age <- runif(N, min=20, max=90)
# Changes start here
p <- logistic(-100+2*age)  # probability from the linear predictor, no added error term
hid <- rbinom(N, 1, p)     # one Bernoulli (binomial, n = 1) draw per observation
# End changes
df <- data.frame(age=age, hid=hid)
lr <- glm(hid~age, data=df, family=binomial(link="logit"))
s <- summary(lr)
print(s)
```
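
As a quick sanity check (a minimal sketch, assuming the simulation above was just run), you can compare the fitted coefficients with the true values used to generate the data:

```
# Compare the true simulation parameters with the glm estimates
cbind(true = c("(Intercept)" = -100, age = 2), estimated = coef(lr))

# Wald-based 95% confidence intervals for the estimates
confint.default(lr)
```

The estimates will still carry sampling error, but they are now estimates of the parameters you actually specified.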

Attribution
Source: Link, Question Author: robertspierre, Answer Author: Demetri Pananos
