Rare event logistic regression bias: how to simulate the underestimated p’s with a minimal example?

CrossValidated has several questions on when and how to apply the rare event bias correction by King and Zeng (2001). I am looking for something different: a minimal simulation-based demonstration that the bias exists.

In particular, King and Zeng state

“… in rare events data the biases in probabilities can be
substantively meaningful with sample sizes in the thousands and are in
a predictable direction: estimated event probabilities are too small.”

Here is my attempt to simulate such a bias in R:

# FUNCTIONS
do.one.sim = function(p){
N = length(p)
# Draw fake data based on probabilities p
y = rbinom(N, 1, p)
# Extract the fitted probability.
#    If p is constant, glm does y ~ 1, the intercept-only model.
#    If p is not constant, assume its smallest value is p[1]:
glm(y ~ p, family = 'binomial')$fitted[1] } mean.of.K.estimates = function(p, K){ mean(replicate(K, do.one.sim(p) )) } # MONTE CARLO N = 100 p = rep(0.01, N) reps = 100 # The following line may take about 30 seconds sim = replicate(reps, mean.of.K.estimates(p, K=100)) # Z-score: abs(p[1]-mean(sim))/(sd(sim)/sqrt(reps)) # Distribution of average probability estimates: hist(sim)  When I run this, I tend to get very small z-scores, and the histogram of estimates is very close to centered over the truth p = 0.01. What am I missing? Is it that my simulation is not large enough show the true (and evidently very small) bias? Does the bias require some kind of covariate (more than the intercept) to be included? Update 1: King and Zeng include a rough approximation for the bias of $\beta_0$ in equation 12 of their paper. Noting the N in the denominator, I drastically reduced N to be 5 and re-ran the simulation, but still no bias in the estimated event probabilities is evident. (I used this only as inspiration. Note that my question above is about estimated event probabilities, not $\hat \beta_0$.) Update 2: Following a suggestion in the comments, I included an independent variable in the regression, leading to equivalent results: p.small = 0.01 p.large = 0.2 p = c(rep(p.small, round(N/2) ), rep(p.large, N- round(N/2) ) ) sim = replicate(reps, mean.of.K.estimates(p, K=100))  Explanation: I used p itself as the independent variable, where p is a vector with a repetitions of a small value (0.01) and a larger value (0.2). In the end, sim stores only the estimated probabilities corresponding to $p = 0.01$ and there is no sign of bias. Update 3 (May 5, 2016): This does not noticeably change the results, but my new inner simulation function is do.one.sim = function(p){ N = length(p) # Draw fake data based on probabilities p y = rbinom(N, 1, p) if(sum(y) == 0){ # then the glm MLE = minus infinity to get p = 0 return(0) }else{ # Extract the fitted probability. # If p is constant, glm does y ~ 1, the intercept only model. # If p is not constant, assume its smallest value is p[1]: return(glm(y ~ p, family = 'binomial')$fitted[1])
}
}


Explanation: The MLE when y is identically zero does not exist (thanks to comments here for the reminder). R fails to throw a warning because its “positive convergence tolerance” actually gets satisfied. More liberally speaking, the MLE exists and is minus infinity, which corresponds to $p=0$; hence my function update. The only other coherent thing I can think of doing is discarding those runs of the simulation where y is identically zero, but that would clearly lead to results even more counter to the initial claim that “estimated event probabilities are too small”.

This is an interesting question – I’m have done a few simulations that I post below in the hope that this stimulates further discussion.

First of all, a few general comments:

• The paper you cite is about rare-event bias. What was not clear to me before (also with respect to comments that were made above) is if there is anything special about cases where you have 10/10000 as opposed to 10/30 observations. However, after some simulations, I would agree there is.

• A problem I had in mind (I have encountered this often, and there was recently a paper in Methods in Ecology and Evolution on that, I couldn’t find the reference though) is that you can get degenerate cases with GLMs in small-data situations, where the MLE is FAAAR away from the truth, or even at – / + infinity (due to the nonlinear link I suppose). It’s not clear to me how one should treat these cases in the bias estimation, but from my simulations I would say they seem key for the rare-event bias. My intuition would be to remove them, but then it’s not quite clear how far out they have to be to be removed. Maybe something to keep in mind for bias-correction.

• Also, these degenerate cases seem prone to cause numerical problems (I have therefore increased maxit in the glm function, but one could think about increasing epsilon as well to make sure one actually reports the true MLE).

Anyway, here some code that calculates the difference between estimates and truth for intercept, slope and predictions in a logistic regression, first for a low sample size / moderate incidence situation:

set.seed(123)
replicates = 1000
N= 40
slope = 2 # slope (linear scale)
intercept = - 1 # intercept (linear scale)

bias <- matrix(NA, nrow = replicates, ncol = 3)
incidencePredBias <- rep(NA, replicates)

for (i in 1:replicates){
pred = runif(N,min=-1,max=1)
linearResponse = intercept + slope*pred
data = rbinom(N, 1, plogis(linearResponse))
fit <- glm(data ~ pred, family = 'binomial', control = list(maxit = 300))
bias[i,1:2] = fit\$coefficients - c(intercept, slope)
bias[i,3] = mean(predict(fit,type = "response")) - mean(plogis(linearResponse))
}

par(mfrow = c(1,3))
text = c("Bias intercept", "Bias slope", "Bias prediction")

for (i in 1:3){
hist(bias[,i], breaks = 100, main = text[i])
abline(v=mean(bias[,i]), col = "red", lwd = 3)
}

apply(bias, 2, mean)
apply(bias, 2, sd) / sqrt(replicates)


The resulting bias and standard errors for intercept, slope and prediction are

-0.120429315  0.296453122 -0.001619793
0.016105833  0.032835468  0.002040664


I would conclude that there is pretty good evidence for a slight negative bias in the intercept, and a slight positive bias in the slope, although a look at the plotted results shows that the bias is small compared the the variance of the estimated values.

If I’m setting the parameters to a rare-event situation

N= 4000
slope = 2 # slope (linear scale)
intercept = - 10 # intercept (linear scale)


I’m getting a larger bias for the intercept, but still NONE on the prediction

   -1.716144e+01  4.271145e-01 -3.793141e-06
5.039331e-01  4.806615e-01  4.356062e-06


In the histogram of the estimated values, we see the phenomenon of degenerate parameter estimates (if we should call them like that)

Let’s remove all rows for which intercept estimates are <20

apply(bias[bias[,1] > -20,], 2, mean)
apply(bias[bias[,1] > -20,], 2, sd) / sqrt(length(bias[,1] > -10))


The bias decreases, and things become a bit more clear in the figures – parameter estimates are clearly not normally distributed. I wonder that that means for the validity of the CIs that are reported.

-0.6694874106  1.9740437782  0.0002079945
1.329322e-01 1.619451e-01 3.242677e-06


I would conclude the rare event bias on the intercept is driven by rare events itself, namely those rare, extremely small estimates. Not sure if we want to remove them or not, not sure what the cutoff would be.

An important thing to note though is that, either way, there seems to be no bias on predictions at the response scale – the link function simply absorbs these extremely small values.