Why am I not getting a uniform p-value distribution from logistic regression with random predictors?

The code below generates a set of test data consisting of a series of “signal” probabilities with binomial noise around them. It then uses 5000 sets of random numbers as “explanatory” series and calculates the logistic regression p-value for each.

I find that the random explanatory series are statistically significant at the 5% level in 57% of the cases. If you read through the longer part of the post below I attribute this to the presence of the strong signal in the data.

So here’s the primary question: what test should I use when evaluating the statistical significance of an explanatory variable when the data contains a strong signal? The simple p-value seems to be quite misleading.

Here’s a more detailed explanation of the issue.

I’m puzzled by results I get for logistic regression p-values when the predictor is actually just a set of random numbers. My initial thought was that the distribution of p-values should be flat in this case; the R code below actually shows a huge spike at low p-values. Here’s the code:

set.seed(541713)

lseries <-   50
nbinom  <-  100
ntrial  <- 5000
pavg    <- .1  # median probability
sd      <- 0 # data is pure noise
sd      <- 1 # data has a strong signal
orthogonalPredictor <- TRUE   # random predictor that is orthogonal to the true signal
orthogonalPredictor <- FALSE  # completely random predictor

qprobs  <- c(.05,.01) # find the true quantiles for these p-values

yactual <- sd * rnorm(lseries)  # random signal
pactual <- 1 / (1 + exp(-(yactual + log(pavg / (1-pavg)))))
heads   <- rbinom(lseries, nbinom, pactual)
  ## test data, binomial noise around pactual, the probability "signal"
flips   <- cbind(heads, nbinom - heads)
# summary(glm(flips ~ yactual, family = "binomial"))

pval <- numeric(ntrial)

for (i in 1:ntrial){
  yrandom <- rnorm(lseries)
  if (orthogonalPredictor){ yrandom <- residuals(lm(yrandom ~ yactual)) }
  s       <- summary(glm(flips ~ yrandom, family="binomial"))
  pval[i] <- s$coefficients[2,4]
}

hist(pval, breaks=100)
print(quantile(pval, probs=c(.01,.05)))
actualCL <- sapply(qprobs, function(c){ sum(pval <= c) / length(pval) })
print(data.frame(nominalCL=qprobs, actualCL))

The code generates test data consisting of binomial noise around a strong signal, then in a loop fits a logistic regression of the data against each set of random numbers and accumulates the p-values of the random predictors. The results are displayed as a histogram of the p-values, the empirical p-value quantiles corresponding to nominal confidence levels of 1% and 5%, and the actual false positive rates at those levels.

I think one reason for the unexpected results is that a random predictor will generally have some incidental correlation with the true signal, and this mostly accounts for what I see (the quick check after this paragraph gives a sense of how large that correlation typically is). However, if you set orthogonalPredictor to TRUE there will be exactly zero correlation between the random predictors and the actual signal, yet the problem is still there, at a reduced level. My best explanation is that since the true signal does not appear anywhere in the fitted model, the model is misspecified and the p-values are suspect anyway. But this is a catch-22: who ever has a set of predictors available that exactly fits the data? So here are some additional questions:
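
A quick check of that first point, reusing lseries and yactual from the code above (the 2000 replications and the seed are arbitrary choices for this sketch):

set.seed(1)
cors <- replicate(2000, cor(rnorm(lseries), yactual))
summary(abs(cors))  # typical |correlation| is around 0.1 when lseries = 50
sd(cors)            # close to 1/sqrt(lseries - 1), the null sd of a sample correlation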

  1. What is the precise null hypothesis for logistic regression? Is it that the data is purely binomial noise around a constant level (i.e., there isn’t a true signal)? If you set sd to 0 in the code then there is no signal and the histogram does look flat.

  2. The implicit null hypothesis in the code is that the predictor has no more explanatory power than a set of random numbers; it is tested by comparing the predictor's p-value against, say, the empirical 5% quantile of the random-predictor p-values displayed by the code. Is there a better way of testing this hypothesis, or at least one that is not so numerically intensive?

—— Additional information

This code mimics the following problem: historical default rates show significant variation over time (the signal), driven by economic cycles, and the actual default counts at a given point in time are binomial around these default probabilities. I was trying to find explanatory variables for the signal when I became suspicious of the p-values. In this test the signal is randomly ordered over time rather than following economic cycles, but that shouldn't matter to logistic regression. So there is no overdispersion; the signal is really meant to be a signal.

Answer

There are several issues here. In particular, there seems to be some confusion about how to simulate a standard logistic regression. Briefly, you don't add noise around the probability "signal". As a result of the way you did this, there is a huge amount of variability in the resulting 'binomial'(-esque) data, far more than there should be. Here are the observed proportions in your dataset:

plot(flips[,1]/rowSums(flips))

[Figure: observed proportions for the 50 series, scattered from near 0 to above 0.4]
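
To quantify how much extra variability that is, a quick check (reusing flips and nbinom from your code) compares the empirical variance of the observed proportions with the variance a single binomial probability would produce; the ratio is far above 1, consistent with the roughly 12x dispersion estimated below:

phat <- flips[,1] / nbinom                 # observed proportions per series
var(phat)                                  # empirical variance across series
mean(phat) * (1 - mean(phat)) / nbinom     # binomial variance at the average rate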

If those .4+ observations end up on one side or the other, they will act as 'outliers' (they aren't really) and inflate the type I error rate in a model that doesn't account for the fact that these data aren't really binomial. Here is a version that uses a simple hack to let the model detect and account for the overdispersion:

set.seed(5082)
pval <- numeric(ntrial)
for (i in 1:ntrial){
  yrandom <- rnorm(lseries)
  s       <- summary(glm(flips ~ yrandom, family="quasibinomial"))  # changed family
  pval[i] <- s$coefficients[2,4]
}

hist(pval, breaks=100)
print(quantile(pval, probs=c(.01,.05)))
#          1%          5% 
# 0.006924617 0.046977246 
actualCL <- sapply(qprobs, function(c){ sum(pval <= c) / length(pval) })
print(data.frame(nominalCL=qprobs, actualCL))
#   nominalCL actualCL
# 1      0.05   0.0536
# 2      0.01   0.0128

[Figure: histogram of the 5000 quasibinomial p-values, now roughly uniform]

This is the model summary from the last iteration. Note that the dispersion is estimated to be $\approx 12\times$ what it should be for a true binomial:

s
# Call:
# glm(formula = flips ~ yrandom, family = "quasibinomial")
# 
# Deviance Residuals: 
#    Min      1Q  Median      3Q     Max  
# -5.167  -2.925  -1.111   1.101   8.110  
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -1.96910    0.14942 -13.178   <2e-16 ***
# yrandom     -0.02736    0.14587  -0.188    0.852    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for quasibinomial family taken to be 11.97867)
# 
#     Null deviance: 532.38  on 49  degrees of freedom
# Residual deviance: 531.96  on 48  degrees of freedom
# AIC: NA
# 
# Number of Fisher Scoring iterations: 5
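
For reference, the quasibinomial dispersion estimate is the sum of squared Pearson residuals divided by the residual degrees of freedom, and the standard errors (and hence the p-values) are scaled by its square root. A minimal sketch of that computation, refitting the last model from the loop (yrandom still holds its final value):

fit <- glm(flips ~ yrandom, family = "quasibinomial")
sum(residuals(fit, type = "pearson")^2) / df.residual(fit)  # ~12, matching the summary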

Here is another version, where I fit the same binomial model that you did, but generate the data without the added variation around the baseline probability, so that the data really are binomial. (Code that is otherwise the same as before is omitted for brevity.)

set.seed(541713)
...
pactual <- 1 / (1 + exp(-(log(pavg / (1-pavg)))))  # deleted yactual
...
for (i in 1:ntrial){
  yrandom <- rnorm(lseries)
  if (orthogonalPredictor){ yrandom <- residuals(lm(yrandom ~ yactual)) }
  s       <- summary(glm(flips ~ yrandom, family="binomial"))
  pval[i] <- s$coefficients[2,4]
}

hist(pval, breaks=100)
print(quantile(pval, probs=c(.01,.05)))
#         1%         5% 
# 0.01993318 0.07027473 
actualCL <- sapply(qprobs, function(c){ sum(pval <= c) / length(pval) })
print(data.frame(nominalCL=qprobs, actualCL))
#   nominalCL actualCL
# 1      0.05   0.0372
# 2      0.01   0.0036

[Figure: histogram of the 5000 p-values for the truly binomial data]
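
For completeness, here is what a standard logistic regression simulation looks like when the predictor really does drive the signal and is included in the fitted model: the binomial sampling around plogis(eta) is the only source of noise. The slope of 1 is an arbitrary illustrative choice:

x   <- rnorm(lseries)
eta <- log(pavg / (1 - pavg)) + 1 * x      # true slope of 1 (illustrative)
p   <- plogis(eta)                         # same as 1 / (1 + exp(-eta))
y   <- rbinom(lseries, nbinom, p)
fit <- glm(cbind(y, nbinom - y) ~ x, family = "binomial")
summary(fit)$coefficients["x", ]           # slope recovered, with a well-behaved p-value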

Attribution
Source: Link, Question Author: Ron Hylton, Answer Author: gung - Reinstate Monica
