I have a dataset in which the event rate is very low (40,000 out of $12\cdot10^5$).
I am applying logistic regression to it. In a discussion with someone, it came up that logistic regression would not give a good confusion matrix on data with such a low event rate. Because of the business problem and the way it has been defined, I can’t increase the number of events beyond 40,000, although I could delete some of the nonevent population.
Please tell me your views on this, specifically:
- Does the accuracy of logistic regression depend on the event rate, or is there a recommended minimum event rate?
- Is there any special technique for low event rate data?
- Would deleting my nonevent population be good for the accuracy of my model?
I am new to statistical modeling so forgive my ignorance and please address any associated issues that I could think about.
I’m going to answer your questions out of order:
3 Would deleting my nonevent population be good for the accuracy of my model?
Each observation will provide some additional information about the parameter (through the likelihood function). Therefore there is no point in deleting data, as you would just be losing information.
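To make the information loss concrete, here is a rough sketch using the standard large-sample standard error of a log odds ratio from a 2x2 table, $\sqrt{1/a + 1/b + 1/c + 1/d}$. The split of events and non-events by a binary predictor is entirely hypothetical (only the totals of 40,000 and 1,160,000 come from the question), and `log_odds_ratio_se` is just an illustrative helper name.

```python
import math

def log_odds_ratio_se(a, b, c, d):
    """Large-sample SE of the log odds ratio for a 2x2 table with cell counts a, b, c, d."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Hypothetical split by a binary predictor ("exposed" vs "unexposed").
events_exposed, events_unexposed = 10_000, 30_000            # 40,000 events
nonevents_exposed, nonevents_unexposed = 160_000, 1_000_000  # 1,160,000 non-events

# All data kept.
se_full = log_odds_ratio_se(events_exposed, events_unexposed,
                            nonevents_exposed, nonevents_unexposed)

# Non-events thinned down to 40,000 (same proportions), events untouched.
keep = 40_000 / 1_160_000
se_subsampled = log_odds_ratio_se(events_exposed, events_unexposed,
                                  nonevents_exposed * keep,
                                  nonevents_unexposed * keep)

print(f"SE with all non-events:   {se_full:.4f}")
print(f"SE with 40,000 non-events: {se_subsampled:.4f}")
```

Under these made-up counts the standard error grows when non-events are discarded, which is the sense in which deleting data loses information.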
1 Does the accuracy of logistic regression depend on the event rate, or is there a recommended minimum event rate?
Technically, yes: an observation of the rare outcome is much more informative than one of the common outcome (that is, it makes the likelihood function steeper). So if your event ratio were 50:50, you would get much tighter confidence bands (or credible intervals, if you’re being Bayesian) for the same amount of data. However, you don’t get to choose your event rate (unless you’re doing a case-control study), so you’ll have to make do with what you have.
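As a rough illustration of the size of this effect, consider the simplest case of an intercept-only logistic model, where the Fisher information for the log-odds is $n\,p\,(1-p)$ and the approximate standard error is therefore $1/\sqrt{n\,p\,(1-p)}$. The sketch below plugs in the question's numbers; the intercept-only setup and the helper name `intercept_se` are simplifying assumptions, not part of the original problem.

```python
import math

def intercept_se(n, p):
    """Approximate SE of the log-odds in an intercept-only logistic model."""
    return 1.0 / math.sqrt(n * p * (1.0 - p))

n = 1_200_000
se_rare = intercept_se(n, 40_000 / n)   # event rate from the question (1 in 30)
se_balanced = intercept_se(n, 0.5)      # same n, hypothetical 50:50 event rate

print(f"SE at 1-in-30 event rate: {se_rare:.5f}")
print(f"SE at 50:50 event rate:   {se_balanced:.5f}")
print(f"ratio:                    {se_rare / se_balanced:.2f}")
```

With the same number of observations, the interval at the observed event rate is a bit under three times wider than it would be at 50:50, yet still narrow in absolute terms because 40,000 events is a lot.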
2 Is there any special technique for low event rate data?
The biggest problem that might arise is perfect separation. This happens when some combination of variables gives all non-events (or all events). In this case, the maximum likelihood parameter estimates (and their standard errors) will approach infinity, although usually the algorithm will stop beforehand. There are two possible solutions:
a) Remove predictors from the model. Though this will make your algorithm converge, you will be removing the variable with the most explanatory power, so this only makes sense if your model was overfitting to begin with (such as fitting too many complicated interactions).
b) Use some sort of penalisation, such as a prior distribution, which will shrink the estimates back to more reasonable values.
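Here is a minimal sketch of option (b) on a toy, perfectly separated dataset. It assumes one particular form of penalisation, an L2 (ridge) penalty, which corresponds to a Gaussian prior on the coefficient; other penalties or priors would work too. The data, the no-intercept model, the step size, and the function name `fit_slope` are all made up for illustration.

```python
import math

# Toy data: every x < 0 is a non-event and every x > 0 is an event,
# so the classes are perfectly separated.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_slope(lam, steps, lr=0.1):
    """Gradient ascent on the (optionally L2-penalised) log-likelihood
    of the no-intercept model P(y = 1 | x) = sigmoid(b * x)."""
    b = 0.0
    for _ in range(steps):
        grad = sum((y - sigmoid(b * x)) * x for x, y in zip(xs, ys))
        grad -= lam * b  # gradient of the penalty -(lam / 2) * b**2
        b += lr * grad
    return b

# Without a penalty the estimate keeps growing the longer we iterate.
b_unpen_short = fit_slope(lam=0.0, steps=100)
b_unpen_long = fit_slope(lam=0.0, steps=10_000)

# With a penalty it settles at a finite value.
b_pen_short = fit_slope(lam=1.0, steps=1_000)
b_pen_long = fit_slope(lam=1.0, steps=10_000)

print(b_unpen_short, b_unpen_long)
print(b_pen_short, b_pen_long)
```

The unpenalised slope drifts upward indefinitely (the optimiser only stops because we told it to), while the penalised slope converges to the same finite value regardless of how long it runs.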