PREFACE: I don’t care about the merits of using a cutoff or not, or how one should choose a cutoff. My question is purely mathematical and due to curiosity.
Logistic regression models the posterior conditional probability of class A versus class B, and its decision boundary is the hyperplane where the two posterior probabilities are equal. So in theory, I understood that a 0.5 classification cutoff should minimize total errors regardless of class balance, since the model estimates the posterior probability (assuming you consistently encounter the same class ratio).
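To spell out the "in theory" claim (my notation, not from any textbook): writing $p(x) = P(Y = 1 \mid X = x)$ for the true posterior, the expected misclassification rate of the rule "predict 1 iff $p(x) > t$" is

$$R(t) = E\big[\, p(X)\,\mathbf{1}\{p(X) \le t\} + (1 - p(X))\,\mathbf{1}\{p(X) > t\} \,\big],$$

and for each $x$ the smaller of $p(x)$ and $1 - p(x)$ is picked exactly when $t = 0.5$, so $R(t)$ is minimized there. Note this argument is about the *true* posterior $p(x)$, not an estimated $\hat p(x)$.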
In my real-life example, I obtain very poor accuracy using P > 0.5 as my classification cutoff (about 51% accuracy). However, the AUC is above 0.99. So I looked at some different cutoff values and found that P > 0.6 gave me 98% accuracy (90% for the smaller class and 99% for the bigger class) – only 2% of cases misclassified.
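For concreteness, this is the kind of cutoff scan I mean, here on synthetic data (the data set, feature counts, and class weights are made up for illustration, not my actual data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced problem (roughly 1:9), purely illustrative.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
p = model.predict_proba(X)[:, 1]

print("AUC:", roc_auc_score(y, p))
for cut in (0.3, 0.4, 0.5, 0.6, 0.7):
    # Accuracy at this cutoff: fraction of observations classified correctly.
    acc = np.mean((p > cut) == y)
    print(f"cutoff {cut:.1f}: accuracy {acc:.3f}")
```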
The classes are heavily unbalanced (1:9) and it is a high-dimensional problem. However, I stratified the cross-validation folds so that the class balance is the same during model fitting and prediction. I also tried predicting on the same data used to fit the model, and the same issue occurred.
I’m interested in why 0.5 does not minimize errors; I thought this would hold by design if the model is fit by minimizing cross-entropy loss.
Does anyone have any feedback as to why this happens? Is it due to the penalization? If so, can someone explain what is happening?
You don’t have to get predicted categories from a logistic regression model; it can be fine to stay with predicted probabilities. If you do produce predicted categories, you should not use that information for anything beyond saying ‘this observation is best classified into this category’. For example, you should not use ‘accuracy’ / percent correct to select a model.
Having said those things, .50 is rarely going to be the optimal cutoff for classifying observations. To get an intuitive sense of how this could happen, imagine that you had N = 100 with 99 observations in the positive category. A model whose predicted probabilities hover around .50 could easily put, say, 49 of the positives just below the cutoff, giving 49 false negatives. On the other hand, if you just called everything positive, you would have 1 false positive, but 99% correct.
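The all-positive end of that comparison is easy to check directly. Note that an intercept-only logistic regression fit to these data would itself call everything positive, since its fitted probability is just the sample proportion, .99:

```python
import numpy as np

# Toy data matching the example above: 99 positives, 1 negative.
y = np.array([1] * 99 + [0])

# Intercept-only logistic regression has a closed form: the fitted
# probability is the sample proportion of positives, the same for everyone.
p_hat = y.mean()                         # 0.99
pred = np.full_like(y, p_hat > 0.5)      # cutoff .50: everything positive
accuracy = np.mean(pred == y)
print(accuracy)                          # 0.99 — one false positive, no false negatives
```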
More generally, logistic regression tries to fit the true probability of being positive as a function of the explanatory variables. It is not trying to maximize accuracy by centering predicted probabilities around the .50 cutoff. If your sample isn’t 50% positive, there is just no reason .50 would maximize the percent correct.
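On the penalization question: one plausible mechanism (a hedged sketch on synthetic data, not a claim about your specific model) is that a strong L2 penalty shrinks the coefficients, compressing the predicted probabilities toward the base rate, which can move the accuracy-optimal cutoff away from .50:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic ~1:9 problem; in sklearn, smaller C means a stronger L2 penalty.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=1)
cuts = np.linspace(0.05, 0.95, 19)
for C in (1e-3, 1.0):
    p = LogisticRegression(C=C, max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    # Cutoff that maximizes in-sample accuracy for this amount of shrinkage.
    best = cuts[np.argmax([np.mean((p > c) == y) for c in cuts])]
    print(f"C={C}: probability spread (std) {p.std():.3f}, best cutoff {best:.2f}")
```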