# Why is a P > 0.5 cutoff not “optimal” for logistic regression?

PREFACE: I don’t care about the merits of using a cutoff or not, or how one should choose a cutoff. My question is purely mathematical and due to curiosity.

Logistic regression models the posterior probability of class A versus class B, and its 0.5 decision boundary is a hyperplane where the two posterior probabilities are equal. So, in theory, I understood that a 0.5 classification cutoff should minimize total errors regardless of class balance, since the model estimates the posterior probability (assuming you consistently encounter the same class ratio at prediction time).
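The claim in this paragraph can be written out explicitly. Assuming the model output $p(x)$ equals the true posterior $\Pr(A \mid x)$, the conditional error rate of a deterministic rule is

$$
\Pr(\text{error} \mid x) =
\begin{cases}
1 - p(x), & \text{if we predict } A,\\
p(x), & \text{if we predict } B,
\end{cases}
$$

which is minimized pointwise by predicting $A$ exactly when $p(x) > 0.5$; integrating over $x$ then shows the 0.5 cutoff minimizes the total expected number of errors, for any class balance.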

In my real-life example, I obtain very poor accuracy using P > 0.5 as my classification cutoff (about 51% accuracy). However, when I looked at the AUC, it was above 0.99. So I tried some different cutoff values and found that P > 0.6 gave me 98% accuracy (90% for the smaller class and 99% for the bigger class), with only 2% of cases misclassified.
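This combination (near-perfect AUC, poor accuracy at 0.5) happens whenever the scores rank the classes well but the probabilities are shifted. A minimal numpy sketch with synthetic data (hypothetical numbers, not the actual dataset from the question) reproduces the pattern; the AUC is computed via the Mann-Whitney rank statistic, which depends only on the ranking, not the cutoff:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.9).astype(int)            # 1:9 imbalance, as in the question

# Scores that separate the classes almost perfectly...
s = rng.normal(loc=np.where(y == 1, 3.0, 0.0), scale=0.5)
# ...but are mapped to probabilities that are shifted downward (miscalibrated)
p = 1.0 / (1.0 + np.exp(-(s - 3.2)))

# AUC via the Mann-Whitney rank statistic (threshold-free; no ties here)
order = np.argsort(s)
ranks = np.empty(n)
ranks[order] = np.arange(1, n + 1)
n1 = y.sum()
n0 = n - n1
auc = (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

# Accuracy at the conventional 0.5 cutoff
acc_half = np.mean((p > 0.5) == y)
```

Here `auc` comes out above 0.99 while `acc_half` is far below what a better-placed cutoff achieves, because the shift in `p` moves many positives below 0.5 without changing their ranks.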

The classes are heavily unbalanced (1:9) and it is a high-dimensional problem. However, I allocated the classes in equal proportions to each cross-validation fold (i.e., stratified), so the class balance should be the same during model fitting and prediction. I also tried predicting on the same data used to fit the model, and the same issue occurred.

I’m interested in the reason why 0.5 would not minimize errors; I thought this would hold by design if the model is fit by minimizing cross-entropy loss.

Does anyone have any feedback as to why this happens? Is it due to the penalization I added, and if so, can someone explain what is happening?

Having said those things, $0.50$ is rarely going to be the optimal cutoff for classifying observations. To get an intuitive sense of how this could happen, imagine that you had $N = 100$ with $99$ observations in the positive category. A simple, intercept-only model could easily have $49$ false negatives when you use $0.50$ as your cutoff. On the other hand, if you just called everything positive, you would have $1$ false positive, but $99\%$ correct.
More generally, logistic regression is trying to fit the true probability of being positive for each observation as a function of the explanatory variables. It is not trying to maximize accuracy by centering the predicted probabilities around the $0.50$ cutoff. If your sample isn’t $50\%$ positive, there is just no reason $0.50$ would maximize the percent correct.
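One concrete mechanism consistent with the question's penalization guess: shrinking coefficients pulls predicted probabilities toward the base rate, which moves the accuracy-maximizing cutoff away from 0.5. A numpy sketch under assumed Gaussian class-conditionals (the "shrunken" model below is a stand-in mimicking heavy penalization, not a fitted estimator):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
prior = 0.9                                   # 1:9 class balance, as in the question
y = (rng.random(n) < prior).astype(int)
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

# True posterior P(y=1 | x) from Bayes' rule for these two Gaussians
lr = np.exp(2 * x)                            # likelihood ratio phi(x-1) / phi(x+1)
p_true = prior * lr / (prior * lr + (1 - prior))

# A "shrunken" model: probabilities pulled toward the base rate,
# as heavy penalization of the slope tends to do on imbalanced data
p_hat = 0.7 * prior + 0.3 * p_true

# Sweep cutoffs and find the one that maximizes accuracy
cutoffs = np.linspace(0.05, 0.95, 91)
acc = np.array([np.mean((p_hat > c) == y) for c in cutoffs])
best_cutoff = cutoffs[acc.argmax()]

# At 0.5 every p_hat exceeds the cutoff, so everything is called positive
acc_half = np.mean((p_hat > 0.5) == y)
```

In this simulation `acc_half` is just the base rate (everything predicted positive), while `best_cutoff` lands well above 0.5 and recovers the Bayes rule, raising accuracy by a few points, which is the qualitative pattern described in the question.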