Example where using accuracy as an outcome measure leads to a wrong conclusion

I am looking into various performance measures for predictive models. A lot has been written about the problems of using accuracy, rather than something more continuous, to evaluate model performance. Frank Harrell provides an example where adding an informative variable to a model leads to a drop in accuracy, a clearly counterintuitive and wrong conclusion.


However, in this case, this seems to be caused by having imbalanced classes, and thus it can be solved just by using balanced accuracy instead ((sens+spec)/2). Is there some example where using accuracy on a balanced dataset will lead to some clearly wrong or counterintuitive conclusions?


I am looking for something where accuracy will drop even when the model is clearly better, or that using accuracy will lead to a false positive selection of some features. It’s easy to make false-negative examples, where accuracy is the same for two models where one is clearly better using other criteria.


I’ll cheat.

Specifically, I have argued often (e.g., here) that the statistical part of modeling and prediction extends only to making probabilistic predictions for class memberships (or giving predictive densities, in the case of numerical forecasting). Treating a specific instance as if it belonged to a specific class (or point predictions in the numerical case), is not properly statistics any more. It is part of the decision theoretic aspect.

And decisions should not only be predicated on the probabilistic prediction, but also on costs of misclassifications, and on a host of other possible actions. For instance, even if you have only two possible classes, “sick” vs. “healthy”, you could have a large range of possible actions depending on how likely it is that a patient suffers from the disease, from sending him home because he is almost certainly healthy, to giving him two aspirin, to running additional tests, to immediately calling an ambulance and putting him on life support.

Assessing accuracy presupposes such a decision. Accuracy as an evaluation metric for classification is a category error.

So, to answer your question, I will walk down the path of just such a category error. We will consider a simple scenario with balanced classes where classifying without regard for the costs of misclassification will indeed mislead us badly.

Suppose an epidemic of Malignant Gutrot runs rampant in the population. Happily, we can screen everybody easily for some trait t (0\leq t \leq 1), and we know that the probability of developing MG depends linearly on t, p=\gamma t for some parameter \gamma (0\leq \gamma \leq 1). The trait t is uniformly distributed in the population.

Fortunately, there is a vaccine. Unfortunately, it is expensive, and the side effects are very uncomfortable. (I’ll let your imagination supply the details.) However, they are still better than suffering from MG.

In the interest of abstraction, I posit that there are indeed only two possible courses of action for any given patient, given their trait value t: either vaccinate, or do not vaccinate.

Thus, the question is: how should we decide who to vaccinate and who not to, given t? We will be utilitarian about this and aim at having the lowest total expected costs. It is obvious that this comes down to choosing a threshold \theta and to vaccinate everyone with t\geq\theta.

Model-and-decision 1 are accuracy-driven. Fit a model. Fortunately, we already know the model. Pick the threshold \theta that maximizes accuracy when classifying patients, and vaccinate everyone with t\geq \theta. We easily see that \theta=\frac{1}{2\gamma} is the magic number – everyone with t\geq \theta has a higher chance of contracting MG than not, and vice versa, so this classification probability threshold will maximize accuracy. Assuming balanced classes, \gamma=1, we will vaccinate half the population. Funnily enough, if \gamma<\frac{1}{2}, we will vaccinate nobody. (We are mostly interested in balanced classes, so let's disregard that we just let part of the population die a Horrible Painful Death.)
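To make the accuracy argument concrete, here is a minimal R sketch. It assumes that "vaccinate" stands in for "classified as sick", so accuracy is the share of people who are vaccinated and would have fallen ill, plus the share who are not vaccinated and would have stayed healthy. The function name accuracy and the closed forms in the comments are my own illustration, obtained by integrating \gamma t and 1-\gamma t over the uniform trait.

```r
# Accuracy of the rule "vaccinate if t >= theta" (hypothetical helper).
# P(vaccinated and sick)        = integral of gamma*t       from theta to 1
# P(not vaccinated and healthy) = integral of (1 - gamma*t) from 0 to theta
accuracy <- function(theta, gamma) {
  vaccinated_sick <- gamma * (1 - theta^2) / 2
  unvaccinated_healthy <- theta - gamma * theta^2 / 2
  vaccinated_sick + unvaccinated_healthy
}

gamma <- 1  # balanced classes
best <- optimize(accuracy, interval = c(0, 1), gamma = gamma, maximum = TRUE)

best$maximum     # numerical maximizer, to compare with 1 / (2 * gamma)
best$objective   # accuracy achieved at that threshold
```

For \gamma<\frac{1}{2} the derivative 1-2\gamma\theta stays positive on the whole interval, so the numerical maximizer should drift to the upper boundary, which is the "vaccinate nobody" case.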

Needless to say, this does not take the differential costs of misclassification into account.

Model-and-decision 2 leverage both our probabilistic prediction ("given your trait t, your probability of contracting MG is \gamma t") and the cost structure.

First, here is a little graph. The horizontal axis gives the trait, the vertical axis the MG probability. The shaded triangle gives the proportion of the population who will contract MG. The vertical line gives some particular \theta. The horizontal dashed line at \gamma\theta will make the calculations below a bit simpler to follow. We assume \gamma>\frac{1}{2}, just to make life easier.


Let's give our costs names and calculate their contributions to total expected costs, given \theta and \gamma (and the fact that the trait is uniformly distributed in the population).

  • Let c^+_+ denote the cost for a patient who is vaccinated and would have contracted MG. Given \theta, the proportion of the population who incurs this cost is the shaded trapezoid at the bottom right with area
    (1-\theta)\gamma\theta + \frac{1}{2}(1-\theta)(\gamma-\gamma\theta).
  • Let c^-_+ denote the cost for a patient who is vaccinated and would not have contracted MG. Given \theta, the proportion of the population who incurs this cost is the unshaded trapezoid at the top right with area
    (1-\theta)(1-\gamma) + \frac{1}{2}(1-\theta)(\gamma-\gamma\theta).
  • Let c^-_- denote the cost for a patient who is not vaccinated and would not have contracted MG. Given \theta, the proportion of the population who incurs this cost is the unshaded trapezoid at the top left with area
    \theta(1-\gamma\theta) + \frac{1}{2}\theta\gamma\theta.
  • Let c^+_- denote the cost for a patient who is not vaccinated and would have contracted MG. Given \theta, the proportion of the population who incurs this cost is the shaded triangle at the bottom left with area
    \frac{1}{2}\theta\gamma\theta.

(In each trapezoid, I first calculate the area of the rectangle, then add the area of the triangle.)
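As a sanity check on the four areas, one can simulate the population and compare observed shares with the geometric formulas. This is only a sketch: the values gamma = 0.7 and theta = 0.4, the sample size, the seed, and all variable names are arbitrary choices for illustration.

```r
set.seed(1)
n <- 1e6
gamma <- 0.7
theta <- 0.4

trait <- runif(n)                    # trait is uniform on [0, 1]
sick <- runif(n) < gamma * trait     # contracts MG with probability gamma * trait
vaccinated <- trait >= theta         # decision rule: vaccinate if trait >= theta

simulated <- c(
  vacc_sick      = mean(vaccinated & sick),
  vacc_healthy   = mean(vaccinated & !sick),
  unvacc_healthy = mean(!vaccinated & !sick),
  unvacc_sick    = mean(!vaccinated & sick)
)

# Rectangle plus triangle for each trapezoid; the last region is a plain triangle.
formula <- c(
  vacc_sick      = (1 - theta) * gamma * theta + (1 - theta) * (gamma - gamma * theta) / 2,
  vacc_healthy   = (1 - theta) * (1 - gamma) + (1 - theta) * (gamma - gamma * theta) / 2,
  unvacc_healthy = theta * (1 - gamma * theta) + theta * gamma * theta / 2,
  unvacc_sick    = theta * gamma * theta / 2
)

round(rbind(simulated, formula), 3)
sum(formula)   # the four regions tile the unit square
```

The simulated shares should agree with the formulas up to sampling noise, and the four areas should add up to one.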

Total expected costs are

c^+_+\bigg((1-\theta)\gamma\theta + \frac{1}{2}(1-\theta)(\gamma-\gamma\theta)\bigg) +
c^-_+\bigg((1-\theta)(1-\gamma) + \frac{1}{2}(1-\theta)(\gamma-\gamma\theta)\bigg) +
c^-_-\bigg(\theta(1-\gamma\theta) + \frac{1}{2}\theta\gamma\theta\bigg) +
c^+_-\,\frac{1}{2}\theta\gamma\theta.

Differentiating and setting the derivative to zero, we obtain that expected costs are minimized by
\theta^\ast = \frac{c^-_+-c^-_-}{\gamma(c^+_-+c^-_+-c^+_+-c^-_-)}.
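One way to fill in the differentiation step is sketched below. The symbols A^+_+, A^-_+, A^-_-, A^+_- for the four areas and C for total expected costs are shorthand I introduce here; the simplified forms follow from expanding the rectangle-plus-triangle expressions.

```latex
\begin{align*}
A^+_+ &= \tfrac{\gamma}{2}(1-\theta^2), \qquad
A^-_+ = (1-\theta) - \tfrac{\gamma}{2}(1-\theta^2), \\
A^-_- &= \theta - \tfrac{\gamma}{2}\theta^2, \qquad
A^+_- = \tfrac{\gamma}{2}\theta^2, \\
\frac{dC}{d\theta}
  &= -c^+_+\,\gamma\theta + c^-_+(\gamma\theta-1) + c^-_-(1-\gamma\theta) + c^+_-\,\gamma\theta \\
  &= \gamma\theta\,\bigl(c^+_- + c^-_+ - c^+_+ - c^-_-\bigr) - \bigl(c^-_+ - c^-_-\bigr) = 0, \\
\frac{d^2C}{d\theta^2}
  &= \gamma\,\bigl(c^+_- + c^-_+ - c^+_+ - c^-_-\bigr).
\end{align*}
```

Solving the first-order condition for \theta gives the expression above. The stationary point is a minimum when the second derivative is positive, that is, when c^+_-+c^-_+ > c^+_++c^-_-, and it is only a usable threshold if it lands in [0,1].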

This is only equal to the accuracy maximizing value of \theta for a very specific cost structure, namely if and only if
\frac{1}{2\gamma} = \frac{c^-_+-c^-_-}{\gamma(c^+_-+c^-_+-c^+_+-c^-_-)},
\frac{1}{2} = \frac{c^-_+-c^-_-}{c^+_-+c^-_+-c^+_+-c^-_-}.
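Cross-multiplying makes the required cost structure easier to read (a short rearrangement, assuming the denominator is nonzero):

```latex
\begin{align*}
c^+_- + c^-_+ - c^+_+ - c^-_- &= 2\,\bigl(c^-_+ - c^-_-\bigr) \\
\Longleftrightarrow\quad c^+_- - c^+_+ &= c^-_+ - c^-_- .
\end{align*}
```

In words: maximizing accuracy is cost-optimal only if the cost saved by vaccinating someone who would have contracted MG equals the extra cost of vaccinating someone who would have stayed healthy.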

As an example, suppose that \gamma=1 for balanced classes and that costs are
c^+_+ = 1, \quad c^-_+=2, \quad c^+_-=10, \quad c^-_-=0.
Then the accuracy maximizing \theta=\frac{1}{2} will yield expected costs of 1.875, whereas the cost minimizing \theta=\frac{2}{11} will yield expected costs of 1.318.
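These numbers can be reproduced with a few lines of R. The sketch below plugs the example costs into the area formulas with \gamma=1; the names costs, theta_accuracy, theta_cost and share_vaccinated are mine, chosen for illustration.

```r
gamma <- 1  # balanced classes
costs <- c(vacc_sick = 1, vacc_healthy = 2, unvacc_healthy = 0, unvacc_sick = 10)

# Total expected cost of the rule "vaccinate if t >= theta".
expected_cost <- function(theta) {
  areas <- c(
    vacc_sick      = (1 - theta) * gamma * theta + (1 - theta) * (gamma - gamma * theta) / 2,
    vacc_healthy   = (1 - theta) * (1 - gamma) + (1 - theta) * (gamma - gamma * theta) / 2,
    unvacc_healthy = theta * (1 - gamma * theta) + theta * gamma * theta / 2,
    unvacc_sick    = theta * gamma * theta / 2
  )
  sum(costs * areas)  # costs and areas are listed in the same order
}

# Accuracy-maximizing threshold and closed-form cost-minimizing threshold.
theta_accuracy <- 1 / (2 * gamma)
theta_cost <- unname(
  (costs["vacc_healthy"] - costs["unvacc_healthy"]) /
    (gamma * (costs["unvacc_sick"] + costs["vacc_healthy"] -
                costs["vacc_sick"] - costs["unvacc_healthy"]))
)

cost_accuracy <- expected_cost(theta_accuracy)
cost_optimal <- expected_cost(theta_cost)

# Share of the population vaccinated under each rule (everyone with t >= theta).
share_vaccinated <- c(accuracy = 1 - theta_accuracy, cost = 1 - theta_cost)

round(c(theta_cost = theta_cost, cost_accuracy = cost_accuracy, cost_optimal = cost_optimal), 3)
round(share_vaccinated, 3)
```

Printing share_vaccinated alongside the costs also shows how many people each rule ends up vaccinating.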

In this example, basing our decisions on non-probabilistic classifications that maximized accuracy led to fewer vaccinations (half the population instead of 9/11 of it) and higher costs than using a decision rule that explicitly used the differential cost structures in the context of a probabilistic prediction.

Bottom line: accuracy is only a valid decision criterion if

  • there is a one-to-one relationship between classes and possible actions
  • and the costs of actions applied to classes follow a very specific structure.

In the general case, evaluating accuracy asks a wrong question, and maximizing accuracy is a so-called type III error: providing the correct answer to the wrong question.

R code:

gamma <- 0.7

cost_treated_positive <- 1          # cost of treatment, side effects unimportant
cost_treated_negative <- 2          # cost of treatment, side effects unnecessary
cost_untreated_positive <- 10       # horrible, painful death
cost_untreated_negative <- 0        # nothing

expected_cost <- function ( theta ) {
    cost_treated_positive * ( (1-theta)*theta*gamma + (1-theta)*(gamma-gamma*theta)/2 ) +
    cost_treated_negative * ( (1-theta)*(1-gamma) + (1-theta)*(gamma-gamma*theta)/2 ) +
    cost_untreated_negative *( theta*(1-gamma*theta) + theta*gamma*theta/2 ) +
    cost_untreated_positive * theta*gamma*theta/2
}

(theta <- optim(par=0.5,fn=expected_cost,lower=0,upper=1,method="L-BFGS-B")$par)

plot(c(0,1),c(0,1),type="n",bty="n",xaxt="n",xlab="Trait t",yaxt="n",ylab="MG probability")


Source : Link , Question Author : rep_ho , Answer Author : Tamas Ferenci
