The intuition behind the different scoring rules

Consider the three scoring rules in the case of a binary prediction:

  1. Log: sum(log(ifelse(outcome, probability, 1-probability))) / n
  2. Brier: sum((outcome-probability)**2) / n
  3. Sphere: sum(ifelse(outcome, probability, 1-probability)/sqrt(probability**2+(1-probability)**2)) / n
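To make the three formulas concrete, here is a minimal sketch in Python/NumPy (function names are my own; `outcome` is a 0/1 array and `p` is the forecast probability of the event):

```python
import numpy as np

def log_score(outcome, p):
    """Mean log of the probability assigned to what actually happened."""
    prob_of_outcome = np.where(outcome == 1, p, 1 - p)
    return np.mean(np.log(prob_of_outcome))

def brier_score(outcome, p):
    """Mean squared error between the 0/1 outcome and the forecast probability."""
    return np.mean((outcome - p) ** 2)

def spherical_score(outcome, p):
    """Probability assigned to what happened, normalised by the length
    of the forecast vector (p, 1-p)."""
    prob_of_outcome = np.where(outcome == 1, p, 1 - p)
    return np.mean(prob_of_outcome / np.sqrt(p**2 + (1 - p)**2))
```

A perfect forecaster (probability 1 on every event that happens) gets a log score of 0, a Brier score of 0, and a spherical score of 1; log and spherical are oriented so that higher is better, while for Brier lower is better.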

What is the intuition behind them? When should I use one and not the other?
I am especially interested in the case of low prevalence (e.g., 0.1%).

PS. This is to evaluate the results from my calibration algorithm which I asked about before.

Answer

One place where log scoring may be inappropriate: the comparison of human forecasters (who may tend to overstate their confidence).

Log scoring strongly penalizes very overconfident wrong predictions. A wrong prediction that was made with 100% confidence gets an infinite penalty. For example, suppose a commentator says “I am 100% sure that Smith will win the election,” and then Smith loses the election. Under log scoring, the average score of all the commentator’s predictions is now permanently stuck at −∞, the worst possible score. A useful scoring rule should be able to distinguish somebody who has made a single wrong 100%-confidence prediction from somebody who makes them all the time.
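A small sketch of this effect (the forecaster and numbers are hypothetical): one wrong 100%-confidence call drags the mean log score to minus infinity no matter how many good forecasts surround it, while the Brier score stays bounded.

```python
import numpy as np

# Hypothetical track record: 99 well-calibrated forecasts,
# plus one wrong prediction made with 100% confidence.
outcomes = np.array([1] * 99 + [0])      # the last event did not happen
probs    = np.array([0.9] * 99 + [1.0])  # ...but was forecast with certainty

prob_of_outcome = np.where(outcomes == 1, probs, 1 - probs)

with np.errstate(divide="ignore"):
    log_score = np.mean(np.log(prob_of_outcome))  # one term is log(0) = -inf

brier_score = np.mean((outcomes - probs) ** 2)    # worst single term is only 1

print(log_score)    # -inf, regardless of the 99 good forecasts
print(brier_score)  # small and finite
```

The Brier penalty for the single certain-but-wrong call is 1, averaged over 100 predictions, so the overall score still reflects the forecaster's otherwise good record.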

Attribution
Source: Link, Question Author: sds, Answer Author: fblundun