tl;dr
How do I correctly compute the Brier score for more than two classes? I got confusing results with different approaches. Details below.
As suggested to me in a comment to this question, I would like to evaluate the quality of a set of classifiers I trained with the Brier score. These classifiers are multiclass classifiers and the classes are imbalanced. The Brier score should be able to handle these conditions. However, I am not quite confident about how to apply the Brier score test. Say I have 10 data points and 5 classes:
One hot vectors represent which class is present in a given item of data:
targets = array([[0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]])
Vectors of probabilities represent the outputs of my classifiers, assigning a probability to each class
probs = array([[0.14, 0.38, 0.4 , 0.04, 0.05], [0.55, 0.05, 0.34, 0.04, 0.01], [0.3 , 0.35, 0.18, 0.09, 0.08], [0.23, 0.22, 0.04, 0.05, 0.46], [0. , 0.15, 0.47, 0.28, 0.09], [0.23, 0.13, 0.34, 0.27, 0.03], [0.32, 0.06, 0.59, 0.02, 0.01], [0.01, 0.19, 0.01, 0.03, 0.75], [0.27, 0.38, 0.03, 0.12, 0.2 ], [0.17, 0.45, 0.11, 0.25, 0.01]])
These matrices are coindexed, so
probs[i, j]
is the probability of classtargets[i, j]
.Now, according to Wikipedia the definition of the Brier Score for multiple classes is
1NN∑t=1R∑i=1(fti−oti)2
When I program this in Python and run it on the above
targets
andprobs
matrices, I get a result of 1.0069>>> def brier_multi(targets, probs): ... return np.mean(np.sum((probs  targets)**2, axis=1)) ... >>> brier_multi(targets, probs) 1.0068899999999998
But I am not sure if I interpreted the definition correctly.
For Python the sklearn library provides
sklearn.metrics.brier_score_loss
. While the documentation statesThe Brier score is appropriate for binary and categorical outcomes that can be structured as true or false
What the function actually does is pick one (or get one passed as an argument) of n>2 classes and treat that class as class 1 and all other classes as class 0.
For example, if we choose class 3 (index 2) as the 1 class and thus all other classes as class 0, we get:
>>> # get true classes by argmax over binary arrays ... true_classes = np.argmax(targets, axis=1) >>> >>> brier_score_loss(true_classes, probs[:,2], pos_label=2) 0.13272999999999996
alternatively:
>>> brier_score_loss(targets[:,2], probs[:,2]) 0.13272999999999996
This is indeed the binary version of the Brier score, as can be shown by manually defining and running it:
>>> def brier_bin_(targets, probs): ... return np.mean((targets  probs) ** 2) >>> brier_bin(targets[:,2], probs[:,2]) 0.13272999999999996
As you can see, this is the same result as with sklearn’s
brier_score_loss
.Wikipedia states about the binary version:
This formulation is mostly used for binary events (for example “rain”
or “no rain”). The above equation is a proper scoring rule only for
binary events;So… Now I am confused and have the following questions:
1) If sklearn computes the multi class Brier score as a One vs. All binary score, is that the only and correct way to compute the multi class Brier score?
Which leads me to
2) If that is so, my
brier_multi
code must be based on a misconception. What is my misconception about the definition of the multiclass Brier score?3) Maybe I am on the wrong track altogether. In which case, please explain to me, how I compute the Brier score correctly?
Answer
Wikipedia’s version of the Brier score for multiple categories is correct. Compare the original publication by Brier (1950), or any number of academic publications, e.g. Czado et al. (2009) (equation (6), though you would need to do some simple arithmetic and drop a constant 1 to arrive at Brier’s formulation).

If sklearn calculates a binary “one against all” Brier score and averages over all choices of a focal class, then it can certainly do so. However, it is simply not the Brier score. Passing it off as such is misleading and wrong.

The misconception lies entirely with sklearn.

Just use your
brier_multi
, it’s completely correct.
Attribution
Source : Link , Question Author : lo tolmencre , Answer Author : Stephan Kolassa