Evaluating probabilistic forecasts of K-most-likely events from an arbitrarily large event space

Suppose a populous nation has a high homicide rate and an understaffed police force. The police chief hires a statistician, and together they decide to take a preventative approach by identifying would-be murderers before they commit the crime, along the lines of Minority Report.

The police chief requires the statistician to provide the following on a daily basis:

  1. A list of tomorrow’s top 100 most-likely murderers. (The statistician may have information about the entire citizen population, but the chief doesn’t have time to think about more than 100 cases.)
  2. For each person on the list, the statistician’s best estimate of the probability that the person will commit a murder (in the absence of intervention).

The police chief will regularly evaluate the statistician’s forecasts and provide bonus pay for good performance. Unfortunately, the chief does not know how to score the forecasts in a way that incentivizes the statistician to honestly strive toward the objectives (1) and (2). Can you help?

Here are two basic proposals of increasing complexity:

  • Score = recall = the number of people on the statistician’s list who attempt murder. But this gives no incentive for accurate probabilities (2).
  • Score = $100 - \sum_{i=1}^{100} (O_i - p_i)^2$, similar to the Brier score. Here $p_i$ is the forecasted probability for the $i$th person on the list, and $O_i$ is the true outcome (0 or 1) for their murdership status. But the statistician can easily maximize this by selecting 100 people with no chance of being murderers and taking $p_i$ to be identically 0.
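The weakness of the second proposal can be seen in a minimal sketch. The function name and both lists below are made up for illustration: an "honest" list of genuinely risky people is compared with a "gamed" list of harmless people all forecast at 0.

```python
def brier_like_score(probs, outcomes):
    """Score = K - sum((O_i - p_i)^2) over the K people on the list."""
    return len(probs) - sum((o - p) ** 2 for p, o in zip(probs, outcomes))

# Hypothetical honest list: 100 risky people, each forecast at 0.3,
# of whom 30 actually attempt murder.
honest_probs = [0.3] * 100
honest_outcomes = [1] * 30 + [0] * 70

# Hypothetical gamed list: 100 harmless people, each forecast at 0.
gamed_probs = [0.0] * 100
gamed_outcomes = [0] * 100

honest = brier_like_score(honest_probs, honest_outcomes)
gamed = brier_like_score(gamed_probs, gamed_outcomes)
print(f"honest list: {honest:.2f}, gamed list: {gamed:.2f}")
```

The gamed list reaches the maximum possible score of 100 even though it names no murderers at all, while the well-calibrated honest list scores lower.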

Any other ideas? I strongly suspect that this is not a new problem; a good reference may suffice.


I admire your commitment to world-building research for that dystopian novel you’ve been working on!

A possible argument that this problem is underdetermined without additional assumptions.

It seems (I lack definite proof) that we need to know at least the overall population size, and presumably some other factors as well. Consider a likelihood score.

Assume each member $i$ of the population commits murder independently at random with some murderousness probability $\theta_i$ (probably not true, but let’s run with it). The probability space is the powerset $\Omega_n = \mathcal P(\{0, \dots, n-1\})$ for population size $n$. Then outcome $X$, the set of people who murder, happens with the following probability, where a condition in an exponent counts as 1 when true and 0 when false:

$$P(X|\theta) = \underset{i<n}\prod \theta_i^{i \in X}(1-\theta_i)^{i \notin X}$$

Then, as alluded to in your remarks, for a complete prediction $\hat\theta$ of murderousness in the population, we could appropriately score the prediction, for example with the likelihood

$$\mathcal L(\theta|X) = P(X|\theta)$$
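As a rough sketch of this score for a complete prediction, the likelihood can be computed on the log scale to avoid underflow in large populations. The function name and the representation (a list of probabilities indexed by person, plus a set of murderer indices) are assumptions made for illustration, and every probability is assumed to lie strictly between 0 and 1.

```python
import math

def log_likelihood(theta, murderers):
    """log P(X | theta) for independent Bernoulli outcomes.

    theta     -- list of probabilities, theta[i] for person i (0 < theta[i] < 1)
    murderers -- set X of indices of people who committed murder
    """
    total = 0.0
    for i, t in enumerate(theta):
        # Person i contributes theta_i if they murdered, 1 - theta_i otherwise.
        total += math.log(t) if i in murderers else math.log(1.0 - t)
    return total

# Three-person population in which only person 0 murders.
example = log_likelihood([0.3, 0.2, 0.1], {0})
print(math.exp(example))  # likelihood 0.3 * 0.8 * 0.9
```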

An alternative could additionally incorporate some Bayesian prior and instead score the a-posteriori probability/credence of a particular prediction. (An appropriate choice would be a product of independent Beta distributions, one for each member of the population, which is then conjugate to the set of independent Bernoulli samples of each person’s murdership).
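A minimal sketch of this Bayesian variant follows, under the simplifying assumption (not made in the text) that every person shares the same Beta(a, b) prior. The function names and default hyperparameters are hypothetical. The score is the unnormalised log posterior density of the predicted probabilities given the observed outcome.

```python
import math

def log_beta_pdf(t, a, b):
    """Log density of Beta(a, b) at t, for 0 < t < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(t) + (b - 1.0) * math.log(1.0 - t) - log_norm

def log_posterior_score(theta, murderers, a=1.0, b=1.0):
    """Unnormalised log posterior of theta given the outcome set X.

    Assumes the same Beta(a, b) prior for every person (illustrative only).
    """
    score = 0.0
    for i, t in enumerate(theta):
        score += log_beta_pdf(t, a, b)  # prior term
        score += math.log(t) if i in murderers else math.log(1.0 - t)  # likelihood term
    return score

flat = log_posterior_score([0.3, 0.2, 0.1], {0})              # uniform prior
skeptical = log_posterior_score([0.3, 0.2, 0.1], {0}, 1.0, 9.0)  # prior favours low risk
print(flat, skeptical)
```

With a = b = 1 the prior is uniform and the score reduces to the plain log-likelihood; other choices shift credit toward predictions that agree with the prior.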

But for a truncated prediction $\hat\theta_k$ of top-k-murderousness, the likelihood is undefined. For example, the prediction $\hat\theta_3 = (0:0.3, 1:0.2, 2:0.1)$ might correspond to the ‘full’ parameterisation $\hat\theta^\star = \hat\theta_3 + (3:0.1, \dots, 99:0.1)$ or to $\hat\theta^\star = \hat\theta_3 + (3:0.001, \dots, 99:0.001)$, each of which assigns very different probabilities to any outcome in $\Omega_{100}$ and consequently has a very different likelihood or a-posteriori credibility.
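To make the ambiguity concrete, the sketch below completes the same three-person prediction in the two ways described above and scores both against one hypothetical outcome in which only person 0 murders. The outcome and the helper name are assumptions for illustration.

```python
import math

def log_lik(theta, murderers):
    """log P(X | theta) for independent Bernoulli outcomes."""
    return sum(
        math.log(t) if i in murderers else math.log(1.0 - t)
        for i, t in enumerate(theta)
    )

top3 = [0.3, 0.2, 0.1]
full_high = top3 + [0.1] * 97    # persons 3..99 at 0.1
full_low = top3 + [0.001] * 97   # persons 3..99 at 0.001

outcome = {0}  # hypothetical: only person 0 murders
ll_high = log_lik(full_high, outcome)
ll_low = log_lik(full_low, outcome)
print(ll_high, ll_low)
```

The two completions share an identical top-3 list, yet their log-likelihoods differ by roughly 10 units, so the truncated prediction alone does not pin down a score.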

It is not completely clear to me from your question whether the full outcome $X$ is observed, or only the truncated event $X_k$ consisting of the murdership of the named top-k-murderous members of the population.

Notice that, if we do make a particular choice of extrapolation from $\hat\theta_k$ to the full $\hat\theta^\star$, a truncated observation $X_k$ which witnesses only those individuals predicted in $\hat\theta_k$ is a well-defined event over the probability space and thus has a well-defined probability, allowing a likelihood score to be derived. But it suffers from the problem you identified for the Brier score: the statistician can control the censoring of the observations and so sidestep the first desideratum of naming only the most credible murderers.

If instead we have access to $X$, the full observation of murders committed, the likelihood or a-posteriori credibility of an extrapolated $\hat\theta^\star$ appears to me to be both defined and well-incentivised.

What remains in this picture is how to sensibly extrapolate from a truncated prediction $\hat\theta_k$ to a full prediction $\hat\theta^\star$.

A computationally tractable approach would be to have the statistician commit to a population size $n$ and a uniform baseline murderousness $p$ for the rest of the population not identified in $\hat\theta_k$, producing a Binomial ‘rest-of-population’ murder-count distribution. For suitable $n$ and $p$ the ‘rest-of-population’ likelihood factor could be even more tractably approximated as a Poisson and you could simplify and have her propose such a Poisson parameter $\lambda$. (This Poisson case is very plausible in the motivating scenario of populations and murders, but may not transfer to other cases.)

Let $r = |X \setminus \operatorname{dom}(\hat\theta_k)|$ be the number of ‘surprise murders’, that is, murders committed by people not on the list. Then:

Binomial case

$$\mathcal L(\hat\theta_k, n, p|X) = \binom{n-k}{r} p^r (1-p)^{n-k-r} \prod_{i \in \operatorname{dom}(\hat\theta_k)} \hat\theta_k[i]^{i \in X}(1 - \hat\theta_k[i])^{i \notin X}$$

Poisson case

$$\mathcal L(\hat\theta_k, \lambda|X) = \frac{\lambda^r e^{-\lambda}}{r!} \prod_{i \in \operatorname{dom}(\hat\theta_k)} \hat\theta_k[i]^{i \in X}(1 - \hat\theta_k[i])^{i \notin X}$$
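The Binomial and Poisson scores above might be computed as follows. This is a sketch on the log scale; the function names and the dict representation of $\hat\theta_k$ (person id mapped to probability) are assumptions, and it presumes listed probabilities strictly between 0 and 1 and $r \le n - k$. `math.comb` requires Python 3.8 or later.

```python
import math

def listed_log_lik(theta_k, murderers):
    """Bernoulli log-likelihood over the people named in theta_k."""
    return sum(
        math.log(p) if i in murderers else math.log(1.0 - p)
        for i, p in theta_k.items()
    )

def surprise_count(theta_k, murderers):
    """r: number of murderers who were not on the list."""
    return sum(1 for i in murderers if i not in theta_k)

def binomial_score(theta_k, n, p, murderers):
    """Log of the Binomial-case likelihood, with committed n and p."""
    k = len(theta_k)
    r = surprise_count(theta_k, murderers)
    log_rest = (
        math.log(math.comb(n - k, r))
        + r * math.log(p)
        + (n - k - r) * math.log(1.0 - p)
    )
    return log_rest + listed_log_lik(theta_k, murderers)

def poisson_score(theta_k, lam, murderers):
    """Log of the Poisson-case likelihood, with committed rate lam."""
    r = surprise_count(theta_k, murderers)
    log_rest = r * math.log(lam) - lam - math.lgamma(r + 1)  # lgamma(r+1) = log(r!)
    return log_rest + listed_log_lik(theta_k, murderers)

# Hypothetical example: person 0 (listed) and person 7 (unlisted) murder.
theta_k = {0: 0.3, 1: 0.2, 2: 0.1}
X = {0, 7}
b_score = binomial_score(theta_k, n=100, p=0.01, murderers=X)
p_score = poisson_score(theta_k, lam=97 * 0.01, murderers=X)
print(b_score, p_score)
```

With $\lambda = (n-k)p$ the two scores nearly coincide in this example, which is the sense in which the Poisson factor approximates the Binomial one.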

A ‘generous’ and similarly tractable approach might be to give ‘benefit of the doubt’ to everyone off the list: extrapolate to $\min_{i \in \operatorname{dom}(\hat\theta_k)} \hat\theta_k[i]$ for anyone who did in fact murder, and to $0$ for anyone who did not. This should still incentivise nominating the most plausible murderers and giving reasonable estimates, but it might introduce some bias.

Generous case

$$\mathcal L^\star(\hat\theta_k|X) = \left(\min_{i \in \operatorname{dom}(\hat\theta_k)} \hat\theta_k[i]\right)^r \prod_{i \in \operatorname{dom}(\hat\theta_k)} \hat\theta_k[i]^{i \in X}(1 - \hat\theta_k[i])^{i \notin X}$$
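A sketch of the generous score, again on the log scale. The function name and the dict representation of $\hat\theta_k$ are assumptions for illustration, and listed probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def generous_score(theta_k, murderers):
    """Log of the generous-case likelihood.

    theta_k   -- dict mapping listed person id to forecast probability
    murderers -- set X of ids of people who committed murder
    """
    r = sum(1 for i in murderers if i not in theta_k)  # surprise murders
    floor = min(theta_k.values())                      # smallest listed probability
    listed = sum(
        math.log(p) if i in murderers else math.log(1.0 - p)
        for i, p in theta_k.items()
    )
    # Each surprise murderer is credited with the smallest listed probability;
    # unlisted non-murderers contribute a factor of 1 (log 0 contribution).
    return r * math.log(floor) + listed

# Hypothetical example: person 0 (listed) and person 7 (unlisted) murder.
g_score = generous_score({0: 0.3, 1: 0.2, 2: 0.1}, {0, 7})
print(math.exp(g_score))  # 0.1 * 0.3 * 0.8 * 0.9
```

Because each surprise murder is charged at the smallest listed probability, padding the list with very unlikely people lowers the score whenever a surprise occurs, which is the intended incentive.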

Source : Link , Question Author : zkurtz , Answer Author : Oly
