P(X1

I have a set of mutually independent normal distributions X_1 to X_5 (with means and standard deviations) which represent finishing times for swimmers over a certain distance. The actual data is as follows:

X_1(60, 3.0)
X_2(61, 1.0)
X_3(58, 2.3)
X_4(63, 2.4)
X_5(61, 1.7)
So swimmer 1 (X_1) has a mean finishing time of 60 seconds with a standard deviation of 3.0 seconds.

Question 1: What is the probability of an event where X_i finishes first. e.g.

P(X_1 \lt X_i, i=2,\ldots,n)

Question 2: If I calculate this for all swimmers, can I simply order the results to determine the most probable finishing order?

This is not homework.

Based on the answers to this Cross Validated question, I have tried to solve this problem based on the first answer. i.e.

\Pr(X_1 \le X_i, i=2,\ldots,n) = \int_{-\infty}^{\infty} \phi_1(t) [1 – \Phi_2(t)]\cdots[1 – \Phi_n(t)]dt

Where \phi_i is the PDF of X_i and \Phi_i is its CDF.

Based on this formula, the results I obtained were:

\Pr(X_1 \le X_i, i=2\ldots5) = 0.259653
\Pr(X_2 \le X_i, i=1,3\ldots5) = 0.214375
\Pr(X_3 \le X_i, i=1\ldots2, 4\ldots5) = 0.611999
\Pr(X_4 \le X_i, i=1\ldots3, 5) = 0.0263479
\Pr(X_5 \le X_i, i=1\ldots4) = 0.0697597
However, the probabilities add to 1.182135 when they should add to 1.0. I’m not sure if the formula is incorrect or my implementation of the integral (I used Excel and the trapezoidal method).

I also attempted to solve the problem using Dillip’s method (from the above mentioned question) as follows:

\begin{align*}
P(X_1 < \max X_i) &= P\{(X_1 < X_2) \cup \cdots \cup (X_1 < X_n)\\
&\leq \sum_{i=2}^n P(X_1 < X_i)\\
&= \sum_{i=2}^n Q\left(\frac{\mu_1 – \mu_i}{\sqrt{\sigma_1^2 + \sigma_i^2}}\right)
\end{align*}

However, the probability results were much greater than 1 in most cases so abandoned this approach. By the way, what exactly does \max X_i mean?

Any assistance in calculating the probability would be appreciated.

Answer

I always find it best in these situations to run a Monte Carlo simulation to check (roughly) what the correct answer should be. Here is some R code for doing that:

doRace <- function()
{
  times <- rnorm(mean=c(60,61,58,63,61),sd=c(3,1,2.3,2.4,1.7),n=5)
  winner <- which.min(times)
  winner
}

winners <- replicate(n=10000,expr=doRace())
table(winners) / length(winners)

Which gives the following output for me (of course you will get slightly different answers depending on the state of your random number generator):

winners
 1      2      3      4      5 
0.2573 0.0317 0.6108 0.0282 0.0720

This indicates that the issue is with swimmer 2, as these results otherwise agree well with yours. I suspect you just have an incorrect cell reference somewhere. Note that a reasonable resolution to the problem is to use the Monte Carlo simulation not just as a verification method but as the final implementation for calculating the probabilities. After all, numerical integration is itself an approximate and computationally expensive procedure.

In order to be absolutely sure, we can use the integrate() function in R. First define the integral:

integral<- function(x,whichSwimmer)
{
  means <- c(60,61,58,63,61)
  sds <- c(3,1,2.3,2.4,1.7)

  dnorm(x,mean=means[whichSwimmer],sd=sds[whichSwimmer]) *
    (1 - pnorm(x,mean=means[-whichSwimmer][1],sd=sds[-whichSwimmer][1])) *
    (1 - pnorm(x,mean=means[-whichSwimmer][2],sd=sds[-whichSwimmer][2])) *
    (1 - pnorm(x,mean=means[-whichSwimmer][3],sd=sds[-whichSwimmer][3])) *
    (1 - pnorm(x,mean=means[-whichSwimmer][4],sd=sds[-whichSwimmer][4]))
}

Then we can calculate the probability for each swimmer in turn:

>integrate(integral,whichSwimmer=1,lower=0,upper=100)
0.2596532 with absolute error < 2.5e-05

>integrate(integral,whichSwimmer=2,lower=0,upper=100)
0.03223977 with absolute error < 6.4e-06

>integrate(integral,whichSwimmer=3,lower=0,upper=100)
0.6119995 with absolute error < 1.5e-06

>integrate(integral,whichSwimmer=4,lower=0,upper=100)
0.02634785 with absolute error < 1.4e-06

>integrate(integral,whichSwimmer=5,lower=0,upper=100)
0.06975967 with absolute error < 8.1e-05

Which gives very good agreement with the Monte Carlo simulation.

Note that although you can technically give lower and upper bounds of negative/positive infinity to integrate() I found that this caused the procedure to break down, giving clearly incorrect results.

EDIT: I've just noticed you had a second question regarding the most likely ordering of swimmers. Again, we can easily check whether the intuition about just ordering the probability of each swimmer winning is correct by running a Monte Carlo simulation. We just need to adapt the sampling function to return the order of the swimmers rather than only the winner:

doRace <- function()
{
  times <- rnorm(mean=c(60,61,58,63,61),sd=c(3,1,2.3,2.4,1.7),n=5)
  finishOrder <- order(times)

  paste(finishOrder,collapse="")
}

finishOrders <- replicate(n=1e6,expr=doRace())
which.max(table(finishOrders) / length(finishOrders))

I get the ouput:

31254 
   50

In other words, the most likely order is 3,1,2,5,4 which is not the same as ordering the swimmers by their probability of winning!

For me, this is another reason to prefer the Monte Carlo approach as the final implementation as you can easily answer this and other questions - e.g. what is each swimmer's probability of finishing second or, given that swimmer 1 finishes first, what is the most likely ordering of the remaining swimmers?

EDIT 2: To be able to answer these other questions, we need to change the sampling function again, this time to return the complete order in which the swimmers finish:

doRace <- function()
{
  times <- rnorm(mean=c(60,61,58,63,61),sd=c(3,1,2.3,2.4,1.7),n=5)
  finishOrder <- order(times)

  finishOrder
}

finishOrders <- replicate(n=1e6,expr=doRace())

finishOrders is a matrix where each column corresponds to a single simulated race, the first row gives the winner of each race, the second row the second placed swimmer of each race and so on. So, to get the probability that each swimmer finishes second we do:

> table(finishOrders[2,]) / ncol(finishOrders)

       1        2        3        4        5 
0.271749 0.198205 0.235460 0.075165 0.219421

To find the most likely order given that swimmer 1 wins the race is a little more fiddly. First, extract all races where the first row is equal to 1:

finishOrdersWhen1WinsRace <- finishOrders[,finishOrders[1,]==1]

Then turn the finish order of each race from a vector of numbers into a character string so we can use the table function to find the most frequent one:

> which.max(table(apply(finishOrdersWhen1WinsRace,2,paste,collapse="")))
13254 
    8

In other words, given that swimmer 1 wins the race, the most likely order of the remaining swimmers is 3,2,5,4, which occurs:

> max(table(apply(finishOrdersWhen1WinsRace,2,paste,collapse="")) / ncol(finishOrdersWhen1WinsRace))
[1] 0.2341227

23.4\% of the time.

I'm not sure whether a Monte Carlo approach is the only way to answer these questions but it seems likely that even if you can obtain closed-form expressions for the probabilities you'll need to rely on numerical integration like you did to find the winning probabilities.

Attribution
Source : Link , Question Author : Jason , Answer Author : M. Berk

Leave a Comment