Suppose I have samples from two distinct populations. If I measure how long it takes each member to do a task, I can easily estimate the mean and variance of each population.
If I now hypothesise a random pairing with one individual from each population, can I estimate the probability that the first is faster than the second?
I do have a concrete example in mind: the measurements are timings for me cycling from A to B and the populations represent different routes I could take; I’m trying to work out what the probability is that picking route A for my next cycle will be faster than picking route B. When I actually do the cycle, I’ve got another data point for my sample set :).
I’m aware that this is a horribly simplistic way to try to work this out, not least because on any given day the wind is more likely to affect my time than anything else, so please let me know if you think I’m asking the wrong question…
Let the two means be μx and μy and their standard deviations be σx and σy, respectively. Assuming the two ride times are independent, the difference in timings between two rides (Y−X) has mean μy−μx and standard deviation √(σx² + σy²). The standardized difference (“z score”) is

z = (μy − μx) / √(σx² + σy²).
Unless your ride times have strange distributions, the chance that ride Y takes longer than ride X is approximately the Normal cumulative distribution, Φ, evaluated at z.
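If you would rather let a computer do it, here is a minimal Python sketch of the calculation above. It assumes the two ride times are independent and that the Normal approximation is reasonable; the function names and the numbers in the usage lines are hypothetical, chosen only for illustration.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_y_slower(mean_x, sd_x, mean_y, sd_y):
    """Approximate chance that a ride on route Y takes longer than one on route X.

    Assumes independent ride times and a roughly Normal difference Y - X.
    """
    z = (mean_y - mean_x) / math.sqrt(sd_x ** 2 + sd_y ** 2)
    return normal_cdf(z)


# Hypothetical estimates (minutes), not taken from any real data.
p = prob_y_slower(mean_x=25.0, sd_x=3.0, mean_y=29.0, sd_y=4.0)
print(f"P(Y slower than X) is roughly {p:.2f}")
```

In practice you would plug in the sample means and standard deviations from your own ride logs.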
You can work this probability out on one of your rides because you already have estimates of μx etc. :-). For this purpose it’s easy to memorize a few key values of Φ: Φ(0)=0.5=1/2, Φ(−1)≈0.16≈1/6, Φ(−2)≈0.023≈1/40, and Φ(−3)≈0.0013≈1/750. (The approximation may be poor for |z| much larger than 2, but knowing Φ(−3) helps with the interpolation.) In conjunction with Φ(z)=1−Φ(−z) and a bit of interpolation, you can quickly estimate the probability to one significant figure, which is more than precise enough given the nature of the problem and the data.
Suppose route X takes 30 minutes with a standard deviation of 6 minutes and route Y takes 36 minutes with a standard deviation of 8 minutes. With enough data covering a wide range of conditions, the histograms of your data might eventually approximate these:
(These are probability density functions for Gamma(25, 30/25) and Gamma(20, 36/20) variables, parameterized by shape and scale. Observe that they are decidedly skewed to the right, as one would expect for ride times.)
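Here z = (36 − 30) / √(6² + 8²) = 6/10 = 0.6, which lies between 0 and 1, where Φ(0) = 0.5 and Φ(1) = 1 − Φ(−1) ≈ 0.84. We therefore estimate the answer as 0.6 of the way between 0.5 and 0.84: 0.5 + 0.6 × (0.84 − 0.5) ≈ 0.70. (The correct but overly precise value for the Normal distribution is 0.73.)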
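The mental interpolation can be written out as a short Python sketch, in case it helps to see the steps. The table holds the memorized lower-tail values, rounded from 0.1587, 0.02275 and 0.00135; the names `LOWER_TAIL` and `phi_by_interpolation` are made up for this illustration, and clamping beyond |z| = 3 is my own simplification.

```python
# Memorized values of Phi(-k) for k = 0, 1, 2, 3 (rounded for mental use).
LOWER_TAIL = [0.5, 0.16, 0.023, 0.0013]


def phi_by_interpolation(z):
    """Rough Phi(z) by linear interpolation between the memorized values.

    Uses the symmetry Phi(z) = 1 - Phi(-z) for positive z.
    Beyond |z| = 3 the tail is simply clamped (a simplifying assumption).
    """
    a = abs(z)
    if a >= 3:
        tail = LOWER_TAIL[3]
    else:
        k = int(a)        # memorized point just below |z|
        frac = a - k      # how far |z| sits toward the next point
        tail = LOWER_TAIL[k] + frac * (LOWER_TAIL[k + 1] - LOWER_TAIL[k])
    return 1.0 - tail if z > 0 else tail


# The example from the text: z = 0.6 sits 0.6 of the way from Phi(0) to Phi(1).
estimate = phi_by_interpolation(0.6)
print(f"Interpolated Phi(0.6) is about {estimate:.2f}")
```

Linear interpolation undershoots a little here because Φ is concave for positive z, which is why the head calculation lands slightly below the more precise Normal value.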
There’s about a 70% chance that route Y will take longer than route X. Doing this calculation in your head will take your mind off the next hill. 🙂
(The correct probability for the histograms shown is 72%, even though neither is Normal: this illustrates the scope and utility of the Normal approximation for the difference in trip times.)
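If you want to check a figure like this yourself, one option is a quick simulation. The sketch below is not how the value above was obtained; it simply draws independent pairs from the two Gamma distributions mentioned earlier (shape and scale parameters) and counts how often Y exceeds X. The function name, sample size and seed are arbitrary choices for illustration.

```python
import random


def simulate_prob_y_slower(n_pairs=200_000, seed=0):
    """Monte Carlo estimate of P(Y > X) for independent Gamma ride times.

    X ~ Gamma(shape=25, scale=30/25), Y ~ Gamma(shape=20, scale=36/20).
    random.gammavariate(alpha, beta) takes shape alpha and scale beta.
    """
    rng = random.Random(seed)
    slower = 0
    for _ in range(n_pairs):
        x = rng.gammavariate(25, 30 / 25)
        y = rng.gammavariate(20, 36 / 20)
        if y > x:
            slower += 1
    return slower / n_pairs


print(f"Simulated P(Y slower than X): {simulate_prob_y_slower():.3f}")
```

The result will wobble slightly with the seed and the number of pairs, so treat it as a sanity check rather than an exact answer.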