I have a data set detailing a large number of cricket games (a few thousand). In cricket “bowlers” repeatedly throw a ball at a succession of “batsmen”. The bowler is trying to get the batsman “out”. In this respect it’s quite similar to pitchers and batters in baseball.

If I took the whole dataset and divided the total number of balls that got a batsman out by the total number of balls bowled, I can see that I would have the average probability of a bowler getting a batsman out – it will be around 0.03 (hopefully I haven’t gone wrong already?)

What I am interested in is what I can do to try and calculate the probability of a specific batsman being bowled out by a specific bowler on the next ball.

The dataset is large enough that any given bowler will have bowled thousands of balls to a wide range of batsmen. So I believe that I could simply divide the number of outs a bowler achieved by the number of balls he has bowled to calculate a new probability for that specific bowler getting an out from the next ball.

My problem is the dataset is not large enough to guarantee that a given bowler has bowled a statistically significant number of balls at any given batsman. So if I’m interested in calculating the probability of an out for a specific bowler facing a specific batsman, I don’t think this can be done in the same simplistic way.

My question is whether the following approach is valid:

Across the whole dataset the probability of a ball getting an out is 0.03.

If I calculate that on average bowler A has a probability of getting an out of 0.06 (i.e. twice as likely as the average bowler),

and on average batsman B has a probability of being out of 0.01 (a third as likely as the average batsman),

is it then valid to say the probability of that specific batsman being out on the next ball to that specific bowler is going to be 0.06 * (0.01 / 0.03) = 0.02?
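The proposed calculation can be sketched with made-up counts (all specific numbers here are invented for illustration; only the resulting rates match those in the question):

```python
# Hypothetical counts, chosen so the rates match the question's example.
overall_outs, overall_balls = 3_000, 100_000   # overall rate: 0.03
bowler_outs, bowler_balls = 300, 5_000         # bowler A's rate: 0.06
batsman_outs, batsman_balls = 40, 4_000        # batsman B's rate: 0.01

p_overall = overall_outs / overall_balls
p_bowler = bowler_outs / bowler_balls
p_batsman = batsman_outs / batsman_balls

# Proposed estimate: scale the bowler's rate by how the batsman
# compares to the average batsman.
p_pair = p_bowler * (p_batsman / p_overall)
print(round(p_pair, 3))  # 0.02
```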

**Answer**

> If I took the whole dataset and divided the total number of balls that got a batsman out by the total number of balls bowled, I can see that I would have the average probability of a bowler getting a batsman out – it will be around 0.03 (hopefully I haven’t gone wrong already?)

Unfortunately, this is maybe already not exactly what you’re looking for.

Suppose we have a single bowler, and two batsmen: Don Bradman and me. (I know very very little about cricket, so if I’m doing something way off here, let me know.) The games go something like:

- Don goes to bat, and is out on the 99th ball.
- I go to bat, and am immediately out.
- Don goes to bat, and is out on the 99th ball.
- I go to bat, and am immediately out.

In this case, there are four outs out of 200 balls, so the marginal probability of a bowler getting a batsman out is estimated as 4/200 = 2%. But really, the Don’s probability of being out on any given ball is more like 1%, whereas mine is 100%. So if you choose a batsman and a bowler at random, the probability that this bowler gets this batsman out this time is more like (50% chance you picked Don) × (1% chance he gets out) + (50% chance you picked me) × (100% chance I get out) = 50.5%. But if you choose a *ball* at random, then it’s a 2% chance that it gets an out. So you need to think carefully about which of those sampling models you’re thinking of.
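The arithmetic above can be checked with a tiny script (numbers taken straight from the toy example):

```python
# Two innings each for Don (out on ball 99) and for me (out on the first ball).
don_balls, my_balls = 99, 1
innings_each = 2

total_balls = innings_each * (don_balls + my_balls)  # 200
total_outs = innings_each * 2                        # 4

# Pick a *ball* at random: the marginal per-ball out rate.
per_ball_rate = total_outs / total_balls             # 0.02
print(per_ball_rate)

# Pick a *batsman* at random instead: average the per-batsman out rates.
don_rate, my_rate = 1 / don_balls, 1 / my_balls      # ~0.01 and 1.0
per_batsman_rate = 0.5 * don_rate + 0.5 * my_rate
print(round(per_batsman_rate, 3))  # 0.505
```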

Anyway, your proposal is not crazy. More symbolically: let $b$ be the bowler and $m$ the batsman, and let $f(b, m)$ be the probability that $b$ gets $m$ out. Then you’re saying:

$$f(b, m) = \frac{\mathbb{E}_{m'}[f(b, m')] \, \mathbb{E}_{b'}[f(b', m)]}{\mathbb{E}_{b', m'}[f(b', m')]}.$$

This does have the desired property that:

$$\mathbb{E}_{b, m}[f(b, m)] = \frac{\mathbb{E}_{b, m'}[f(b, m')] \, \mathbb{E}_{b', m}[f(b', m)]}{\mathbb{E}_{b', m'}[f(b', m')]} = \mathbb{E}_{b, m}[f(b, m)];$$

it’s similarly consistent if you take means over only $b$ or $m$.

Note that in this case we can assign

$$C := \mathbb{E}_{b, m}[f(b, m)] \qquad g(b) := \mathbb{E}_{m}[f(b, m)] / \sqrt{C} \qquad h(m) := \mathbb{E}_{b}[f(b, m)] / \sqrt{C}$$

so that $f(b, m) = g(b) \, h(m)$.
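To see this assignment work numerically, here’s a small sketch with toy values, assuming $f$ really is exactly multiplicative (the `g0`/`h0` arrays are invented bowler and batsman effects):

```python
import numpy as np

rng = np.random.default_rng(0)
# Assume f really is multiplicative: f(b, m) = g0[b] * h0[m] (toy values).
g0 = rng.uniform(0.1, 0.3, size=4)   # bowler effects
h0 = rng.uniform(0.1, 0.3, size=5)   # batsman effects
f = np.outer(g0, h0)                 # f[b, m]

C = f.mean()                         # E_{b,m}[f(b, m)]
g = f.mean(axis=1) / np.sqrt(C)      # E_m[f(b, m)] / sqrt(C)
h = f.mean(axis=0) / np.sqrt(C)      # E_b[f(b, m)] / sqrt(C)

# f is recovered exactly (up to floating point) as the outer product g h.
print(np.allclose(np.outer(g, h), f))  # True
```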

Your assumption is that you can observe g(b) and h(m) reasonably well from the data. As long as (a) you have enough games [which you do] and (b) the players all play each other with reasonably similar frequencies, then this is fine.

To elaborate on (b) a bit: imagine that you have data from a bunch of professional games, and a bunch of games of me playing with my friends. If there’s no overlap, maybe I look really good compared to my friends, so maybe you think I’m much better than the worst professional player. This is obviously false, but you don’t have any data to refute that. If you have a little overlap though, where I played against a professional player one time and got destroyed, then the data does support ranking me and my friends as all way worse than the pros, but your method wouldn’t account for it. Technically, the problem here is that you’re assuming you have a good sample for e.g. Eb′[f(b′,m)], but your b′ distribution is biased.

Of course your data won’t look this bad, but depending on the league structure or whatever, it might have some elements of that problem.

You can try working around it with a different approach. The proposed model for $f$ is actually an instance of the low-rank matrix factorization models common in collaborative filtering, as in the Netflix problem. There, you choose the functions $g(b)$ and $h(m)$ to be of dimension $r$, and represent $f(b, m) = g(b)^T h(m)$. You can interpret $r > 1$ as complexifying your model from a single “quality” score to having scores along multiple dimensions: perhaps certain bowlers do better against certain types of batsmen. (This has been done e.g. for NBA games.)

The reason they’re called matrix factorization models is that if you make a matrix $F$ with as many rows as bowlers and as many columns as batsmen, you can write this as

$$\underbrace{\begin{bmatrix} f(b_1, m_1) & f(b_1, m_2) & \dots & f(b_1, m_M) \\ f(b_2, m_1) & f(b_2, m_2) & \dots & f(b_2, m_M) \\ \vdots & \vdots & \ddots & \vdots \\ f(b_N, m_1) & f(b_N, m_2) & \dots & f(b_N, m_M) \end{bmatrix}}_{F} = \underbrace{\begin{bmatrix} g(b_1) \\ \vdots \\ g(b_N) \end{bmatrix}}_{G} \underbrace{\begin{bmatrix} h(m_1) \\ \vdots \\ h(m_M) \end{bmatrix}^T}_{H^T}$$

where you’ve factored an N×M matrix F into an N×r one G and an M×r one H.
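The shapes can be sketched quickly (the sizes `N`, `M`, and `r` here are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, r = 6, 8, 2             # bowlers, batsmen, rank (toy sizes)

G = rng.normal(size=(N, r))   # one r-dimensional vector per bowler
H = rng.normal(size=(M, r))   # one r-dimensional vector per batsman
F = G @ H.T                   # N x M matrix of scores f(b_i, m_j)

print(F.shape)  # (6, 8)
# Each entry is the inner product g(b_i)^T h(m_j):
print(np.isclose(F[2, 5], G[2] @ H[5]))  # True
```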

Of course, you don’t get to observe F directly. The usual model is that you get to observe noisy entries of F at random; in your case, you get to observe a draw from a binomial distribution with a random number of trials for each entry of F.

You could construct a probability model like, say:

$$G_{ik} \sim \mathcal{N}(0, \sigma_G^2) \qquad H_{jk} \sim \mathcal{N}(0, \sigma_H^2) \qquad F_{ij} = G_i^T H_j \qquad R_{ij} \sim \operatorname{Binomial}(n_{ij}, F_{ij})$$

where the $n_{ij}$ and $R_{ij}$ are observed, and you’d probably put some hyperpriors over $\sigma_G$ and $\sigma_H$ and do inference e.g. in Stan.

This isn’t a perfect model: for one, it ignores that $n$ is correlated with the scores (as I mentioned in the first section), and more importantly, it doesn’t constrain $F_{ij}$ to be in $[0, 1]$ (you’d probably use a logistic sigmoid or similar to achieve that). A related article, with more complex priors for $G$ and $H$ (but that doesn’t use the binomial likelihood), is: Salakhutdinov and Mnih, *Bayesian probabilistic matrix factorization using Markov chain Monte Carlo*, ICML 2008.

**Attribution**
*Source: Link, Question Author: Ravi, Answer Author: Danica*