Imagine the following setup: You have 2 coins, coin A which is guaranteed to be fair, and coin B which may or may not be fair. You are asked to do 100 coin flips, and your objective is to maximize the number of heads.
Your prior information about coin B is that it was flipped 3 times and yielded 1 head. If your decision rule was simply based on comparing the expected probability of heads of the 2 coins, you would flip coin A 100 times and be done with it. This is true even when using reasonable Bayesian estimations (posterior means) of the probabilities, since you have no reason to believe that coin B yields more heads.
However, what if coin B is actually biased in favor of heads? Surely the “potential heads” you give up by flipping coin B a couple of times (and therefore gaining information about its statistical properties) would be valuable in some sense and therefore would factor into your decision. How can this “value of information” be described mathematically?
Question: How do you construct an optimal decision rule mathematically in this scenario?
This is a particular case of a Multi-Armed bandit problem. I say a particular case because generally we don’t know any of the probabilities of heads (in this case we know one of the coins has probability 0.5).
The issue you raise is known as the exploration vs exploitation dilemma: do you explore the other options, or do you stick with what you think is the best. There is an immediate optimal solution assuming you knew all probabilities: simply choose the coin with the highest probability of winning. The problem, as you have alluded to, is that we are unsure about what the true probabilities are.
There is lots of literature on the subject, and there are many deterministic algorithms, but since you tagged this Bayesian, I’d like to tell you about my personal favourite solution: the Bayesian Bandit!
The Baysian Bandit Solution
The Bayesian approach to this problem is very natural. We are interested in answering “What is the probability that coin X is the better of the two?”.
A priori, assuming we have observed no coin flips yet, we have no idea what the probability of coin B’s Heads might be, denote this unknown $p_B$. So we should assign a prior uniform distribution to this unknown probability. Alternatively, our prior (and posterior) for coin A is trivially concentrated entirely at 1/2.
As you have stated, we observe 2 tails and 1 heads from coin B, we need to update our posterior distribution. Assuming a uniform prior, and flips are Bernoulli coin-flips, our posterior is a $Beta( 1 + 1, 1 + 2)$. Comparing the posterior distributions or A and B now:
Finding an approximately optimal strategy
Now that we have the posteriors, what to do? We are interested in answering “What is the probability coin B is the better of the two” (Remember from our Bayesian perspective, although there is a definite answer to which one is better, we can only speak in probabilities):
$$w_B = P( p_b > 0.5 )$$
The approximately optimal solution is to choose B with probability $w_B$ and A with probability $1 – w_B$. This scheme maximizes out expected gains. $w_B$ can be computed in calculated numerically, as we know the posterior distribution, but an interesting way is the following:
1. Sample P_B from the posterior of coin B 2. If P_B > 0.5, choose coin B, else choose coin A.
This scheme is also self-updating. When we observe the outcome of choosing coin B, we update our posterior with this new information, and select again. This way, if coin B is really bad we will choose it less, and it coin B is in fact really good, we will choose it more often. Of course, we are Bayesians, hence we can never be absolutely sure coin B is better. Choosing probabilistically like this is the most natural solution to the exploration-exploitation dilemma.
This is a particular example of Thompson Sampling. More information, and cool applications to online advertising, can be found in Google’s research paper and Yahoo’s research paper. I love this stuff!