I am unable to understand Thompson Sampling and how it works. I was reading about Multi Arm Bandit and after reading Upper Confidence Bound Algorithm, many text suggested that Thompson Sampling performs better than UCB. What is Thompson Sampling, in layman’s or simple terms?
Feel free to provide reference articles for further understanding.
I am going to try to give an explanation without any mathematics. Part of this answer is repeated from some points I made in an answer to another question on MAB problems.
The strategic trade-off in multi-arm bandit problems: In multi-arm bandit problems the gambler plays one “bandit” each round and attempts to maximise his total expected return over a given number of rounds. The expected return of each of the bandits is described by some unknown parameters in the problem, and so as we observe more outcomes in each round, we get more information about these unknown parameters, and hence, about the expected return of each of the bandits. In each round of play (except the last), the MAB problem involves a strategic trade-off by the gambler between two objectives:
Immediate rewards: In each round he would like to choose a distribution that gives him a high expected reward on this round, which entails a preference for distributions he (presently) infers to have a high mean reward;
Future rewards (affected by information gain): On the other hand, he wants to refine his knowledge of the true expected rewards by gaining more information on the distributions (especially those that he has not played as much as others), so that he can improve his choices in future rounds.
The relative importance of these two things will determine the trade-off, and this relative importance is affected by a number of factors. For example, if there is only a small number of remaining rounds in the problem then inference for future trials is relatively less valuable, whereas if there is a large number of remaining rounds then inference for future rewards is relatively more valuable. So the gambler needs to consider how much he wants to focus on maximising the immediate rewards in the current round, and how much he wants to deviate from this, to learn more about the unknown parameters that determine the expected reward of each of the bandits.
Thompson sampling: The basic idea of Thompson sampling is that in each round, we take our existing knowledge of the machines, which is in the form of a posterior belief about the unknown parameters, and we “sample” the parameters from this posterior distribution. This sampled parameter yields a set of expected rewards for each machine, and now we bet on the one with the highest expected return, under that sampled parameter.
Prima facie, the Thompson sampling scheme seems to involve an attempt to maximise the immediate expected return in each round (since it involves this maximisation step after sampling the parameter). However, because it involves random sampling of the parameter from the posterior, the scheme involves an implicit variation of maximising the present reward, versus searching for more information. Most of the time we will get a parameter “sample” that is somewhere in the main part of the posterior, and the choice of machine will roughly approximate maximisation of the immediate reward. However, sometimes we will randomly sample a parameter value that is far in the tails of the posterior distribution, and in that case we will end up choosing a machine that does not maximise the immediate reward – i.e., this will constitute more of a “search” to assist with future rewards.
The Thompson scheme also has the nice property that we tend to decrease our “search” as we get more information, and this mimics the desirable strategic trade-off in the problem, where we want to focus less on searches as we obtain more information. As we get play more and more rounds and get more and more data, the posterior converges closer to the true parameter values and so the random “sampling” in the Thompson scheme becomes more tightly packed around the parameter values that will lead to maximisation of the immediate reward. Hence, there is an implicit tendency of this scheme to be more “search-oriented” early on with little information, and less “search-oriented” later on when there is a lot of data.
Now, having said this, one clear drawback of the Thompson sampling scheme is that it does not take into account the number of rounds remaining in the MAB problem. This scheme is sometimes formulated on the basis of a game with infinite rounds, and in this case that is not an issue. However, in MAB problems with finite rounds, it is preferable to take account of the number of remaining rounds in order to decrease the “search” as the number of future rounds decreases. (And in particular, the optimal play in the last round is to ignore searches completely and just bet on the bandit with the highest posterior expected return.) The Thompson scheme does not do this, so it will play finite-round games in a way that is clearly sub-optimal in certain cases.