In DeepMind’s AlphaGo Zero and AlphaZero papers, they describe adding Dirichlet noise to the prior probabilities of actions from the root node (board state) in Monte Carlo Tree Search:

Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node s0, specifically P(s, a) = (1 − ε)p_a + ε·η_a, where η ∼ Dir(0.03) and ε = 0.25; this noise ensures that all moves may be tried, but the search may still overrule bad moves.

(AlphaGo Zero)
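The mixing step from the quoted passage can be sketched in a few lines of numpy. The prior vector below is hypothetical; ε = 0.25 and α = 0.03 are the values from the AlphaGo Zero paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_root_noise(priors, epsilon=0.25, alpha=0.03):
    """Mix Dirichlet noise into the root priors: P = (1 - eps) * p + eps * eta."""
    # Dir(alpha) shorthand: one concentration parameter alpha per legal move
    eta = rng.dirichlet(np.full(len(priors), alpha))
    return (1 - epsilon) * priors + epsilon * eta

# hypothetical network prior over 5 legal moves
p = np.array([0.5, 0.3, 0.1, 0.07, 0.03])
noisy = add_root_noise(p)
```

The result is still a probability distribution, and every legal move keeps non-zero mass, which is exactly the "all moves may be tried" property.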

And:

Dirichlet noise Dir(α) was added to the prior probabilities in the root node; this was scaled in inverse proportion to the approximate number of legal moves in a typical position, to a value of α={0.3,0.15,0.03} for chess, shogi and Go respectively.

(AlphaZero)

Two things I don’t understand:

1. `P(s, a)` is an n-dimensional vector. Is Dir(α) shorthand for the Dirichlet distribution with n parameters, each with value α?
2. I've only come across Dirichlet as the conjugate prior of the multinomial distribution. Why was it picked here?

For context, `P(s, a)` is just one component of the PUCT (polynomial upper confidence tree, a variant of upper confidence bounds) calculation for a given state/action. It is scaled by a constant and a visit-count ratio — the square root of the total visits to all sibling actions, divided by one plus the visits to this action — and added to the estimated action value `Q(s, a)`:

`PUCT(s, a) = Q(s, a) + U(s, a)`

where

`U(s, a) = c_puct · P(s, a) · √(Σ_b N(s, b)) / (1 + N(s, a))`
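As a sketch, the selection rule above can be computed as follows. All names and values here are illustrative, and `c_puct` is a tunable constant:

```python
import numpy as np

def puct_scores(q, p, n, c_puct=1.0):
    """PUCT(s, a) = Q(s, a) + c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a))."""
    n = np.asarray(n, dtype=float)
    u = c_puct * np.asarray(p) * np.sqrt(n.sum()) / (1.0 + n)
    return np.asarray(q) + u

# illustrative values for 3 actions at one node
q = np.array([0.10, 0.20, -0.05])  # estimated action values
p = np.array([0.50, 0.30, 0.20])   # network priors
n = np.array([10, 2, 0])           # visit counts
scores = puct_scores(q, p, n)
best = int(np.argmax(scores))      # -> 2: the unvisited move wins despite low Q and P
```

Note how the `1 + N(s, a)` denominator pushes the search toward under-visited actions: the never-tried third move gets the largest bonus.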

**Answer**

Question 1 is straightforward: here α is a vector of repetitions of the given value, i.e. Dir(α) means Dir(α, α, …, α) with one entry per legal move. (As answered by Max S.)

Question 2 is more interesting: The Dirichlet distribution has the following interpretation relevant in this context: when α is the observed vector of outcome counts drawn from some (unknown) categorical distribution with outcome probabilities π, then Dir(α)(π) is the likelihood that Cat(π) is the actual underlying distribution given that you observed the counts α. (This is the conjugate-prior relationship between the Dirichlet and categorical distributions.)
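This interpretation can be checked numerically: samples of π from Dir(α) average out to α / Σα, the empirical frequencies of the observed counts. The counts below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# pretend we observed a good player choose among 4 moves with these counts
counts = np.array([8.0, 4.0, 2.0, 1.0])

# each sample is a plausible underlying categorical parameter vector pi
samples = rng.dirichlet(counts, size=100_000)

empirical_mean = samples.mean(axis=0)
expected = counts / counts.sum()  # [8/15, 4/15, 2/15, 1/15]
```

The sampled distributions concentrate around the observed move frequencies, which is the sense in which Dir(α) gives "reasonable estimates" of the player's policy.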

Now `P(s, a)` estimates the probability that a good player would play `a` in `s`, that is, the parameters of their categorical distribution, which AlphaZero wants to learn. So Dir(α) would sample reasonable estimates for π = `P(s, a)` if we had observed a good player play the moves with counts α.

But if some α_i = 0, then every π ∼ Dir(α) has π_i = 0, preventing exploration. By adding the noise, they effectively assume that they have observed every move being played some small number of times α (here chosen as 0.3, 0.15, or 0.03).

As for how they chose the constants, my guess is that they correspond to having observed ~10 random plays at every node: in chess, Dir(0.3) assumes that you have seen each move played 0.3 times. Given that there are ~35 moves available according to Allis, the authors assume you have seen ~10 random moves at every node. In Go, if we assume ~270 legal moves on average (3/4 of the 361 board points), we get an equivalent of observing ~8 random moves. (I do not have the data for Shogi.)
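The arithmetic behind that guess is just the total pseudo-count α times the branching factor, using the figures from the answer above:

```python
# rough check of the "~10 observed random moves" interpretation
chess_alpha, chess_branching = 0.3, 35   # ~35 legal moves in chess (Allis)
go_alpha, go_branching = 0.03, 270       # ~270 legal moves in Go (3/4 of 361)

chess_pseudo_count = chess_alpha * chess_branching  # ~10.5 observed moves
go_pseudo_count = go_alpha * go_branching           # ~8.1 observed moves
```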

**Attribution:** *Source: Link, Question Author: monk, Answer Author: Tomáš Gavenčiak*