What is the expected number of times you must roll a die until each side has appeared 3 times?
This question was asked in primary school in New Zealand and it was solved using simulations. What is the analytical solution for this problem?
Answer
Suppose all $d=6$ sides have equal chances. Let’s generalize and find the expected number of rolls needed until side $1$ has appeared $n_1$ times, side $2$ has appeared $n_2$ times, …, and side $d$ has appeared $n_d$ times. Because the identities of the sides do not matter (they all have equal chances), the description of this objective can be condensed: let us suppose that $i_0$ sides don’t have to appear at all, $i_1$ of the sides need to appear just once, …, and $i_n$ of the sides have to appear $n=\max(n_1,n_2,\ldots,n_d)$ times. Let $$\mathbf{i}=(i_0,i_1,\ldots,i_n)$$ designate this situation and write $$e(\mathbf{i})$$ for the expected number of rolls. The question asks for $e(0,0,0,6)$: $i_3 = 6$ indicates all six sides need to be seen three times each.
An easy recurrence is available. At the next roll, the side that appears corresponds to one of the $i_j$: that is, either we didn’t need to see it, or we needed to see it once, …, or we needed to see it $n$ more times. $j$ is the number of times we needed to see it.

When $j=0$, we didn’t need to see it and nothing changes. This happens with probability $i_0/d$.

When $j \gt 0$ then we did need to see this side. Now there is one less side that needs to seen $j$ times and one more side that needs to be seen $j1$ times. Thus, $i_j$ becomes $i_j1$ and $i_{j1}$ becomes $i_j+1$. Let this operation on the components of $\mathbf{i}$ be designated $\mathbf{i}\cdot j$, so that
$$\mathbf{i}\cdot j = (\color{gray}{i_0, \ldots, i_{j2}}, i_{j1}+1, i_j1, \color{gray}{i_{j+1},\ldots, i_n}).$$
This happens with probability $i_j/d$.
We merely have to count this die roll and use recursion to tell us how many more rolls are expected. By the laws of expectation and total probability,
$$e(\mathbf{i}) = 1 + \frac{i_0}{d}e(\mathbf{i}) + \sum_{j=1}^n \frac{i_j}{d}e(\mathbf{i}\cdot j)$$
(Let’s understand that whenever $i_j=0$, the corresponding term in the sum is zero.)
If $i_0=d$, we are done and $e(\mathbf{i}) =0$. Otherwise we may solve for $e(\mathbf{i})$, giving the desired recursive formula
$$e(\mathbf{i}) = \frac{d + i_1 e(\mathbf{i}\cdot 1) + \cdots + i_n e(\mathbf{i}\cdot n)}{d – i_0}.\tag{1}$$
Notice that $$\mathbf{i} = 0(i_0) + 1(i_1) + \cdots + n(i_n)$$ is the total number of events we wish to see. The operation $\cdot j$ reduces that quantity by one for any $j\gt 0$ provided $i_j \gt 0$, which is always the case. Therefore this recursion terminates at a depth of precisely $\mathbf{i}$ (equal to $3(6) = 18$ in the question). Moreover (as is not difficult to check) the number of possibilities at each recursion depth in this question is small (never exceeding $8$). Consequently, this is an efficient method, at least when the combinatorial possibilities are not too numerous and we memoize the intermediate results (so that no value of $e$ is calculated more than once).
I compute that $$e(0,0,0,6) = \frac{2\,286\,878\,604\,508\,883}{69\,984\,000\,000\,000}\approx 32.677.$$
That seemed awfully small to me, so I ran a simulation (using R
). After over three million rolls of the dice, this game had been played to its completion over 100,000 times, with an average length of $32.669$. The standard error of that estimate is $0.027$: the difference between this average and the theoretical value is insignificant, confirming the accuracy of the theoretical value.
The distribution of lengths may be of interest. (Obviously it must begin at $18$, the minimum number of rolls needed to collect all six sides three times each.)
# Specify the problem
d < 6 # Number of faces
k < 3 # Number of times to see each
N < 3.26772e6 # Number of rolls
# Simulate many rolls
set.seed(17)
x < sample(1:d, N, replace=TRUE)
# Use these rolls to play the game repeatedly.
totals < sapply(1:d, function(i) cumsum(x==i))
n < 0
base < rep(0, d)
i.last < 0
n.list < list()
for (i in 1:N) {
if (min(totals[i, ]  base) >= k) {
base < totals[i, ]
n < n+1
n.list[[n]] < i  i.last
i.last < i
}
}
# Summarize the results
sim < unlist(n.list)
mean(sim)
sd(sim) / sqrt(length(sim))
length(sim)
hist(sim, main="Simulation results", xlab="Number of rolls", freq=FALSE, breaks=0:max(sim))
Implementation
Although the recursive calculation of $e$ is simple, it presents some challenges in some computing environments. Chief among these is storing the values of $e(\mathbf{i})$ as they are computed. This is essential, for otherwise each value will be (redundantly) computed a very large number of times. However, the storage potentially needed for an array indexed by $\mathbf{i}$ could be enormous. Ideally, only values of $\mathbf{i}$ that are actually encountered during the computation should be stored. This calls for a kind of associative array.
To illustrate, here is working R
code. The comments describe the creation of a simple “AA” (associative array) class for storing intermediate results. Vectors $\mathbf{i}$ are converted to strings and those are used to index into a list E
that will hold all the values. The $\mathbf{i}\cdot j$ operation is implemented as %.%
.
These preliminaries enable the recursive function $e$ to be defined rather simply in a way that parallels the mathematical notation. In particular, the line
x < (d + sum(sapply(1:n, function(i) j[i+1]*e.(j %.% i))))/(d  j[1])
is directly comparable to the formula $(1)$ above. Note that all indexes have been increased by $1$ because R
starts indexing its arrays at $1$ rather than $0$.
Timing shows it takes $0.01$ seconds to compute e(c(0,0,0,6))
; its value is
32.6771634160506
Accumulated floating point roundoff error has destroyed the last two digits (which should be 68
rather than 06
).
e < function(i) {
#
# Create a data structure to "memoize" the values.
#
`[[<.AA` < function(x, i, value) {
class(x) < NULL
x[[paste(i, collapse=",")]] < value
class(x) < "AA"
x
}
`[[.AA` < function(x, i) {
class(x) < NULL
x[[paste(i, collapse=",")]]
}
E < list()
class(E) < "AA"
#
# Define the "." operation.
#
`%.%` < function(i, j) {
i[j+1] < i[j+1]1
i[j] < i[j] + 1
return(i)
}
#
# Define a recursive version of this function.
#
e. < function(j) {
#
# Detect initial conditions and return initial values.
#
if (min(j) < 0  sum(j[1])==0) return(0)
#
# Look up the value (if it has already been computed).
#
x < E[[j]]
if (!is.null(x)) return(x)
#
# Compute the value (for the first and only time).
#
d < sum(j)
n < length(j)  1
x < (d + sum(sapply(1:n, function(i) j[i+1]*e.(j %.% i))))/(d  j[1])
#
# Store the value for later reuse.
#
E[[j]] << x
return(x)
}
#
# Do the calculation.
#
e.(i)
}
e(c(0,0,0,6))
Finally, here is the original Mathematica implementation that produced the exact answer. The memoization is accomplished via the idiomatic e[i_] := e[i] = ...
expression, eliminating almost all the R
preliminaries. Internally, though, the two programs are doing the same things in the same way.
shift[j_, x_List] /; Length[x] >= j >= 2 := Module[{i = x},
i[[j  1]] = i[[j  1]] + 1;
i[[j]] = i[[j]]  1;
i];
e[i_] := e[i] = With[{i0 = First@i, d = Plus @@ i},
(d + Sum[If[i[[k]] > 0, i[[k]] e[shift[k, i]], 0], {k, 2, Length[i]}])/(d  i0)];
e[{x_, y__}] /; Plus[y] == 0 := e[{x, y}] = 0
e[{0, 0, 0, 6}]
$\frac{2286878604508883}{69984000000000}$
Attribution
Source : Link , Question Author : Edgar Santos , Answer Author : whuber