# Expected numbers of distinct colors when drawing without replacement

Consider an urn containing $N$ balls of $P$ different colors, with $p_i$
being the proportion of balls of color $i$ among the $N$ balls ($\sum_i p_i = 1$). I draw $n \leq N$ balls from the urn without replacement and look
at the number $\gamma$ of different colors among the balls that were drawn. What is the
expectation of $\gamma$ as a function of $n/N$, depending on suitable
properties of the distribution $\mathbf{p}$?

To give more insight: if $N = P$ and $p_i = 1/P$ for all $i$, then I will always see
exactly $n$ colors, that is, $\gamma = P (n/N)$. Otherwise, it can be
shown that the expectation of $\gamma$ is $>P(n/N)$. For fixed $P$ and $N$, it
would seem that the factor by which to multiply $n/N$ would be maximal
when $\mathbf{p}$ is uniform; maybe the expected number of different
colors seen be bounded as a function of $n/N$ and, e.g., the entropy of
$\mathbf{p}$?

This seems related to the coupon collector’s problem, except that
sampling is performed without replacement, and the distribution of the
coupons is not uniform.

Suppose you have $k$ colors where $k \leq N$. Let $b_i$ denote the number of balls color $i$ so $\sum b_i = N$. Let $B = \{b_1, \ldots, b_k\}$ and let $E_i(B)$ notate the set which consists of the $i$ element subsets of $B$. Let $Q_{n, c}$ denote the number of ways we can choose $n$ elements from the above set such that the number of different colors in the chosen set is $c$. For $c = 1$ the formula is simple:

$$Q_{n, 1} = \sum_{E \in E_{1}(B)}\binom{\sum_{e \in E}e}{n}$$

For $c = 2$ we can count sets of balls of size $n$ which has at most 2 colors minus the number of sets which have exactly $1$ color:

$$Q_{n,2} = \sum_{E \in E_{2}(B)}\binom{\sum_{e \in E}e}{n} – \binom{k – 1}{1}Q_{n, 1}$$

$\binom{k – 1}{1}$ is the number of ways you can add a color to a fixed color such that you will have 2 colors if you have $k$ colors in total. The generic formula is if you have $c_1$ fixed colors and you want to make $c_2$ colors out of it while having $k$ colors in total($c_1 \leq c_2 \leq k$) is $\binom{k – c_1}{c_2 – c_1}$. Now we have everything to derive the generic formula for $Q_{n, c}$:

$$Q_{n, c} = \sum_{E \in E_{c}(B)}\binom{\sum_{e \in E}e}{n} – \sum_{i = 1}^{c – 1}\binom{k – i}{c – i}Q_{n, i}$$

The probability that you will have exactly $c$ colors if you draw $n$ balls is:

$$P_{n, c} = Q_{n, c} / \binom{N}{n}$$

Also note that $\binom{x}{y} = 0$ if $y > x$.

Probably there are special cases where the formula can be simplified. I didn’t bother to find those simplifications this time.

The expected value you’re looking for the number of colors dependent on $n$ is the following:

$$\gamma_{n} = \sum_{i = 1}^{k} P_{n, i} * i$$