# What is the number of parameters needed for a joint probability distribution?

Let’s suppose we have $4$ discrete random variables, say $X_1, X_2, X_3, X_4$, with $3,2,2$ and $3$ states, respectively.

Then the joint probability distribution would require $3 \cdot 2 \cdot 2 \cdot 3 - 1$ parameters (since we do not assume any independence relations).
Considering the Chain Rule,

$$P(X_1,X_2,X_3,X_4) = P(X_4\mid X_1,X_2,X_3)\,P(X_3\mid X_1,X_2)\,P(X_2\mid X_1)\,P(X_1),$$

and considering that the marginal distribution of a node with two states needs one parameter, $p$, while a node with three states needs $2$, we need $3 \cdot 2 \cdot 2 \cdot 2$ parameters for the first conditional probability distribution (there are $3 \cdot 2 \cdot 2$ combinations of the first three variables, and we need the $2$ parameters of $X_4$ for each one), $3 \cdot 2$ for the second one, $3$ for the third one and $2$ for the last one.

So… do we need $3 \cdot 2 \cdot 2 \cdot 2 +3 \cdot 2 + 3 +2$ parameters?

Is it actually true that $3 \cdot 2 \cdot 2 \cdot 2 + 3 \cdot 2 + 3 + 2 = 3 \cdot 2 \cdot 2 \cdot 3 - 1$?
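Both counts can be checked mechanically. Here is a minimal Python sketch (the variable names are my own), which also generalizes to other state counts:

```python
from math import prod

# State counts for X1, X2, X3, X4, as in the question.
states = [3, 2, 2, 3]

# Joint distribution: one probability per joint state, minus one
# degree of freedom for the sum-to-one constraint.
joint_params = prod(states) - 1  # 36 - 1 = 35

# Chain rule: the factor for X_i given its predecessors needs
# (states[i] - 1) free parameters for each combination of the
# conditioning variables.
chain_params = sum(
    (states[i] - 1) * prod(states[:i]) for i in range(len(states))
)

print(joint_params, chain_params)  # both are 35
```

The per-factor terms are $2$, $3$, $6$ and $24$, matching the counts above.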

It takes $3\times 2 \times 2 \times 3 = 36$ numbers to write down a probability distribution on all possible values of these variables. They are redundant, because they must sum to $1$. Therefore the number of (functionally independent) parameters is $35$.

If you need more convincing (that was a rather hand-waving argument), read on.

By definition, a sequence of such random variables is a measurable function

$$\mathbf{X}=(X_1,X_2,X_3,X_4):\Omega\to\mathbb{R}^4$$

defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. By limiting the range of $X_1$ to a set of three elements (“states”), etc., you guarantee the range of $\mathbf{X}$ itself is limited to $3\times 2\times 2 \times 3=36$ possible values. Any probability distribution for $\mathbf{X}$ can be written as a set of $36$ probabilities, one for each of those values. The axioms of probability impose $36+1$ constraints on those probabilities: they must be nonnegative ($36$ inequality constraints) and sum to unity (one equality constraint).

Conversely, any set of $36$ numbers satisfying all $37$ constraints gives a possible probability measure on $\Omega$. It should be obvious how this works, but to be explicit, let’s introduce some notation:

• Let the possible values of $X_i$ be $a_i^{(1)}, a_i^{(2)}, \ldots, a_i^{(k_i)}$, where $X_i$ has $k_i$ possible values.

• Let the nonnegative numbers, summing to $1$, associated with $\mathbf{a}=(a_1^{(i_1)}, a_2^{(i_2)}, a_3^{(i_3)}, a_4^{(i_4)})$ be written $p_{i_1i_2i_3i_4}$.

• For any vector of possible values $\mathbf{a}$ for $\mathbf{X}$, we know (because random variables are measurable) that $\mathbf{X}^{-1}(\mathbf{a}) = \{\omega\in\Omega\mid \mathbf{X}(\omega)=\mathbf{a}\}$ is a measurable set (in $\mathcal{F}$). Define $$\mathbb{P}\left(\mathbf{X}^{-1}(\mathbf{a})\right) = p_{i_1i_2i_3i_4}.$$

It is trivial to check that $\mathbb{P}$ is a probability measure on $(\Omega, \mathcal{F})$.

The set of all such $p_{i_1i_2i_3i_4}$ ($36$ subscript combinations in all), with nonnegative values summing to unity, forms the unit simplex in $\mathbb{R}^{36}$.
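A point of this simplex is easy to exhibit numerically: any $36$ nonnegative numbers, normalized to sum to unity, qualify. A hedged illustration (the variable names are mine, not part of the argument):

```python
import random

random.seed(0)

# Any 36 nonnegative numbers, normalized to sum to unity, give a point
# of the unit simplex in R^36, i.e. a valid joint distribution for
# (X1, X2, X3, X4).  Nothing about Omega or F is needed.
raw = [random.random() for _ in range(36)]
total = sum(raw)
p = [x / total for x in raw]

assert all(x >= 0 for x in p)      # the 36 inequality constraints
assert abs(sum(p) - 1.0) < 1e-12   # the single equality constraint
# 36 coordinates minus 1 equality constraint leaves 35 free parameters.
```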

We have thereby established a natural one-to-one correspondence between the points of this simplex and the set of all possible probability distributions of all such $\mathbf{X}$ (regardless of what $\Omega$ or $\mathcal{F}$ might happen to be). The unit simplex in this case is a $36-1=35$-dimensional submanifold-with-corners: any continuous (or differentiable, or algebraic) coordinate system for this set requires $35$ numbers.

This construction is closely related to a basic tool used by Efron, Tibshirani, and others for studying the Bootstrap as well as to the influence function used to study M-estimators. It is called the “sampling representation.”

To see the connection, suppose you have a batch of $36$ data points $y_1, y_2, \ldots, y_{36}$. A bootstrap sample consists of $36$ independent realizations from the random variable $\mathbf X$ that has a $p_1=1/36$ chance of equaling $y_1$, a $p_2=1/36$ chance of equaling $y_2$, and so on: it is the empirical distribution.
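This resampling from the empirical distribution can be sketched in a few lines of Python (the data values here are placeholders):

```python
import random

random.seed(1)

# Placeholder batch of 36 data points.
y = list(range(36))

# The empirical distribution puts probability p_i = 1/36 on each y_i;
# a bootstrap sample is 36 independent draws from that distribution.
boot = random.choices(y, weights=[1 / 36] * 36, k=36)

print(len(boot))  # 36 values drawn with replacement from y
```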

To understand the properties of the Bootstrap and other resampling statistics, Efron et al. consider modifying this to some other distribution in which the $p_i$ are no longer necessarily equal to one another. For instance, by changing $p_k$ to $1/36 + \epsilon$ and changing all the other $p_j$ ($j\ne k$) by $-\epsilon/35$, you obtain (for sufficiently small $\epsilon$) a distribution that represents overweighting the data value $X_k$ (when $\epsilon$ is positive), underweighting it (when $\epsilon$ is negative), or even deleting it altogether (when $\epsilon=-1/36$), which leads to the “Jackknife”.
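The reweighting scheme is easy to check directly; in particular, $\epsilon = -1/36$ does zero out the $k$-th weight. A small sketch (the index $k$ and the variable names are my own choices):

```python
# Reweight the empirical distribution as described above.
n = 36
k = 5            # which data point to reweight (arbitrary choice)
eps = -1 / n     # this value of epsilon deletes point k: the jackknife

# p_k becomes 1/n + eps; every other weight absorbs -eps/(n - 1).
p = [1 / n + eps if j == k else 1 / n - eps / (n - 1) for j in range(n)]

assert abs(sum(p) - 1.0) < 1e-12   # still a probability vector
assert p[k] == 0.0                 # the k-th point is deleted
```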

As such, this representation of all the weighted resampling possibilities by means of a vector $\mathbf{p} = (p_1, p_2, \ldots, p_{36})$ allows us to visualize and reason about different resampling schemes as points on the unit simplex. The influence function of the value $X_k$ for any (differentiable) functional statistic $t$, for instance, is simply proportional to the partial derivative of $t(X)$ with respect to $p_k$.
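For a concrete sketch of this, take the mean as the functional statistic: perturbing $p_k$ along the resampling direction described above yields a derivative proportional to $y_k - \bar{y}$, the classical influence value for the mean. (The data and all names below are my own illustrative choices.)

```python
# Numerical check with the mean as the statistic: t(p) = sum_j p_j * y_j.
n = 36
y = [float(j * j % 17) for j in range(n)]  # arbitrary data values
ybar = sum(y) / n

def t(p):
    # The mean of the data under the weighting p.
    return sum(pj * yj for pj, yj in zip(p, y))

k, eps = 7, 1e-6
uniform = [1 / n] * n
tilted = [1 / n + eps if j == k else 1 / n - eps / (n - 1) for j in range(n)]
deriv = (t(tilted) - t(uniform)) / eps

# For the mean this derivative equals (n / (n - 1)) * (y[k] - ybar),
# i.e. it is proportional to the influence of y[k] on the mean.
print(deriv)
```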

### Reference

Efron, B. and Tibshirani, R. J. (1993), *An Introduction to the Bootstrap*, Chapters 20 and 21.