What is the number of parameters needed for a joint probability distribution?

Let’s suppose we have 4 discrete random variables, say $X_1,X_2,X_3,X_4$, with 3, 2, 2 and 3 states, respectively.

Then the joint probability distribution would require $3\cdot 2\cdot 2\cdot 3-1=35$ parameters (we don’t know any independence relations).
Considering the Chain Rule, and considering the fact that you need one parameter, $p$, for the marginal distribution of each node with two states, and 2 for the ones with 3 states, we have

$$P(X_1,X_2,X_3,X_4)=P(X_4\mid X_1,X_2,X_3)\,P(X_3\mid X_1,X_2)\,P(X_2\mid X_1)\,P(X_1),$$

so we need $3\cdot 2\cdot 2\cdot 2=24$ parameters for the first conditional probability distribution (as there are $3\cdot 2\cdot 2=12$ combinations of the first three variables and we need the 2 parameters of $X_4$ for each one), $3\cdot 2=6$ for the second one, 3 for the third one and 2 for the last one.

So… do we need $3\cdot 2\cdot 2\cdot 2+3\cdot 2+3+2$ parameters?

Is it actually true that $3\cdot 2\cdot 2\cdot 2+3\cdot 2+3+2=3\cdot 2\cdot 2\cdot 3-1$?
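A quick numeric check confirms that the two counts agree (a minimal sketch; the chain-rule factorization order $P(X_1)\,P(X_2\mid X_1)\,P(X_3\mid X_1,X_2)\,P(X_4\mid X_1,X_2,X_3)$ is assumed):

```python
# Number of states of X1..X4.
states = [3, 2, 2, 3]

# Each factor conditioned on the earlier variables needs (k - 1) free
# parameters per combination of the conditioning variables.
chain_params = 0
prefix = 1  # number of combinations of the conditioning variables so far
for k in states:
    chain_params += prefix * (k - 1)
    prefix *= k

# Full joint table, minus the single sum-to-one constraint.
joint_params = 3 * 2 * 2 * 3 - 1

print(chain_params, joint_params)  # both are 35
```

Both counts come out to 35, so the chain-rule bookkeeping loses nothing relative to the full joint table.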


It takes $3\times 2\times 2\times 3=36$ numbers to write down a probability distribution on all possible values of these variables. These numbers are redundant, because they must sum to 1. Therefore the number of (functionally independent) parameters is 35.

If you need more convincing (that was a rather hand-waving argument), read on.

By definition, a sequence of such random variables is a measurable function

$$X=(X_1,X_2,X_3,X_4):\Omega\to\mathbb{R}^4$$

defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$. By limiting the range of $X_1$ to a set of three elements (“states”), etc., you guarantee the range of $X$ itself is limited to $3\times 2\times 2\times 3=36$ possible values. Any probability distribution for $X$ can be written as a set of 36 probabilities, one for each one of those values. The axioms of probability impose $36+1=37$ constraints on those probabilities: they must be nonnegative (36 inequality constraints) and sum to unity (one equality constraint).

Conversely, any set of 36 numbers satisfying all 37 constraints gives a possible probability measure on $\Omega$. It should be obvious how this works, but to be explicit, let’s introduce some notation:

  • Let the possible values of $X_i$ be $a_i^{(1)},a_i^{(2)},\ldots,a_i^{(k_i)}$, where $X_i$ has $k_i$ possible values.

  • Let the nonnegative numbers, summing to 1, associated with $a=(a_1^{(i_1)},a_2^{(i_2)},a_3^{(i_3)},a_4^{(i_4)})$ be written $p_{i_1 i_2 i_3 i_4}$.

  • For any vector of possible values $a$ for $X$, we know (because random variables are measurable) that $X^{-1}(a)=\{\omega\in\Omega \mid X(\omega)=a\}$ is a measurable set (in $\mathcal{F}$). Define $\mathbb{P}(X^{-1}(a))=p_{i_1 i_2 i_3 i_4}$.

It is trivial to check that $\mathbb{P}$ is a probability measure on $(\Omega,\mathcal{F})$.

The set of all such $p_{i_1 i_2 i_3 i_4}$, comprising 36 nonnegative values that sum to unity, forms the unit simplex in $\mathbb{R}^{36}$.

We have thereby established a natural one-to-one correspondence between the points of this simplex and the set of all possible probability distributions of all such $X$ (regardless of what $\Omega$ or $\mathcal{F}$ might happen to be). The unit simplex in this case is a $36-1=35$-dimensional submanifold-with-corners: any continuous (or differentiable, or algebraic) coordinate system for this set requires 35 numbers.
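To make the correspondence concrete, here is a small sketch that enumerates the 36 joint states and checks that an arbitrary point of the simplex satisfies all 37 constraints (the random weights are purely illustrative):

```python
import itertools
import random

random.seed(0)

# The 36 possible values of X = (X1, X2, X3, X4).
states = list(itertools.product(range(3), range(2), range(2), range(3)))

# An arbitrary point of the unit simplex in R^36: normalize positive weights.
raw = [random.random() for _ in states]
total = sum(raw)
p = {s: w / total for s, w in zip(states, raw)}

assert len(p) == 36
assert all(v >= 0 for v in p.values())   # the 36 inequality constraints
assert abs(sum(p.values()) - 1) < 1e-12  # the one equality constraint
```

Any such dictionary `p` defines a valid distribution for $X$, and conversely every distribution of $X$ arises this way.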

This construction is closely related to a basic tool used by Efron, Tibshirani, and others for studying the Bootstrap as well as to the influence function used to study M-estimators. It is called the “sampling representation.”

To see the connection, suppose you have a batch of 36 data points $y_1,y_2,\ldots,y_{36}$. A bootstrap sample consists of 36 independent realizations from the random variable $X$ that has a $p_1=1/36$ chance of equaling $y_1$, a $p_2=1/36$ chance of equaling $y_2$, and so on: it is the empirical distribution.

To understand the properties of the Bootstrap and other resampling statistics, Efron et al. consider modifying this to some other distribution where the $p_i$ are no longer necessarily equal to one another. For instance, by changing $p_k$ to $1/36+\epsilon$ and changing all the other $p_j$ ($j\ne k$) by $-\epsilon/35$, you obtain (for sufficiently small $\epsilon$) a distribution that represents overweighting the data value $y_k$ (when $\epsilon$ is positive) or underweighting it (when $\epsilon$ is negative) or even deleting it altogether (when $\epsilon=-1/36$), which leads to the “Jackknife”.
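The reweighting step above can be sketched directly (a toy illustration with $n=36$; the index $k$ is chosen arbitrarily):

```python
n = 36
k = 5  # arbitrary data point to reweight

def reweight(eps):
    """Move mass eps onto point k, taking eps/(n-1) from each other point."""
    p = [1.0 / n] * n
    p[k] += eps
    for j in range(n):
        if j != k:
            p[j] -= eps / (n - 1)
    return p

p = reweight(0.01)               # overweight y_k
assert abs(sum(p) - 1) < 1e-12   # still a point on the simplex

q = reweight(-1.0 / n)           # the jackknife: delete y_k entirely
assert abs(q[k]) < 1e-15
assert abs(q[0] - 1.0 / (n - 1)) < 1e-12  # the others share the mass equally
```

With $\epsilon=-1/36$ the weight on point $k$ vanishes and each remaining point carries weight $1/35$, which is exactly the leave-one-out (jackknife) distribution.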

As such, this representation of all the weighted resampling possibilities by means of a vector $p=(p_1,p_2,\ldots,p_{36})$ allows us to visualize and reason about different resampling schemes as points on the unit simplex. The influence function of the value $y_k$ for any (differentiable) functional statistic $t$, for instance, is simply proportional to the partial derivative of $t(p)$ with respect to $p_k$.
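For instance, taking the weighted mean $t(p)=\sum_i p_i y_i$ as the statistic (a hypothetical choice, purely for illustration), a finite-difference derivative along this perturbation direction recovers a quantity proportional to $y_k-\bar y$, as the influence function of the mean should be:

```python
n = 36
y = [float(i) for i in range(n)]  # hypothetical batch of data values
k, eps = 5, 1e-6

def t(p):
    """Weighted mean: a simple differentiable functional statistic."""
    return sum(pi * yi for pi, yi in zip(p, y))

base = [1.0 / n] * n  # the empirical distribution
pert = base[:]        # perturbed weights: mass eps moved onto point k
pert[k] += eps
for j in range(n):
    if j != k:
        pert[j] -= eps / (n - 1)

influence = (t(pert) - t(base)) / eps
ybar = sum(y) / n
# Along this direction the derivative is (n / (n - 1)) * (y[k] - ybar),
# i.e. proportional to y_k minus the sample mean.
assert abs(influence - (n / (n - 1)) * (y[k] - ybar)) < 1e-3
```

Since $t$ is linear in $p$, the finite difference is essentially exact here; for a nonlinear statistic the same construction gives a numerical approximation to the influence function.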


Efron, B. and Tibshirani, R. J. (1993), An Introduction to the Bootstrap, Chapters 20 and 21.

Source : Link , Question Author : D1X , Answer Author : Community
