Let’s suppose we have 4 discrete random variables, say X1,X2,X3,X4, with 3,2,2 and 3 states, respectively.
Then the joint probability distribution would require 3⋅2⋅2⋅3−1 = 35 parameters (we don’t know of any independence relations).
Considering the Chain Rule, and the fact that you need one parameter, p, for the marginal distribution of each node with two states, and 2 for the ones with 3 states, we have

P(X4,X3,X2,X1) = P(X4|X3,X2,X1) P(X3|X2,X1) P(X2|X1) P(X1)
so we need 3⋅2⋅2⋅2 parameters for the first conditional probability distribution (as there are 3⋅2⋅2 = 12 combinations of the first three variables, and we need the 2 free parameters of X4 for each one), 3⋅2 for the second one, 3 for the third one and 2 for the last one.
So… do we need 3⋅2⋅2⋅2+3⋅2+3+2 parameters?
Is it actually true that 3⋅2⋅2⋅2+3⋅2+3+2 = 3⋅2⋅2⋅3−1 ?
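As a quick sanity check (not part of the original question), the two counts can be verified with a few lines of Python; the chain-rule count adds (k−1) free parameters per combination of the conditioning variables:

```python
# Count free parameters two ways for discrete variables with 3, 2, 2, 3 states.
states = [3, 2, 2, 3]

# Full joint table, minus the one sum-to-one constraint.
joint = 1
for k in states:
    joint *= k
full_count = joint - 1  # 36 - 1 = 35

# Chain-rule factorization P(X1) P(X2|X1) P(X3|X2,X1) P(X4|X3,X2,X1):
# each factor contributes (k - 1) parameters per combination of its conditioners.
chain_count = 0
prefix = 1  # number of combinations of the conditioning variables so far
for k in states:
    chain_count += prefix * (k - 1)
    prefix *= k

print(full_count, chain_count)  # both print 35
```

The loop reproduces the hand count above: 2 + 3 + 6 + 24 = 35.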
Answer
It takes 3×2×2×3=36 numbers to write down a probability distribution on all possible values of these variables. They are redundant, because they must sum to 1. Therefore the number of (functionally independent) parameters is 35.
If you need more convincing (that was a rather hand-waving argument), read on.
By definition, a sequence of such random variables is a measurable function
X = (X1, X2, X3, X4): Ω → R^4
defined on a probability space (Ω,F,P). By limiting the range of X1 to a set of three elements (“states”), etc., you guarantee the range of X itself is limited to 3×2×2×3=36 possible values. Any probability distribution for X can be written as a set of 36 probabilities, one for each one of those values. The axioms of probability impose 36+1 constraints on those probabilities: they must be nonnegative (36 inequality constraints) and sum to unity (one equality constraint).
Conversely, any set of 36 numbers satisfying all 37 constraints gives a possible probability measure on Ω. It should be obvious how this works, but to be explicit, let’s introduce some notation:

Let the possible values of X_i be a_i^(1), a_i^(2), …, a_i^(k_i), where X_i has k_i possible values.

Let the nonnegative numbers, summing to 1, associated with a = (a_1^(i1), a_2^(i2), a_3^(i3), a_4^(i4)) be written p_{i1 i2 i3 i4}.

For any vector of possible values a for X, we know (because random variables are measurable) that X^{−1}(a) = {ω ∈ Ω ∣ X(ω) = a} is a measurable set (in F). Define P(X^{−1}(a)) = p_{i1 i2 i3 i4}.
It is trivial to check that P is a probability measure on (Ω, F).
The set of all such p_{i1 i2 i3 i4} (36 numbers in all), with nonnegative values summing to unity, forms the unit simplex in R^36.
We have thereby established a natural one-to-one correspondence between the points of this simplex and the set of all possible probability distributions of such an X (regardless of what Ω or F might happen to be). The unit simplex in this case is a 36−1 = 35-dimensional submanifold-with-corners: any continuous (or differentiable, or algebraic) coordinate system for this set requires 35 numbers.
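The correspondence is easy to exhibit numerically; in this sketch (using NumPy, and a Dirichlet draw purely as a convenient way to sample the simplex), any 36 nonnegative numbers summing to 1 define a valid joint distribution over the 3·2·2·3 states:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(36))   # a random point on the 35-dimensional unit simplex

assert np.all(p >= 0)            # the 36 inequality constraints
assert np.isclose(p.sum(), 1.0)  # the 1 equality constraint

# Reshape so the entries are indexed by the states (i1, i2, i3, i4) of (X1, X2, X3, X4).
joint = p.reshape(3, 2, 2, 3)
print(joint[0, 1, 0, 2])         # the probability of one particular joint state
```

Marginals and conditionals of the distribution are then just sums and ratios over the axes of `joint`.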
This construction is closely related to a basic tool used by Efron, Tibshirani, and others for studying the Bootstrap, as well as to the influence function used to study M-estimators. It is called the “sampling representation.”
To see the connection, suppose you have a batch of 36 data points y1,y2,…,y36. A bootstrap sample consists of 36 independent realizations from the random variable X that has a p1=1/36 chance of equaling y1, a p2=1/36 chance of equaling y2, and so on: it is the empirical distribution.
To understand the properties of the Bootstrap and other resampling statistics, Efron et al. consider modifying this to some other distribution where the pi are no longer necessarily equal to one another. For instance, by changing pk to 1/36+ϵ and changing all the other pj (j≠k) by −ϵ/35 you obtain (for sufficiently small ϵ) a distribution that represents overweighting the data value yk (when ϵ is positive) or underweighting it (when ϵ is negative) or even deleting it altogether (when ϵ=−1/36), which leads to the “Jackknife”.
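The reweighting scheme just described can be sketched in a few lines (the index k and the deletion case ϵ = −1/36 are chosen for illustration):

```python
import numpy as np

n, k = 36, 7          # k is an arbitrary data index chosen for illustration
eps = -1.0 / n        # the deletion (jackknife) case

p = np.full(n, 1.0 / n)           # the empirical distribution
p[k] += eps                       # perturb weight k by eps ...
p[np.arange(n) != k] -= eps / (n - 1)  # ... and spread -eps/35 over the rest

assert np.isclose(p.sum(), 1.0)   # still a point on the simplex
assert np.isclose(p[k], 0.0)      # point k is deleted
assert np.allclose(p[np.arange(n) != k], 1.0 / (n - 1))  # jackknife weights 1/35
```

Note that after deleting point k, the remaining 35 weights are each exactly 1/35, i.e. the empirical distribution of the leave-one-out sample.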
As such, this representation of all the weighted resampling possibilities by means of a vector p = (p1, p2, …, p36) allows us to visualize and reason about different resampling schemes as points on the unit simplex. The influence function of the value yk for any (differentiable) functional statistic t, for instance, is simply proportional to the partial derivative of t with respect to pk.
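For a concrete (illustrative, not from the original answer) instance of this last claim, take the weighted-mean functional t(p) = Σ_i p_i y_i, whose partial derivative with respect to p_k is exactly y_k; a finite difference recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=36)           # a hypothetical batch of 36 data values
p = np.full(36, 1.0 / 36)         # the empirical distribution

def t(p):
    return p @ y                  # the functional statistic (here, a weighted mean)

k, h = 5, 1e-6
p_plus = p.copy()
p_plus[k] += h                    # bump the weight on data point k
numeric = (t(p_plus) - t(p)) / h  # one-sided finite difference in p_k

assert np.isclose(numeric, y[k], atol=1e-4)  # dt/dp_k = y_k
```

For the mean, the classical influence function at yk is yk minus the mean; the extra constant comes from restricting the derivative to directions that stay on the simplex, which is why the text says “proportional to” rather than “equal to”.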
Reference
Efron and Tibshirani (1993), An Introduction to The Bootstrap (Chapters 20 and 21).
Attribution
Source: Link, Question Author: D1X, Answer Author: Community