Likelihood Function is Minimal Sufficient

What does it mean to say that “Likelihood Function is Minimal Sufficient”? Is this a general statement, or does it apply only to the exponential family of distributions?

I think I understand the concept of sufficient statistics and minimal sufficient statistics.

The likelihood function, on the other hand, is a function of the parameter with the data plugged in. We cannot calculate its value without knowing the parameter. It is different from, say, the “sample mean” statistic for the mean parameter of a Normal distribution, which yields a concrete number. How is this a statistic, then?

Then there is this lecture note (Section 2), where Prof. Wasserman gets philosophical about the issue, which confused me even more.


Consider an observable data vector $\mathbf{x} = (x_1, ..., x_n) \in \mathcal{X}$ with a joint distribution that is indexed by the parameter $\theta \in \Theta$. It is possible to establish that “the likelihood function is minimal sufficient” in a sense described more specifically below.

Your misgivings here occur because you say that the likelihood is a function that depends on the parameter, and so we cannot calculate its value without the parameter. (This presumably means that it is not even a statistic, let alone a minimal sufficient statistic!) What you point out is true, but it misses the point, because the result asserted in your question is not that the value of the likelihood function at a particular point is minimal sufficient, but that the function is minimal sufficient. We will see below that when we talk about minimal sufficiency of “the likelihood function” we are really referring to the likelihood function in its broadest sense, as a function of the data vector and the parameter.

The mapping from the support of the data to the likelihood function is a sufficient statistic: For a particular observed instance of the data vector $\mathbf{x}$, the likelihood function is the mapping $L_\mathbf{x}: \Theta \rightarrow [0, \infty)$. It is a mapping that maps each value of $\theta$ to a real output, but the function itself is fixed by the specified value of $\mathbf{x}$ and the domain $\Theta$. (This is just an instance where we have to be careful to distinguish a function from its value at a specific argument value.) Now, let $\mathscr{T} \equiv [0, \infty)^\Theta$ be the space of all mappings from the parameter space $\Theta$ to the non-negative real numbers. For each data vector $\mathbf{x} \in \mathcal{X}$ there is a corresponding likelihood function $L_\mathbf{x} \in \mathscr{T}$, and so we may consider that there is a mapping $T: \mathcal{X} \rightarrow \mathscr{T}$ that maps each possible observable data vector onto its corresponding likelihood function (i.e., we have $T(\mathbf{x}) = L_\mathbf{x}$). This mapping is “the likelihood function” in its broadest sense, where we have not yet specified the observed data vector.

You can see that this function $T$ is a mapping from the domain of possible data vectors to a fixed codomain, so it is a statistic. This is the sense in which we can consider the “likelihood function” to be a statistic. That is, the “likelihood function” can be considered as a statistic if we have not yet specified the observed data vector (so that it is a function of this data vector), and we are looking at the object as a function of the parameter, rather than looking at an output value that accrues from a particular parameter value. With this elaboration, we now understand that the function $T$ is “the likelihood function” in this broadest sense.
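To make the abstraction concrete, here is a small Python sketch of my own (an illustration, not from the linked notes), using an i.i.d. Bernoulli model. The point is that the value returned by the statistic is a *function* of the parameter, not a number, and two data vectors with the same sufficient summary map to the same function:

```python
import numpy as np

def likelihood_map(x):
    """T : X -> script-T, mapping a data vector x to its likelihood function L_x.

    The model here is i.i.d. Bernoulli(theta), so
    L_x(theta) = theta^sum(x) * (1 - theta)^(n - sum(x)).
    """
    x = np.asarray(x)
    s, n = x.sum(), x.size
    def L_x(theta):
        return theta**s * (1.0 - theta)**(n - s)
    return L_x  # the *function* L_x is the value of the statistic

# Two different data vectors with the same summary (sum = 1, n = 3)
L1 = likelihood_map([1, 0, 0])
L2 = likelihood_map([0, 1, 0])

# Their likelihood functions coincide on the whole parameter space:
thetas = np.linspace(0.01, 0.99, 50)
print(np.allclose(L1(thetas), L2(thetas)))  # True
```

Note that `likelihood_map` returns a closure, mirroring the fact that $T$ takes values in a function space rather than in the real numbers.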

Theorem 1: The function $T$ is sufficient for $\theta$.

Proof: For all $\theta \in \Theta$ we define the evaluation function $g_\theta: \mathscr{T} \rightarrow \mathbb{R}$ by $g_\theta(f) = f(\theta)$. Using the fact that the likelihood function is proportional to the sampling density, we can write the sampling density as:

$$p(\mathbf{x} | \theta) = h(\mathbf{x}) \cdot L_\mathbf{x}(\theta) = h(\mathbf{x}) \cdot g_\theta(L_\mathbf{x}) = h(\mathbf{x}) \cdot g_\theta(T(\mathbf{x})),$$

where $h$ is a function of the data alone. This establishes the Fisher-Neyman factorisation, which establishes the theorem. $\blacksquare$
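As a numerical sanity check of the factorisation, the following Python sketch (my own illustration, again for an i.i.d. Bernoulli model, where the likelihood equals the density so $h(\mathbf{x}) = 1$) verifies that evaluating $g_\theta$ at $T(\mathbf{x}) = L_\mathbf{x}$ recovers the sampling density:

```python
import numpy as np

def density(x, theta):
    """Joint sampling density p(x | theta) of an i.i.d. Bernoulli sample."""
    x = np.asarray(x)
    return float(np.prod(theta**x * (1 - theta)**(1 - x)))

def L(x):
    """The likelihood function L_x, i.e. the value T(x) of the statistic."""
    x = np.asarray(x)
    s, n = x.sum(), x.size
    return lambda theta: theta**s * (1 - theta)**(n - s)

x = [1, 0, 1, 1]
for theta in (0.2, 0.5, 0.8):
    g_theta = lambda f: f(theta)  # the evaluation functional g_theta(f) = f(theta)
    # Factorisation p(x | theta) = h(x) * g_theta(T(x)), with h(x) = 1 here:
    assert np.isclose(density(x, theta), 1.0 * g_theta(L(x)))
```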

Demonstrating minimal sufficiency: The above mapping $T$ is a sufficient statistic, but it is not generally minimal sufficient for $\theta$. To ensure minimal sufficiency, we need to narrow things down further than this, since the likelihood function is defined non-arbitrarily only up to proportionality. For this reason, if we want to get minimal sufficiency, we will need to consider two proportionate likelihood functions $L_\mathbf{x}$ and $L_{\mathbf{x}'}$ to be “the same” function. In the linked lecture notes, you will see that this is done by defining an equivalence relation as follows:

$$\mathbf{x} \sim \mathbf{x}' \quad \iff \quad L_\mathbf{x} \propto L_{\mathbf{x}'}, \quad \text{i.e., } L_\mathbf{x}(\theta) = c(\mathbf{x}, \mathbf{x}') \cdot L_{\mathbf{x}'}(\theta) \text{ for all } \theta \in \Theta.$$

This equivalence relation says that two likelihood functions $L_\mathbf{x}$ and $L_{\mathbf{x}'}$ are “the same” when they are proportionate. It induces a corresponding partition on $\mathcal{X}$ that separates the observable data vectors into sets that yield “the same” likelihood function (in the sense of proportionality).
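A hypothetical Poisson example in Python shows why proportionality, rather than equality, is the right notion: two samples with the same total have likelihoods that differ by a constant factor (coming from the $\prod_i x_i!$ term in the density) yet carry the same information about the rate parameter:

```python
import numpy as np
from math import factorial, prod

def poisson_likelihood(x):
    """L_x(lam) for an i.i.d. Poisson(lam) sample x, keeping the
    data-dependent constant 1 / prod(x_i!) from the density."""
    s, n = sum(x), len(x)
    c = prod(factorial(xi) for xi in x)
    return lambda lam: lam**s * np.exp(-n * lam) / c

Lx  = poisson_likelihood([1, 3])   # sum = 4, constant 1/(1! * 3!) = 1/6
Lxp = poisson_likelihood([2, 2])   # sum = 4, constant 1/(2! * 2!) = 1/4

# The two likelihood functions are not equal, but their ratio is constant
# in lam, so the data vectors fall in the same equivalence class:
lams = np.linspace(0.1, 5.0, 50)
ratio = Lx(lams) / Lxp(lams)
print(np.allclose(ratio, ratio[0]))  # True
```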

Theorem 2: The partition induced by this equivalence relation is the minimal sufficient partition for $\theta$.

This theorem is left as an exercise in the linked lecture notes. The proof can be done by first establishing the sufficiency of the partition (which is assured by Theorem 1 above), and then showing that if you coarsen the partition, sufficiency is lost. Since this is set as an exercise in the notes for the current question, I will not prove the result here. Hopefully the above information gets you started, and more clearly establishes the statistic that you are dealing with.

The fact that the likelihood function (conceived as a statistic) is minimal sufficient really should not be surprising. Indeed, it is practically a tautology, since sufficiency can be framed as a condition in terms of the likelihood function. Sufficiency means that the statistic captures all the required information about the indexing parameter. Since the likelihood is proportional to the sampling density, it is hardly surprising that it does this. Minimal sufficiency means that it does so without additional information, which is easily achieved by partitioning down to the level where the partition of interest is set by proportionality of the likelihood function.

The reason for the importance of minimal sufficiency is that every minimal sufficient statistic induces the same partition as the one induced by the proportionality equivalence relation on the likelihood function. This means that every minimal sufficient statistic is a stand-in for likelihood proportionality.
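This coincidence of partitions can be checked by brute force in a tiny case. The following Python sketch of mine, for an i.i.d. Bernoulli model with $n = 3$, confirms that the partition of the sample space by likelihood-proportionality is exactly the partition induced by the familiar minimal sufficient statistic $\sum_i x_i$:

```python
import numpy as np
from itertools import product

thetas = np.linspace(0.05, 0.95, 19)  # grid standing in for the parameter space

def L(x):
    """Bernoulli likelihood L_x evaluated on the grid of theta values."""
    s, n = sum(x), len(x)
    return thetas**s * (1 - thetas)**(n - s)

def proportional(x, xp):
    """True when L_x and L_x' differ only by a constant factor."""
    r = L(x) / L(xp)
    return bool(np.allclose(r, r[0]))

space = list(product([0, 1], repeat=3))  # all 8 observable data vectors

# Likelihood-proportionality groups x with x' exactly when sum(x) == sum(x'):
same_partition = all(
    proportional(x, xp) == (sum(x) == sum(xp))
    for x in space for xp in space
)
print(same_partition)  # True
```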

Source: Link, Question Author: Cagdas Ozgenc, Answer Author: Ben
