What does it mean that a statistic T(X) is sufficient for a parameter?

I am having a hard time understanding what a sufficient statistic actually helps us do.

It says that

Given X_1, X_2, \ldots, X_n from some distribution, a statistic T(X) is sufficient for a parameter \theta if

P(X_1, X_2, \ldots, X_n \mid T(X), \theta) = P(X_1, X_2, \ldots, X_n \mid T(X)).

Meaning, if we know T(X), then we cannot gain any more information about the parameter \theta by considering other functions of the data X_1, X_2, \ldots, X_n.

I have two questions:

  1. It seems to me that the purpose of T(X) is to let us calculate the pdf of a distribution more easily. If calculating the pdf yields a probability measure, then why is it said that we cannot “gain any more information about the parameter \theta”? In other words, why are we focused on T(X) telling us something about \theta when the pdf spits out a probability measure, which isn't \theta?

  2. When it says “we cannot gain any more information about the parameter \theta by considering other functions of the data X_1, X_2, \ldots, X_n,” what other functions are they talking about? Is this akin to saying that if I randomly draw n samples and find T(X), then any other set of n samples I draw will also give T(X)?

Answer

I think the best way to understand sufficiency is to consider familiar examples. Suppose we flip a (not necessarily fair) coin, where the probability of obtaining heads is some unknown parameter p. Then the individual trials are IID Bernoulli(p) random variables, and we can think of the outcome of n trials as a vector \boldsymbol X = (X_1, X_2, \ldots, X_n). Our intuition tells us that, for a large number of trials, a “good” estimate of the parameter p is the statistic \bar X = \frac{1}{n} \sum_{i=1}^n X_i. Now suppose I perform such an experiment. Could you estimate p equally well if I told you only \bar X, rather than the full vector \boldsymbol X? Sure. This is what sufficiency does for us: the statistic T(\boldsymbol X) = \bar X is sufficient for p because it preserves all the information about p contained in the original sample \boldsymbol X. (Proving this claim, however, requires a bit more work; a sketch follows.)
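One way to sketch that proof, using the definition from the question with T(\boldsymbol X) = \sum_{i=1}^n X_i (a one-to-one function of \bar X): for any binary vector \boldsymbol x with \sum_i x_i = t,

P(\boldsymbol X = \boldsymbol x \mid T = t, p) = \frac{P(\boldsymbol X = \boldsymbol x)}{P(T = t)} = \frac{p^t (1 - p)^{n - t}}{\binom{n}{t} p^t (1 - p)^{n - t}} = \frac{1}{\binom{n}{t}},

which does not depend on p. Once the total number of heads is known, the particular arrangement of heads and tails tells you nothing further about p.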

Here is a less trivial example. Suppose I have n IID observations taken from a {\rm Uniform}(0,\theta) distribution, where \theta is the unknown parameter. What is a sufficient statistic for \theta? For instance, suppose I take n = 5 samples and obtain \boldsymbol X = (3, 1, 4, 5, 4). Your estimate of \theta clearly must be at least 5, since you observed such a value; but that is also the most knowledge you can extract from the full sample \boldsymbol X. The other observations convey no additional information about \theta once you have observed X_4 = 5. So we would intuitively expect that the statistic T(\boldsymbol X) = X_{(n)} = \max \boldsymbol X is sufficient for \theta. Indeed, to prove this, we would write the joint density for \boldsymbol X conditioned on \theta and use the Factorization Theorem (the discussion here stays informal, but a short sketch is given below for the curious).
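Roughly, that factorization goes like this: the joint density factors as

f(\boldsymbol x \mid \theta) = \prod_{i=1}^n \frac{1}{\theta} \, \mathbf{1}\{0 \le x_i \le \theta\} = \underbrace{\theta^{-n} \, \mathbf{1}\{x_{(n)} \le \theta\}}_{g(T(\boldsymbol x),\, \theta)} \cdot \underbrace{\mathbf{1}\{x_{(1)} \ge 0\}}_{h(\boldsymbol x)},

so the factor that involves \theta depends on the data only through the maximum x_{(n)}, which is exactly what the Factorization Theorem asks for.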

Note that a sufficient statistic is not necessarily scalar-valued: it may not be possible to reduce the complete sample to a single scalar. This commonly arises when we want sufficiency for multiple parameters (which we can equivalently regard as a single vector-valued parameter). For example, a sufficient statistic for a Normal distribution with unknown mean \mu and standard deviation \sigma is \boldsymbol T(\boldsymbol X) = \left( \frac{1}{n} \sum_{i=1}^n X_i, \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2} \right). These are the familiar sample mean and sample standard deviation: the first is an unbiased estimator of \mu, and the square of the second is an unbiased estimator of \sigma^2 (the sample standard deviation itself is slightly biased for \sigma). We can show that this is the maximum data reduction that can be achieved.

Note also that a sufficient statistic is not unique. In the coin-toss example, if I give you \bar X, you can estimate p; but if I instead give you \sum_{i=1}^n X_i, you can still estimate p equally well. In fact, any one-to-one function g of a sufficient statistic T(\boldsymbol X) is also sufficient, since you can invert g to recover T. So for the Normal example with unknown mean and standard deviation, I could also have claimed that \left( \sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2 \right), i.e., the sum and the sum of squared observations, is sufficient for (\mu, \sigma). Indeed, the non-uniqueness of sufficiency is even more obvious when we notice that \boldsymbol T(\boldsymbol X) = \boldsymbol X is always sufficient for any parameter(s): the original sample always contains as much information about them as we can gather.
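A rough sketch of why the sum and the sum of squares work: expanding the square in the joint Normal density gives

f(\boldsymbol x \mid \mu, \sigma) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \left( \sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2 \right) \right),

which depends on the data only through \left( \sum_i x_i, \sum_i x_i^2 \right). Since the pair (\bar X, s) from the previous paragraph is a one-to-one function of this pair, either version is sufficient for (\mu, \sigma).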

In summary, sufficiency is a desirable property of a statistic because it allows us to formally show that a statistic achieves some kind of data reduction. A sufficient statistic that achieves the maximum amount of data reduction is called a minimal sufficient statistic.
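If it helps to see the defining property numerically, here is a small simulation sketch for the coin-flip example (assuming Python with NumPy; the variable names are illustrative only). Conditional on the total number of heads T = 2 out of n = 4 flips, every arrangement of the individual flips shows up with frequency roughly 1/\binom{4}{2} = 1/6, whatever p happens to be.

    import numpy as np
    from collections import Counter

    # Empirical check: for IID Bernoulli(p) flips, the conditional distribution
    # of the full sample given T = number of heads should not depend on p.
    rng = np.random.default_rng(0)
    n, t, reps = 4, 2, 200_000

    for p in (0.3, 0.7):
        flips = (rng.random((reps, n)) < p).astype(int)   # reps experiments of n flips each
        conditioned = flips[flips.sum(axis=1) == t]       # keep only experiments with T = t
        counts = Counter(map(tuple, conditioned))
        total = sum(counts.values())
        print(f"p = {p}:")
        for arrangement, count in sorted(counts.items()):
            print(f"  {arrangement}: {count / total:.3f}")  # each ~ 1/6 regardless of p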

Attribution
Source: Link, Question Author: user123276, Answer Author: heropup
