# What does it mean that a statistic $T(X)$ is sufficient for a parameter?

I am having a hard time understanding what a sufficient statistic actually helps us do.

The definition says that

Given $X_1, X_2, \ldots, X_n$ from some distribution, a statistic $T(X)$ is sufficient for a parameter $\theta$ if

$$P(X_1, X_2, \ldots, X_n \mid T(X), \theta) = P(X_1, X_2, \ldots, X_n \mid T(X)).$$

Meaning, if we know $T(X)$, then we cannot gain any more information about the parameter $\theta$ by considering other functions of the data $X_1, X_2, \ldots, X_n$.

I have two questions:

1. It seems to me that the purpose of $T(X)$ is to make it so that we can calculate the pdf of a distribution more easily. If calculating the pdf yields a probability measure, then why is it said that we cannot “gain any more information about the parameter $\theta$”? In other words, why are we focused on $T(X)$ telling us something about $\theta$ when the pdf spits out a probability measure, which isn’t $\theta$?

2. When it says “we cannot gain any more information about the parameter $\theta$ by considering other functions of the data $X_1, X_2, \ldots, X_n$,” what other functions are they talking about? Is this akin to saying that if I randomly draw $n$ samples and find $T(X)$, then any other set of $n$ samples I draw will give $T(X)$ also?

I think the best way to understand sufficiency is to consider familiar examples. Suppose we flip a (not necessarily fair) coin, where the probability of obtaining heads is some unknown parameter $p$. Then individual trials are IID ${\rm Bernoulli}(p)$ random variables, and we can think of the outcome of $n$ trials as a vector $\boldsymbol X = (X_1, X_2, \ldots, X_n)$. Our intuition tells us that for a large number of trials, a “good” estimate of the parameter $p$ is the statistic $$\bar X = \frac{1}{n} \sum_{i=1}^n X_i.$$ Now think about a situation where I perform such an experiment. Could you estimate $p$ equally well if I inform you only of $\bar X$, compared to the full sample $\boldsymbol X$? Sure. This is what sufficiency does for us: the statistic $T(\boldsymbol X) = \bar X$ is sufficient for $p$ because it preserves all the information we can get about $p$ from the original sample $\boldsymbol X$. (Proving this claim, however, needs more explanation.)
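Here is a quick numerical check of that claim (my own sketch, not part of the argument above): by the definition, sufficiency of $\sum_i X_i$ (equivalently $\bar X$) means that the conditional probability of any particular coin-flip sequence, given its sum, does not depend on $p$.

```python
# Verify numerically that P(X = x | sum(X) = t, p) is the same for
# every p -- the defining property of a sufficient statistic.
from itertools import product


def cond_prob(x, p):
    """P(X = x | sum(X) = t, p) for an IID Bernoulli(p) sample x."""
    t, n = sum(x), len(x)
    num = p ** t * (1 - p) ** (n - t)                   # P(X = x | p)
    den = sum(p ** sum(y) * (1 - p) ** (n - sum(y))     # P(sum = t | p)
              for y in product([0, 1], repeat=n) if sum(y) == t)
    return num / den


x = (1, 0, 1, 1, 0)
print(cond_prob(x, 0.3))  # ~0.1 = 1 / C(5, 3)
print(cond_prob(x, 0.7))  # same value: no dependence on p
```

The ratio collapses to $1/\binom{n}{t}$ because the factor $p^t(1-p)^{n-t}$ cancels, which is exactly the Factorization Theorem at work in this special case.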
Here is a less trivial example. Suppose I have $n$ IID observations taken from a ${\rm Uniform}(0,\theta)$ distribution, where $\theta$ is the unknown parameter. What is a sufficient statistic for $\theta$? For instance, suppose I take $n = 5$ samples and obtain $\boldsymbol X = (3, 1, 4, 5, 4)$. Your estimate for $\theta$ clearly must be at least $5$, since you were able to observe such a value. But that is the most knowledge you can extract from knowing the actual sample $\boldsymbol X$: the other observations convey no additional information about $\theta$ once you have observed $X_4 = 5$. So we would intuitively expect that the statistic $$T(\boldsymbol X) = \max_i X_i$$ is sufficient for $\theta$. Indeed, to prove this, we would write the joint density for $\boldsymbol X$ conditioned on $\theta$ and use the Factorization Theorem (but I will omit this in the interest of keeping the discussion informal).
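To make the “no additional information” point concrete, here is a small sketch (mine, with a made-up second sample): for ${\rm Uniform}(0,\theta)$ the joint density is $\theta^{-n}$ whenever every observation lies in $[0, \theta]$, and $0$ otherwise, so any two samples of the same size with the same maximum produce identical likelihood functions for $\theta$.

```python
# The Uniform(0, theta) likelihood depends on the data only through
# the sample size and the sample maximum.
def likelihood(theta, sample):
    """Joint density of an IID Uniform(0, theta) sample, as a function of theta."""
    if max(sample) > theta or min(sample) < 0:
        return 0.0
    return theta ** (-len(sample))


a = (3, 1, 4, 5, 4)   # the sample from the discussion above
b = (5, 2, 2, 2, 2)   # different data, but same n and same maximum
for theta in (4.0, 5.0, 6.0, 10.0):
    print(theta, likelihood(theta, a) == likelihood(theta, b))  # True every time
```

Since the two likelihood functions coincide everywhere, no inference about $\theta$ can distinguish the two samples, which is the intuition behind sufficiency of the maximum.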
Note that a sufficient statistic is not necessarily scalar-valued, for it may not be possible to achieve data reduction of the complete sample into a single scalar. This commonly arises when we want sufficiency for multiple parameters (which we can equivalently regard as a single vector-valued parameter). For example, a sufficient statistic for a Normal distribution with unknown mean $\mu$ and standard deviation $\sigma$ is $$\boldsymbol T(\boldsymbol X) = \left( \bar X, s^2 \right) = \left( \frac{1}{n} \sum_{i=1}^n X_i, \ \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 \right).$$ In fact, these are unbiased estimators of the mean and variance. We can show that this is the maximum data reduction that can be achieved.
Note also that a sufficient statistic is not unique. In the coin toss example, if I give you $\bar X$, that will let you estimate $p$. But if I instead give you $\sum_{i=1}^n X_i$, you can still estimate $p$. In fact, any one-to-one function $g$ of a sufficient statistic $T(\boldsymbol X)$ is also sufficient, since you can invert $g$ to recover $T$. So for the normal example with unknown mean and standard deviation, I could also have claimed that $\left( \sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2 \right)$, i.e., the sum and sum of squared observations, is sufficient for $(\mu, \sigma)$. Indeed, the non-uniqueness of sufficiency is even more obvious when we note that $\boldsymbol T(\boldsymbol X) = \boldsymbol X$ is always sufficient for any parameter(s): the original sample always contains as much information as we can gather.
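As a tiny illustration of that one-to-one correspondence (my own sketch, with arbitrary data): the pair $(\sum X_i, \sum X_i^2)$ determines $(\bar X, s^2)$ via $s^2 = (S_2 - n\bar X^2)/(n-1)$, and vice versa, so both pairs carry exactly the same information about $(\mu, \sigma)$.

```python
# Recover (sample mean, unbiased sample variance) from the pair
# (sum, sum of squares), showing the two sufficient statistics are
# one-to-one functions of each other.
def from_sums(s1, s2, n):
    """Map (sum x_i, sum x_i^2) to (sample mean, unbiased sample variance)."""
    xbar = s1 / n
    s_sq = (s2 - n * xbar ** 2) / (n - 1)
    return xbar, s_sq


x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # arbitrary data
xbar, s_sq = from_sums(sum(x), sum(v * v for v in x), len(x))

# Compare against computing the statistics directly from the sample.
direct_mean = sum(x) / len(x)
direct_var = sum((v - direct_mean) ** 2 for v in x) / (len(x) - 1)
print(xbar == direct_mean, abs(s_sq - direct_var) < 1e-12)  # True True
```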