Definition of family of a distribution?

Does a family of a distribution have a different definition for statistics than in other disciplines?

In general, a family of curves is a set of curves, each of which is given by a function or parametrization in which one or more of the parameters is varied. Such families are used, for example, to characterize electronic components.

For statistics, a family according to one source is the result of varying the shape parameter. How then can we understand that the gamma distribution has a shape and scale parameter and only the generalized gamma distribution has, in addition, a location parameter? Does that make the family the result of varying the location parameter? According to @whuber, the meaning of a family is given implicitly by this statement: “A ‘parameterization’ of a family is a continuous map from a subset of $\mathbb{R}^n$, with its usual topology, into the space of distributions, whose image is that family.”

What, in simple language, is a family for statistical distributions?

A question about relations among the statistical properties of distributions from the same family has already generated considerable controversy in a different question, so it seems worthwhile to explore the meaning.

That this is not necessarily a simple question is borne out by its use in the phrase exponential family, which has nothing to do with a family of curves but is related to changing the form of the PDF of a distribution by reparameterization, not only of parameters but also by substitution of functions of independent random variables.

Answer

The statistical and mathematical concepts are exactly the same, understanding that “family” is a generic mathematical term with technical variations adapted to different circumstances:

A parametric family is a curve (or surface or other finite-dimensional generalization thereof) in the space of all distributions.

The rest of this post explains what that means. As an aside, I don’t think any of this is controversial, either mathematically or statistically (apart from one minor issue which is noted below). In support of this opinion I have supplied many references (mostly to Wikipedia articles).


This terminology of “families” tends to be used when studying classes $\mathcal C_Y$ of functions into a set $Y$ or “maps.” Given a domain $X$, a family $\mathcal F$ of maps on $X$ parameterized by some set $\Theta$ (the “parameters”) is a function

$$\mathcal F : X\times \Theta\to Y$$

for which (1) for each $\theta\in\Theta$, the function $\mathcal{F}_\theta:X\to Y$ given by $\mathcal{F}_\theta(x)=\mathcal{F}(x,\theta)$ is in $\mathcal{C}_Y$ and (2) $\mathcal F$ itself has certain “nice” properties.

The idea is that we want to vary functions from $X$ to $Y$ in a “smooth” or controlled manner. Property (1) means that each $\theta$ designates such a function, while the details of property (2) will capture the sense in which a “small” change in $\theta$ induces a sufficiently “small” change in $\mathcal{F}_\theta$.

A standard mathematical example, close to the one mentioned in the question, is a homotopy. In this case $\mathcal{C}_Y$ is the category of continuous maps from topological spaces $X$ into the topological space $Y$; $\Theta=[0,1]\subset\mathbb{R}$ is the unit interval with its usual topology, and we require that $\mathcal{F}$ be a continuous map from the topological product $X \times \Theta$ into $Y$. It can be thought of as a “continuous deformation of the map $\mathcal{F}_0$ to $\mathcal{F}_1$.” When $X=[0,1]$ is itself an interval, such maps are curves in $Y$ and the homotopy is a continuous deformation from one curve to another.

For statistical applications, $\mathcal{C}_Y$ is the set of all distributions on $\mathbb{R}$ (or, in practice, on $\mathbb{R}^n$ for some $n$, but to keep the exposition simple I will focus on $n=1$). We may identify it with the set of all non-decreasing càdlàg functions $\mathbb{R}\to [0,1]$ where the closure of their range includes both $0$ and $1$: these are the cumulative distribution functions, or simply distribution functions. Thus, $X=\mathbb R$ and $Y=[0,1]$.
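
As a rough illustration of this identification, here is a minimal sketch that tests the defining properties of a distribution function on a finite grid. The names `normal_cdf` and `looks_like_cdf`, the grid limits, and the tolerances are assumptions made for the example. A grid check can only flag violations; it cannot prove that a function is a distribution function, and it does not test right-continuity.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Distribution function of the Normal distribution with mean mu and SD sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def looks_like_cdf(F, lo=-50.0, hi=50.0, steps=2001, tol=1e-9):
    """Grid-based sanity check (not a proof) of the defining properties.

    Checks that values lie in [0, 1], that F is non-decreasing on the grid,
    and that the tails approach 0 and 1. The grid is an assumption: a
    distribution with mass far outside [lo, hi] would fail the tail check.
    """
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    values = [F(x) for x in xs]
    in_range = all(0.0 <= v <= 1.0 for v in values)
    non_decreasing = all(a <= b + tol for a, b in zip(values, values[1:]))
    tails_ok = values[0] <= 1e-6 and values[-1] >= 1.0 - 1e-6
    return in_range and non_decreasing and tails_ok

if __name__ == "__main__":
    print(looks_like_cdf(normal_cdf))        # a genuine distribution function
    print(looks_like_cdf(math.sin))          # takes negative values
    print(looks_like_cdf(lambda x: 0.5))     # closure of range misses 0 and 1
```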

A family of distributions is any subset of $\mathcal{C}_Y$. Another name for a family is statistical model. It consists of all distributions that we suppose govern our observations, but we do not otherwise know which distribution is the actual one.

  • A family can be empty.
  • $\mathcal{C}_Y$ itself is a family.
  • A family may consist of a single distribution or just a finite number of them.

These abstract set-theoretic characteristics are of relatively little interest or utility. It is only when we consider additional (relevant) mathematical structure on $\mathcal{C}_Y$ that this concept becomes useful. But what properties of $\mathcal{C}_Y$ are of statistical interest? Some that show up frequently are:

  1. $\mathcal{C}_Y$ is a convex set: given any two distributions ${F}, {G}\in \mathcal{C}_Y$, we may form the mixture distribution $(1-t){F}+t{G}\in \mathcal{C}_Y$ for all $t\in[0,1]$. This is a kind of “homotopy” from $F$ to $G$.

  2. Large parts of $\mathcal{C}_Y$ support various pseudo metrics, such as the Kullback-Leibler divergence or the closely related Fisher Information metric.

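  3. $\mathcal{C}_Y$ has an additive structure: corresponding to any two distributions $F$ and $G$ is their sum, ${F}\star {G}$, which is the distribution of $U+V$ when $U\sim F$ and $V\sim G$ are independent (that is, the convolution of $F$ and $G$).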

  4. $\mathcal{C}_Y$ supports many useful, natural functions, often termed “properties.” These include any fixed quantile (such as the median) as well as the cumulants.

  5. $\mathcal{C}_Y$ is a subset of a function space. As such, it inherits many useful metrics, such as the sup norm ($L^\infty$ norm) given by $$||F-G||_\infty = \sup_{x\in\mathbb{R}}|F(x)-G(x)|.$$

  6. Natural group actions on $\mathbb R$ induce actions on $\mathcal{C}_Y$. The commonest actions are translations $T_\mu:x \to x+\mu$ and scalings $S_\sigma:x\to x\sigma$ for $\sigma\gt 0$. The effect these have on a distribution is to send $F$ to the distribution given by $F^{\mu,\sigma}(x) = F((x-\mu)/\sigma)$. These lead to the concepts of location-scale families and their generalizations. (I don’t supply a reference, because extensive Web searches turn up a variety of different definitions: here, at least, may be a tiny bit of controversy.)
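
Three of the structures listed above (mixtures, the sup norm, and the location-scale action) are easy to experiment with numerically. The following is a minimal sketch, not a reference implementation: the helper names `mixture`, `location_scale`, and `sup_distance` are invented for this example, the choice of a Normal and a logistic distribution is arbitrary, and the sup norm is only approximated from below on a finite grid.

```python
import math

def normal_cdf(x):
    """Standard Normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x):
    """Standard logistic distribution function (tanh form avoids overflow)."""
    return 0.5 * (1.0 + math.tanh(x / 2.0))

def mixture(F, G, t):
    """Convexity: return the distribution function (1 - t) F + t G."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return lambda x: (1.0 - t) * F(x) + t * G(x)

def location_scale(F, mu, sigma):
    """Group action: return x -> F((x - mu) / sigma) for sigma > 0."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return lambda x: F((x - mu) / sigma)

def sup_distance(F, G, lo=-20.0, hi=20.0, steps=4001):
    """Grid approximation (a lower bound) to sup_x |F(x) - G(x)|."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(F(x) - G(x)) for x in xs)

if __name__ == "__main__":
    d = sup_distance(normal_cdf, logistic_cdf)
    quarter = mixture(normal_cdf, logistic_cdf, 0.25)
    # Moving a fraction t along the mixture path moves F by t times the distance.
    print(d, sup_distance(normal_cdf, quarter))
    # The shifted and rescaled Normal has its median at mu.
    print(location_scale(normal_cdf, 2.0, 3.0)(2.0))
```

The mixture path is a straight line in the function space, which is why the sup distance from $F$ to $(1-t)F+tG$ scales linearly in $t$.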

The properties that matter depend on the statistical problem and on how you intend to analyze the data. Addressing all the variations suggested by the preceding characteristics would take too much space for this medium. Let’s focus on one common important application.

Take, for instance, Maximum Likelihood. In most applications you will want to be able to use Calculus to obtain an estimate. For this to work, you must be able to “take derivatives” in the family.

(Technical aside: The usual way in which this is accomplished is to select a domain $\Theta\subset \mathbb{R}^d$ for $d\ge 0$ and specify a continuous, locally invertible function $p$ from $\Theta$ into $\mathcal{C}_Y$. (This means that for every $\theta\in\Theta$ there exists a ball $B(\theta, \epsilon)$, with $\epsilon\gt 0$ for which $p\mid_{B(\theta,\epsilon)}: B(\theta,\epsilon)\cap \Theta \to \mathcal{C}_Y$ is one-to-one. In other words, if we alter $\theta$ by a sufficiently small amount we will always get a different distribution.))

Consequently, in most Maximum Likelihood applications we require that $p$ be continuous (and hopefully, almost everywhere differentiable) in the $\Theta$ component. (Without continuity, maximizing the likelihood generally becomes an intractable problem.) This leads to the following likelihood-oriented definition of a parametric family:

A parametric family of (univariate) distributions is a locally invertible map $$\mathcal{F}:\mathbb{R}\times\Theta \to [0,1],$$ with $\Theta\subset \mathbb{R}^n$, for which (a) each $\mathcal{F}_\theta$ is a distribution function and (b) for each $x\in\mathbb R$, the function $\mathcal{L}_x: \theta\to [0,1]$ given by $\mathcal{L}_x(\theta) = \mathcal{F}(x,\theta)$ is continuous and almost everywhere differentiable.
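
To make this definition concrete, here is a small sketch that writes the Normal family as a single function of $(x,\theta)$ with $\theta=(\mu,\sigma)$ and differentiates $\mathcal{L}_x$ numerically in each parameter. The function names and the central-difference step `h` are assumptions chosen for illustration; the closed-form derivatives used for comparison follow from the chain rule applied to $\Phi((x-\mu)/\sigma)$.

```python
import math

def normal_family(x, theta):
    """F(x, theta) for theta = (mu, sigma) with sigma > 0."""
    mu, sigma = theta
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu, sigma):
    """Density of the Normal distribution with mean mu and SD sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def partial_derivative(family, x, theta, index, h=1e-6):
    """Central-difference estimate of d/d theta[index] of theta -> family(x, theta)."""
    up, down = list(theta), list(theta)
    up[index] += h
    down[index] -= h
    return (family(x, tuple(up)) - family(x, tuple(down))) / (2.0 * h)

if __name__ == "__main__":
    x, theta = 1.0, (0.5, 2.0)
    mu, sigma = theta
    # Chain rule: dF/dmu = -pdf(x), dF/dsigma = -((x - mu) / sigma) * pdf(x).
    print(partial_derivative(normal_family, x, theta, 0), -normal_pdf(x, mu, sigma))
    print(partial_derivative(normal_family, x, theta, 1),
          -((x - mu) / sigma) * normal_pdf(x, mu, sigma))
```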

Note that a parametric family $\mathcal F$ is more than just the collection of $\mathcal{F}_\theta$: it also includes the specific way in which parameter values $\theta$ correspond to distributions.

Let’s end with some illustrative examples.

  • Let $\mathcal{C}_Y$ be the set of all Normal distributions. As
    given, this is not a parametric family: it’s just a family. To be
    parametric, we have to choose a parameterization. One way is to
    choose $\Theta = \{(\mu,\sigma)\in\mathbb{R}^2\mid \sigma \gt 0\}$
    and to map $(\mu,\sigma)$ to the Normal distribution with mean $\mu$
    and variance $\sigma^2$.

  • The set of Poisson$(\lambda)$ distributions is a parametric family
    with $\lambda\in\Theta=(0,\infty)\subset\mathbb{R}^1$.

  • The set of Uniform$(\theta, \theta+1)$ distributions (which features
    prominently in many textbook exercises) is a parametric family with
    $\theta\in\mathbb{R}^1$. In this case, $F_\theta(x) = \max(0,
    \min(1, x-\theta))$ is differentiable in $\theta$ except for
    $\theta\in\{x, x-1\}$.

  • Let $F$ and $G$ be any two distributions. Then $\mathcal{F}(x,\theta)=(1-\theta)F(x)+\theta G(x)$ is a parametric family for $\theta\in[0,1]$. (Proof: the image of $\mathcal F$ is a set of distributions and its partial derivative in $\theta$ equals $-F(x)+G(x)$ which is defined everywhere.)

  • The Pearson family is a four-dimensional family, $\Theta\subset\mathbb{R}^4$, which includes (among others) the Normal distributions, Beta distributions, and Inverse Gamma distributions. This illustrates the fact that any one given distribution may belong to many different distribution families. This is perfectly analogous to observing that any point in a (sufficiently large) space may belong to many paths that intersect there. This, together with the previous construction, shows us that no distribution uniquely determines a family to which it belongs.

  • The family $\mathcal{C}_Y$ of all finite-variance absolutely continuous distributions is not parametric. The proof requires a deep theorem of topology: if we endow $\mathcal{C}_Y$ with any topology (whether statistically useful or not) and $p: \Theta\to\mathcal{C}_Y$ is continuous and locally has a continuous inverse, then locally $\mathcal{C}_Y$ must have the same dimension as that of $\Theta$. However, in all statistically meaningful topologies, $\mathcal{C}_Y$ is infinite dimensional.
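
Two of the examples above can be checked numerically. The sketch below compares one-sided difference quotients in $\theta$ for the Uniform$(\theta,\theta+1)$ family, where they disagree exactly at $\theta\in\{x, x-1\}$, and for a mixture family, where they agree everywhere. The helper `one_sided_slopes`, the step size, and the choice of the Uniform$(0,1)$ and Uniform$(1,2)$ distributions as the mixture components are assumptions made only for this illustration.

```python
def uniform_family(x, theta):
    """Distribution function of Uniform(theta, theta + 1) evaluated at x."""
    return max(0.0, min(1.0, x - theta))

def mixture_family(x, theta):
    """(1 - theta) F(x) + theta G(x) with F = Uniform(0, 1), G = Uniform(1, 2)."""
    F = uniform_family(x, 0.0)
    G = uniform_family(x, 1.0)
    return (1.0 - theta) * F + theta * G

def one_sided_slopes(family, x, theta, h=1e-6):
    """Left and right difference quotients of theta -> family(x, theta)."""
    here = family(x, theta)
    left = (here - family(x, theta - h)) / h
    right = (family(x, theta + h) - here) / h
    return left, right

if __name__ == "__main__":
    x = 0.5
    for theta in (0.5, -0.5, 0.0):   # theta = x, theta = x - 1, interior point
        print(theta, one_sided_slopes(uniform_family, x, theta))
    # For the mixture the slope is G(x) - F(x) regardless of theta.
    print(one_sided_slopes(mixture_family, 1.5, 0.4))
```

Where the two quotients differ, $\mathcal{L}_x$ has a kink and no derivative; since this happens at only two values of $\theta$ for each $x$, the family still satisfies the "almost everywhere differentiable" condition.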

Attribution
Source: Link, Question Author: Carl, Answer Author: whuber
