Intuition for why likelihood function sometimes *is* a PDF

The likelihood function is not in general a PDF (there have been many questions about this). E.g., if we take the binomial likelihood, P(\text{Evidence} \mid \theta) = f(\theta) = {n \choose k} \theta^k (1-\theta)^{n-k}, it does not integrate to 1: the integral over \theta \in [0,1] equals 1/(n+1). In general (and in particular for, say, n=2 and k=1):
\int_{0}^1 f(\theta) d\theta \neq 1
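A quick numerical check of this claim (a sketch I am adding for illustration, not part of the original question) confirms that for n=2, k=1 the binomial likelihood integrates to 1/(n+1) = 1/3:

```python
from math import comb

def likelihood(theta, n, k):
    """Binomial likelihood of k successes in n trials, as a function of theta."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule numerical integration of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

n, k = 2, 1
total = integrate(lambda t: likelihood(t, n, k), 0.0, 1.0)
print(round(total, 4))  # ~0.3333 = 1/(n+1), so this likelihood is not a density in theta
```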

But I believe in some cases the likelihood does integrate to 1. For example, if the likelihood function is a normal PDF in \theta, as in the normal-normal conjugate prior setup: then P(\text{Evidence} \mid \theta)=f(\theta)=\operatorname{NormalPDF}_{\mu,\sigma}(\theta) and \int_{\mathbb{R}} f(\theta)\,\mathrm{d}\theta = 1.

Is there an intuitive explanation for the fact that this particular likelihood function is a PDF? Even better, can someone give insightful necessary and sufficient conditions for a likelihood function being a PDF?

Answer

The purpose of this answer is to show that the situation is so rich and complicated that it’s unlikely there exists any simple characterization of such distributional families.

I will first show, by construction, that there are many such families and they are flexible and varied. Then I will show that even this construction doesn’t cover the gamut of possibilities. In this process, though, we might improve our intuition about what it means for the likelihood of a single real parameter to be a density function.


When \theta can range over all the real numbers and is a location parameter — that is, when the distribution functions are all of the form f(x-\theta) for some density f — it is easy to see that integrating over the parameter \theta gives the constant value 1.
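To spell this out in one line: substituting u = x-\theta (so \mathrm{d}u = -\mathrm{d}\theta, with the orientation of the integral flipping to absorb the sign) gives

\int_{\mathbb{R}}f(x-\theta)\,\mathrm{d}\theta = \int_{\mathbb{R}}f(u)\,\mathrm{d}u = 1

for every x.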

Let’s play with this a little. What if, for instance, we were to take two distinct densities f_1 and f_2 and let \theta play the role of a location parameter for each one of them, but in two different ways? For instance, form the family of functions

f(x,\theta) = a_1f_1(x-2\theta) + a_2f_2(x-\theta/2)

where the a_i are to be determined. By the simple substitutions x=y+2\theta and x=y+\theta/2, compute that

\begin{aligned}
\int_{\mathbb{R}}f(x,\theta)\,\mathrm{d}x &= \int_{\mathbb{R}}a_1f_1(x-2\theta)\,\mathrm{d}x + \int_{\mathbb{R}}a_2f_2(x-\theta/2)\,\mathrm{d}x\\
&= a_1\int_{\mathbb{R}}f_1(y)\,\mathrm{d}y + a_2\int_{\mathbb{R}}f_2(y)\,\mathrm{d}y\\
&= a_1+a_2.
\end{aligned}

Thus, provided a_1+a_2 = 1 and f(x,\theta)\ge 0 for all x, x\mapsto f(x,\theta) is a probability density. When we integrate over the parameter \theta we obtain, using the same method of substitution (this time \theta=(x-y)/2 and \theta=2(x-y)),

\begin{aligned}
\int_{\mathbb{R}}f(x,\theta)\,\mathrm{d}\theta &= \int_{\mathbb{R}}a_1f_1(x-2\theta)\,\mathrm{d}\theta + \int_{\mathbb{R}}a_2f_2(x-\theta/2)\,\mathrm{d}\theta \\
&= \frac{1}{2}a_1\int_{\mathbb{R}}f_1(y)\,\mathrm{d}y + 2a_2\int_{\mathbb{R}}f_2(y)\,\mathrm{d}y\\
&= \frac{1}{2}a_1+2a_2.
\end{aligned}

By setting a_1=2/3 and a_2=1/3 we can make this result unity for all x as well as guaranteeing f has no negative values, thereby satisfying the conditions of the problem. With some care we can also make this family of distributions identifiable in the sense that each \theta determines a unique distribution, as I will show by example. However, \theta is not a location parameter.
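Both properties can be checked numerically (a sketch I am adding for illustration, using the concrete f_1 and f_2 of the example below: a Normal density with mean 0 and variance 1/3, and the Uniform[0,1] density):

```python
from math import exp, pi, sqrt

A1, A2 = 2/3, 1/3
VAR = 1/3  # variance of the normal component

def f1(y):
    """Normal(0, 1/3) density."""
    return exp(-y**2 / (2 * VAR)) / sqrt(2 * pi * VAR)

def f2(y):
    """Uniform[0, 1] density."""
    return 1.0 if 0.0 <= y < 1.0 else 0.0

def f(x, theta):
    """The two-component family: theta shifts f1 by 2*theta and f2 by theta/2."""
    return A1 * f1(x - 2 * theta) + A2 * f2(x - theta / 2)

def integrate(g, a, b, steps=200_000):
    """Midpoint-rule numerical integration of g over [a, b]."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

dx = integrate(lambda x: f(x, 0.7), -20, 20)      # integral over x at fixed theta
dtheta = integrate(lambda t: f(1.3, t), -20, 20)  # integral over theta at fixed x
print(round(dx, 3), round(dtheta, 3))  # both ~1.0
```

The second integral works out because the f_1 term contributes a_1/2 = 1/3 and the f_2 term contributes 2a_2 = 2/3, exactly as in the derivation above.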

An example illustrates why not. Let f_2 be the Uniform[0,1] density and f_1 be a Normal density with variance 1/3 and mean 0. Here are some plots of f for various values of \theta:

[Figure 1: plots of f(x,\theta) for several values of \theta]

As \theta increases (from left to right), the rectangular part of the density (the Uniform component) marches slowly rightward while the curved part of the density (the Normal component) marches rightward four times faster. The resulting distributions are all obviously different. Effectively, \theta does determine a “location” of sorts, but it also determines the shape of the distribution. That’s why it’s not a location parameter.

This construction can be vastly generalized to create rich, flexible families of distributions having all the properties in the question but (in general) not being location families. For completeness, I will give the details before proceeding with the main question.

Let f:\mathbb{R}\times\mathbb{R}\to[0,\infty) be any family of density functions indexed by a real number \lambda; that is, for every \lambda,

\int_{\mathbb{R}}f(x,\lambda)\,\mathrm{d}x = 1.

Consider any distribution function G supported on the nonnegative real numbers and use it to form the family \mathcal G of functions g:\mathbb{R}\times\mathbb{R}\to[0,\infty) via

g(x,\theta) = \int_0^\infty f\left(x - \frac{\theta}{\lambda},\,\lambda\right)\,\mathrm{d}G(\lambda).

For each \theta this gives a density function because obviously g(x,\theta)\ge 0 and

\begin{aligned}
\int_\mathbb{R}g(x,\theta)\,\mathrm{d}x &= \int_\mathbb{R}\int_0^\infty f\left(x - \frac{\theta}{\lambda},\,\lambda\right)\,\mathrm{d}G(\lambda)\,\mathrm{d}x\\
&= \int_0^\infty \int_\mathbb{R}f\left(x - \frac{\theta}{\lambda},\,\lambda\right)\,\mathrm{d}x\,\mathrm{d}G(\lambda)\\
&= \int_0^\infty (1)\,\mathrm{d}G(\lambda)\\
&= 1.
\end{aligned}

Integrating instead over \theta using the substitution \theta=y\lambda yields

\begin{aligned}
\int_\mathbb{R}g(x,\theta)\,\mathrm{d}\theta&= \int_\mathbb{R}\int_0^\infty f\left(x - \frac{\theta}{\lambda},\,\lambda\right)\,\mathrm{d}G(\lambda)\,\mathrm{d}\theta\\
&= \int_0^\infty \int_\mathbb{R}f\left(x - \frac{\theta}{\lambda},\,\lambda\right)\,\mathrm{d}\theta\,\mathrm{d}G(\lambda)\\
&= \int_0^\infty \int_\mathbb{R}f\left(x - y,\,\lambda\right)\,\mathrm{d}\left(y\lambda\right)\,\mathrm{d}G(\lambda)\\
&= \int_0^\infty \lambda\int_\mathbb{R}f\left(x - y,\,\lambda\right)\,\mathrm{d}y\,\mathrm{d}G(\lambda)\\
&= \int_0^\infty\lambda\,\mathrm{d}G(\lambda).
\end{aligned}

If we further stipulate that the expectation of G is unity, this shows that the family \mathcal G satisfies the conditions of the question. However, except in special cases, \theta is not a location parameter.
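Here is a numerical check of this construction (a sketch I am adding for illustration; the specific choices of f and G are hypothetical). Take f(x,\lambda) to be a Normal density with mean 0 and standard deviation \lambda, and let G put equal point masses on \lambda = 1/2 and \lambda = 3/2, so that \int \lambda\,\mathrm{d}G(\lambda) = 1:

```python
from math import exp, pi, sqrt

def normal_pdf(y, sd):
    """Normal(0, sd^2) density."""
    return exp(-y**2 / (2 * sd**2)) / (sd * sqrt(2 * pi))

def f(x, lam):
    """For each lam, a density in x (here Normal with sd = lam, chosen for illustration)."""
    return normal_pdf(x, sd=lam)

# G: equal point masses at lam = 1/2 and lam = 3/2, so E_G[lam] = 1
G = [(0.5, 0.5), (1.5, 0.5)]  # (lam, weight) pairs

def g(x, theta):
    """The mixture g(x, theta) = sum over G of f(x - theta/lam, lam)."""
    return sum(w * f(x - theta / lam, lam) for lam, w in G)

def integrate(h, a, b, steps=200_000):
    """Midpoint-rule numerical integration of h over [a, b]."""
    step = (b - a) / steps
    return sum(h(a + (i + 0.5) * step) for i in range(steps)) * step

gx = integrate(lambda x: g(x, 0.4), -30, 30)      # integral over x at fixed theta
gtheta = integrate(lambda t: g(1.1, t), -30, 30)  # integral over theta at fixed x
print(round(gx, 3), round(gtheta, 3))  # both ~1.0
```

The integral over \theta comes out to 1 because each component contributes its weight times \lambda: 0.5 \cdot 0.5 + 0.5 \cdot 1.5 = 1, matching \int\lambda\,\mathrm{d}G(\lambda) in the derivation above.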


Let’s consider the natural follow-up question: when the likelihood is a PDF in the sense of the question, can we always represent the family as a mixture in the foregoing sense?

Unfortunately the answer is no. As a counterexample, consider the family of distribution functions given by

f(x,\theta) = 2\left(\left\{\theta\right\} + \left(x - \lfloor \theta \rfloor\right) - 2 \left\{\theta\right\}\left(x - \lfloor \theta \rfloor\right)\right) \quad\text{for } \lfloor \theta \rfloor \le x < \lfloor \theta \rfloor + 1 \text{ (and } 0 \text{ otherwise)}

where \lfloor \theta \rfloor is the greatest integer less than or equal to \theta and \left\{\theta\right\} = \theta - \lfloor \theta \rfloor is the fractional part of \theta (lying in the interval [0,1)).

This strange-looking function describes distributions defined on intervals [n,n+1) (where n = \lfloor \theta \rfloor) whose shapes vary with the fractional part of \theta. Here are some of their densities:

[Figure 2: densities f(x,\theta) for several values of \theta]

Here is a plot of f:

[Figure 3: density plot of f as a function of (x,\theta), showing a checkered pattern of bright patches where f is nonzero]
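Before examining the geometry, we can verify numerically that this family really does satisfy both integral conditions (a sketch I am adding for illustration):

```python
from math import floor

def f(x, theta):
    """The counterexample family: a linear density on [floor(theta), floor(theta)+1)
    whose slope is controlled by the fractional part of theta."""
    n = floor(theta)
    t = theta - n  # fractional part {theta}
    u = x - n
    if not (0.0 <= u < 1.0):  # supported only on [n, n+1)
        return 0.0
    return 2 * (t + u - 2 * t * u)

def integrate(h, a, b, steps=100_000):
    """Midpoint-rule numerical integration of h over [a, b]."""
    step = (b - a) / steps
    return sum(h(a + (i + 0.5) * step) for i in range(steps)) * step

hx = integrate(lambda x: f(x, 2.7), 2, 3)      # integral over x at fixed theta
htheta = integrate(lambda t: f(2.3, t), 2, 3)  # integral over theta at fixed x
print(round(hx, 3), round(htheta, 3))  # both ~1.0
```

Both integrals reduce to \int_0^1 2(t + u - 2tu)\,\mathrm{d}u = 1 (and symmetrically in t), so f is a density in x for each \theta and a density in \theta for each x.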

Now if this family had a location parameter \mu = \mu(\theta), we would be able to express each f(x,\theta) as a fixed function of x-\mu(\theta). Its level sets (contours) would therefore be unions of lines of the form x-\mu=\text{constant}; that is, of lines with 45-degree slopes. Geometrically, this means we could stretch and compress this image purely in the vertical (\theta) direction until its bright patches, where the density is nonzero, become a slanted band with parallel linear contours.

No matter how we might re-express the parameter \theta (in a continuous fashion), obviously there’s no way it can change this checkered pattern into such an image.

Attribution
Source: Link, Question Author: tmkadamcz, Answer Author: whuber