# Intuition for why likelihood function sometimes *is* a PDF

The likelihood function is not, in general, a PDF (there have been many questions on this). For example, the binomial likelihood $$P(Evidence \mid \theta) = f(\theta) = {n \choose k} \theta^k (1-\theta)^{n-k}$$ does not integrate to 1 in general (say, for $$n=3$$ and $$k=2$$):
$$\int_{0}^1 f(\theta) d\theta \neq 1$$
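As a quick numerical illustration of this point, here is a minimal sketch using only the Python standard library. The function names and the midpoint-rule integrator are illustrative choices, not part of the question. It compares the area under the binomial likelihood with the closed form $$1/(n+1)$$, which follows from the Beta integral.

```python
from math import comb


def binomial_likelihood(theta, n, k):
    """Binomial likelihood of theta given k successes in n trials."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def integrate_midpoint(func, a, b, steps=100_000):
    """Plain midpoint rule; accurate enough for smooth integrands."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


def likelihood_area(n, k):
    """Area under the likelihood as a function of theta on [0, 1]."""
    return integrate_midpoint(lambda t: binomial_likelihood(t, n, k), 0.0, 1.0)


if __name__ == "__main__":
    for n, k in [(3, 2), (10, 4)]:
        # The numerical area should agree with 1 / (n + 1), not with 1.
        print(n, k, likelihood_area(n, k), 1 / (n + 1))
```

The area depends on $$n$$ but not on $$k$$, and it equals 1 only in the degenerate case $$n=0$$.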

But I believe that in some cases the likelihood does integrate to 1, for example when the likelihood function is a normal PDF, as in a normal-normal conjugate prior setup. Then $$P(Evidence \mid \theta)=f(\theta)=NormalPDF_{\mu,\sigma}(\theta)$$ and $$\int_{\mathbb{R}} f(\theta)d\theta = 1$$.

Is there an intuitive explanation for the fact that this particular likelihood function is a PDF? Even better, can someone give insightful necessary and sufficient conditions for a likelihood function being a PDF?

The purpose of this answer is to show that the situation is so rich and complicated that it’s unlikely there exists any simple characterization of such distributional families.

I will first show, by construction, that there are many such families and they are flexible and varied. Then I will show that even this construction doesn’t cover the gamut of possibilities. In this process, though, we might improve our intuition about what it means for the likelihood of a single real parameter to be a density function.

When $$\theta$$ can range over all the real numbers and is a location parameter (that is, when the distribution functions are all of the form $$f(x-\theta)$$ for some density $$f$$), it is easy to see that integrating over the parameter $$\theta$$ gives the constant value $$1.$$
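To spell out that step (a short sketch, assuming only that $$f$$ is a density on the whole real line): substituting $$y = x-\theta,$$ so that $$\mathrm{d}\theta = -\mathrm{d}y$$ and the limits of integration swap, gives

$$\int_{\mathbb{R}} f(x-\theta)\,\mathrm{d}\theta = \int_{\mathbb{R}} f(y)\,\mathrm{d}y = 1$$

for every fixed $$x.$$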

Let’s play with this a little. What if, for instance, we were to take two distinct densities $$f_1$$ and $$f_2$$ and let $$\theta$$ play the role of a location parameter for each one of them, but in two different ways? For instance, form the family of functions

$$f(x,\theta) = a_1f_1(x-2\theta) + a_2f_2(x-\theta/2)$$

where the $$a_i$$ are to be determined. By the simple substitutions $$x=y+2\theta$$ and $$x=y+\theta/2,$$ compute that

$$\begin{aligned}
\int_{\mathbb{R}}f(x,\theta)\,\mathrm{d}x &= \int_{\mathbb{R}}a_1f_1(x-2\theta)\,\mathrm{d}x + \int_{\mathbb{R}}a_2f_2(x-\theta/2)\,\mathrm{d}x\\
&= a_1\int_{\mathbb{R}}f_1(y)\,\mathrm{d}y + a_2\int_{\mathbb{R}}f_2(y)\,\mathrm{d}y\\
&= a_1+a_2.
\end{aligned}$$

Thus, provided $$a_1+a_2 = 1$$ and $$f(x,\theta)\ge 0$$ for all $$x,$$ the function $$x\to f(x,\theta)$$ is a probability density. When we integrate over the parameter $$\theta$$ we obtain, using the same method with the substitutions $$\theta=(x-y)/2$$ and $$\theta=2(x-y),$$

$$\begin{aligned}
\int_{\mathbb{R}}f(x,\theta)\,\mathrm{d}\theta &= \int_{\mathbb{R}}a_1f_1(x-2\theta)\,\mathrm{d}\theta + \int_{\mathbb{R}}a_2f_2(x-\theta/2)\,\mathrm{d}\theta \\
&= \frac{1}{2}a_1\int_{\mathbb{R}}f_1(y)\,\mathrm{d}y + 2a_2\int_{\mathbb{R}}f_2(y)\,\mathrm{d}y\\
&= \frac{1}{2}a_1+2a_2.
\end{aligned}$$

By setting $$a_1=2/3$$ and $$a_2=1/3$$ we can make this result unity for all $$x$$ as well as guaranteeing $$f$$ has no negative values, thereby satisfying the conditions of the problem. With some care we can also make this family of distributions identifiable in the sense that each $$\theta$$ determines a unique distribution, as I will show by example. However, $$\theta$$ is not a location parameter.

An example illustrates why not. Let $$f_2$$ be the Uniform$$[0,1]$$ density and $$f_1$$ be a Normal density with variance $$1/3$$ and mean $$0.$$ Here are some plots of $$f$$ for various values of $$\theta:$$

[Figure: plots of $$f$$ for several values of $$\theta$$]

As $$\theta$$ increases (from left to right), the rectangular part of the density (the Uniform component) marches slowly rightward while the curved part of the density (the Normal component) marches rightward four times faster. The resulting distributions are all obviously different. Effectively, $$\theta$$ does determine a “location” of sorts, but it also determines the shape of the distribution. That’s why it’s not a location parameter.
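The claims about this example can be checked numerically. The following is a minimal sketch using only the Python standard library. The function names, the midpoint rule, and the integration window of $$[-20, 20]$$ are illustrative assumptions, adequate only for moderate values of $$x$$ and $$\theta.$$

```python
from math import exp, pi, sqrt

A1, A2 = 2 / 3, 1 / 3  # mixture weights from the text


def normal_pdf(z, variance=1 / 3):
    """Normal density with mean 0 and the given variance (this is f_1)."""
    return exp(-z * z / (2 * variance)) / sqrt(2 * pi * variance)


def uniform_pdf(z):
    """Uniform[0, 1] density (this is f_2)."""
    return 1.0 if 0.0 <= z <= 1.0 else 0.0


def f(x, theta):
    """f(x, theta) = a_1 f_1(x - 2 theta) + a_2 f_2(x - theta / 2)."""
    return A1 * normal_pdf(x - 2 * theta) + A2 * uniform_pdf(x - theta / 2)


def integrate_midpoint(func, a, b, steps=40_000):
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


def area_over_x(theta, lo=-20.0, hi=20.0):
    """Integral of f in x for fixed theta: f is a density in x."""
    return integrate_midpoint(lambda x: f(x, theta), lo, hi)


def area_over_theta(x, lo=-20.0, hi=20.0):
    """Integral of f in theta for fixed x: the likelihood is a density too."""
    return integrate_midpoint(lambda theta: f(x, theta), lo, hi)


if __name__ == "__main__":
    for value in (-1.0, 0.0, 1.5):
        # Both areas should be close to 1, up to discretization error.
        print(value, area_over_x(value), area_over_theta(value))
```

Both integrals come out close to 1 because $$a_1+a_2 = 1$$ and $$\tfrac{1}{2}a_1+2a_2 = 1$$ hold simultaneously for these weights.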

This construction can be vastly generalized to create rich, flexible families of distributions having all the properties in the question but (in general) not being location families. For completeness, I will give the details before proceeding with the main question.

Let $$f:\mathbb{R}\times\mathbb{R}\to[0,\infty)$$ be any integrable family of distribution functions; that is, for all numbers $$\lambda$$

$$\int_{\mathbb{R}}f(x,\lambda)\,\mathrm{d}x = 1.$$

Consider any distribution function $$G$$ supported on the nonnegative real numbers and use it to form the family $$\mathcal G$$ of functions $$g:\mathbb{R}\times\mathbb{R}\to[0,\infty)$$ via

$$g(x,\theta) = \int_0^\infty f\left(x - \frac{\theta}{\lambda}\right)\,\mathrm{d}G(\lambda).$$

For each $$\theta$$ this gives a density function because obviously $$g(x,\theta)\ge 0$$ and

$$\begin{aligned}
\int_\mathbb{R}g(x,\theta)\,\mathrm{d}x &= \int_\mathbb{R}\int_0^\infty f\left(x - \frac{\theta}{\lambda}\right)\,\mathrm{d}G(\lambda)\,\mathrm{d}x\\
&= \int_0^\infty \int_\mathbb{R}f\left(x - \frac{\theta}{\lambda}\right)\,\mathrm{d}x\,\mathrm{d}G(\lambda)\\
&= \int_0^\infty (1)\,\mathrm{d}G(\lambda)\\
&= 1.
\end{aligned}$$

Integrating instead over $$\theta$$ using the substitution $$\theta=y\lambda$$ yields

$$\begin{aligned}
\int_\mathbb{R}g(x,\theta)\,\mathrm{d}\theta&= \int_\mathbb{R}\int_0^\infty f\left(x - \frac{\theta}{\lambda}\right)\,\mathrm{d}G(\lambda)\,\mathrm{d}\theta\\
&= \int_0^\infty \int_\mathbb{R}f\left(x - \frac{\theta}{\lambda}\right)\,\mathrm{d}\theta\,\mathrm{d}G(\lambda)\\
&= \int_0^\infty \int_\mathbb{R}f\left(x - \frac{y\lambda}{\lambda}\right)\,\mathrm{d}\left(y\lambda\right)\,\mathrm{d}G(\lambda)\\
&= \int_0^\infty \int_\mathbb{R}f\left(x - y\right)\,\mathrm{d}y\,\lambda\,\mathrm{d}G(\lambda)\\
&= \int_0^\infty\lambda\,\mathrm{d}G(\lambda).
\end{aligned}$$

If we further stipulate that the expectation of $$G$$ is unity, this shows that the family $$\mathcal G$$ satisfies the conditions of the question. However, except in special cases, $$\theta$$ is not a location parameter.
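A small numerical sketch of this construction may help. It assumes, purely for illustration, that $$f$$ is the standard Normal density and that $$G$$ is a discrete distribution with atoms at $$\lambda = 0.5, 1, 2$$ and weights $$0.4, 0.4, 0.2$$ (so its expectation is 1). None of these choices, nor the function names, come from the text.

```python
from math import exp, pi, sqrt

# Hypothetical discrete mixing distribution G: atoms and their probabilities.
LAMBDAS = (0.5, 1.0, 2.0)
WEIGHTS = (0.4, 0.4, 0.2)


def standard_normal_pdf(z):
    """Illustrative choice for the base density f."""
    return exp(-z * z / 2) / sqrt(2 * pi)


def mean_of_G():
    """Expectation of G; the construction needs this to equal 1."""
    return sum(w * lam for w, lam in zip(WEIGHTS, LAMBDAS))


def g(x, theta):
    """g(x, theta) = sum over atoms of w * f(x - theta / lambda)."""
    return sum(
        w * standard_normal_pdf(x - theta / lam)
        for w, lam in zip(WEIGHTS, LAMBDAS)
    )


def integrate_midpoint(func, a, b, steps=40_000):
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


def area_over_theta(x, lo=-40.0, hi=40.0):
    """Integral of g in theta for fixed x; should match mean_of_G()."""
    return integrate_midpoint(lambda theta: g(x, theta), lo, hi)


if __name__ == "__main__":
    print("E[lambda] =", mean_of_G())
    for x in (-1.0, 0.0, 0.7):
        print(x, area_over_theta(x))
```

Changing the weights so that the expectation of $$G$$ differs from 1 changes the area over $$\theta$$ by the same factor, which is the content of the last line of the derivation above.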

Let’s consider the natural follow-up question: when the likelihood is a PDF in the sense of the question, can we always represent the family as a mixture in the foregoing sense?

Unfortunately the answer is no. As a counterexample, consider the family of distribution functions given by

$$f(x,\theta) = 2\left(\left\{\theta\right\} + \left(x - \lfloor \theta \rfloor\right) - 2 \left\{\theta\right\}\left(x - \lfloor \theta \rfloor\right)\right)$$

where $$\lfloor \theta \rfloor$$ is the greatest integer less than or equal to $$\theta$$ and $$\left\{\theta\right\} = \theta - \lfloor \theta \rfloor$$ is the fractional part of $$\theta$$ (lying in the interval $$[0,1)$$).
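As a sanity check, here is a minimal sketch (standard library only) that evaluates this family and integrates it numerically in each variable. It assumes the density is zero whenever $$x$$ falls outside the unit interval starting at $$\lfloor \theta \rfloor$$. The function names and the integration window are illustrative.

```python
from math import floor


def f(x, theta):
    """Counterexample density; taken to be zero outside [floor(theta), floor(theta) + 1)."""
    n = floor(theta)
    t = theta - n  # fractional part of theta, in [0, 1)
    u = x - n      # position of x within the unit interval
    if not (0.0 <= u < 1.0):
        return 0.0
    return 2.0 * (t + u - 2.0 * t * u)


def integrate_midpoint(func, a, b, steps=10_000):
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


def area_over_x(theta, lo=-5.0, hi=5.0):
    """Integral of f in x for fixed theta."""
    return integrate_midpoint(lambda x: f(x, theta), lo, hi)


def area_over_theta(x, lo=-5.0, hi=5.0):
    """Integral of f in theta for fixed x."""
    return integrate_midpoint(lambda theta: f(x, theta), lo, hi)


if __name__ == "__main__":
    for value in (-2.6, 0.25, 2.75):
        # Both areas should be close to 1.
        print(value, area_over_x(value), area_over_theta(value))
```

Writing $$t = \left\{\theta\right\}$$ and $$u = x - \lfloor \theta \rfloor,$$ the integrand is $$2(t + u - 2tu) = 2\left(t(1-u) + u(1-t)\right) \ge 0,$$ and integrating either variable over $$[0,1)$$ gives $$2\left(\tfrac{1}{2} + c - c\right) = 1$$ where $$c$$ is the other variable.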

This strange-looking function describes distributions defined on intervals $$[n,n+1)$$ (where $$n = \lfloor \theta \rfloor$$) that vary according to the fractional part of $$\theta.$$ Here are some of their densities:

[Figure: densities $$x\to f(x,\theta)$$ for several values of $$\theta$$]

Here is a plot of $$f:$$

[Figure: plot of $$f(x,\theta)$$ over the $$(x,\theta)$$ plane]

Now if this family had a location parameter $$\mu = \mu(\theta),$$ we would be able to express each $$f(x,\theta)$$ as a fixed function of $$x-\mu(\theta).$$ Its level sets (contours) would therefore be unions of lines of the form $$x-\mu=\text{constant};$$ that is, of lines with 45 degree slopes. Geometrically, this means we can stretch and compress this image purely in the vertical ($$\theta$$) direction until its bright patches, where the density is nonzero, become a slanted band with parallel linear contours.

No matter how we might re-express the parameter $$\theta$$ (in a continuous fashion), obviously there’s no way it can change this checkered pattern into such an image.