A Measure Theoretic Formulation of Bayes’ Theorem

I am trying to find a measure-theoretic formulation of Bayes’ theorem. When used in statistical inference, Bayes’ theorem is usually stated as:

p\left(\theta|x\right) = \frac{p\left(x|\theta\right) \cdot p\left(\theta\right)}{p\left(x\right)}

where:

  • p\left(\theta|x\right): the posterior density of the parameter.
  • p\left(x|\theta\right): the statistical model (or likelihood).
  • p\left(\theta\right): the prior density of the parameter.
  • p\left(x\right): the evidence.
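
For concreteness, a standard example: with a uniform prior p\left(\theta\right) = 1 on \left[0, 1\right] and a single Bernoulli observation, so that p\left(x|\theta\right) = \theta^x \left(1 - \theta\right)^{1 - x} for x \in \{0, 1\}, the evidence is

p\left(x\right) = \int_0^1 \theta^x \left(1 - \theta\right)^{1 - x} \, \mathrm{d}\theta = \frac{1}{2},

and the posterior density is p\left(\theta|x\right) = 2 \, \theta^x \left(1 - \theta\right)^{1 - x}.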

Now, how would we define Bayes’ theorem in a measure-theoretic way?

So, I started by defining a probability space:

\left(\Theta, \mathcal{F}_\Theta, \mathbb{P}_\Theta\right)

such that \theta \in \Theta.

I then defined another probability space:

\left(X, \mathcal{F}_X, \mathbb{P}_X\right)

such that x \in X.

From here on, I don’t know what to do. The joint probability space would be:

\left(\Theta \times X, \mathcal{F}_\Theta \otimes \mathcal{F}_X, ?\right)

but I don’t know what the measure should be.

Bayes’ theorem should then be written as follows:

? = \frac{? \cdot \mathbb{P}_\Theta}{\mathbb{P}_X}

where:

\mathbb{P}_X = \int_{\theta \in \Theta} ? \space \mathrm{d}\mathbb{P}_\Theta

but as you can see, I don’t know what the other measures are or in which probability spaces they live.

I stumbled upon this thread, but it was of little help, and I don’t know how the following measure-theoretic generalization of Bayes’ rule was reached:

P_{\Theta \mid y}(A) = \int_{x \in A} \frac{\mathrm{d}P_{\Omega \mid x}}{\mathrm{d}P_\Omega}(y) \, \mathrm{d}P_\Theta

I’m self-studying measure-theoretic probability and lack guidance, so please excuse my ignorance.

Answer

One precise formulation of Bayes’ Theorem is the following, taken verbatim from Schervish’s Theory of Statistics (1995).

The conditional distribution of \Theta given X=x is called the posterior distribution of \Theta.
The next theorem shows us how to calculate the posterior distribution of a parameter in the case in which there is a measure \nu such that each P_\theta \ll \nu.

Theorem 1.31 (Bayes’ theorem).
Suppose that X has a parametric family \mathcal{P}_0 of distributions with parameter space \Omega.
Suppose that P_\theta \ll \nu for all \theta \in \Omega, and let f_{X\mid\Theta}(x\mid\theta) be the conditional density (with respect to \nu) of X given \Theta = \theta.
Let \mu_\Theta be the prior distribution of \Theta.
Let \mu_{\Theta\mid X}(\cdot \mid x) denote the conditional distribution of \Theta given X = x.
Then \mu_{\Theta\mid X} \ll \mu_\Theta, a.s. with respect to the marginal of X, and the Radon-Nikodym derivative is

\tag{1}
\label{1}
\frac{d\mu_{\Theta\mid X}}{d\mu_\Theta}(\theta \mid x)
= \frac{f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)}

for those x such that the denominator is neither 0 nor infinite.
The prior predictive probability of the set of x values such that the denominator is 0 or infinite is 0, hence the posterior can be defined arbitrarily for such x values.
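
As a minimal illustration of how the theorem is applied (this example is mine, not Schervish’s): take \Omega = [0, 1] with \mu_\Theta the uniform distribution, \mathcal{X} = \{0, 1\} with \nu the counting measure, and P_\theta(\{1\}) = \theta, so that f_{X\mid\Theta}(x\mid\theta) = \theta^x (1-\theta)^{1-x}. The denominator in \eqref{1} equals \int_0^1 t^x (1-t)^{1-x} \, dt = 1/2 for either value of x, so

\frac{d\mu_{\Theta\mid X}}{d\mu_\Theta}(\theta \mid x) = 2 \, \theta^x (1-\theta)^{1-x},

i.e., the posterior is Beta(1 + x, 2 - x). Note that \nu here is counting measure, so the dominating measure on the sample space need not be Lebesgue measure.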


Edit 1.
The setup for this theorem is as follows (a concrete instantiation is sketched just after the list):

  1. There is some underlying probability space (S, \mathcal{S}, \Pr) with respect to which all probabilities are computed.
  2. There is a standard Borel space (\mathcal{X}, \mathcal{B}) (the sample space) and a measurable map X : S \to \mathcal{X} (the sample or data).
  3. There is a standard Borel space (\Omega, \tau) (the parameter space) and a measurable map \Theta : S \to \Omega (the parameter).
  4. The distribution of \Theta is \mu_\Theta (the prior distribution); this is the probability measure on (\Omega, \tau) given by \mu_\Theta(A) = \Pr(\Theta \in A) for all A \in \tau.
  5. The distribution of X is \mu_X (the marginal distribution mentioned in the theorem); this is the probability measure on (\mathcal{X}, \mathcal{B}) given by \mu_X(B) = \Pr(X \in B) for all B \in \mathcal{B}.
  6. There is a probability kernel P : \Omega \times \mathcal{B} \to [0, 1], denoted (\theta, B) \mapsto P_\theta(B) which represents the conditional distribution of X given \Theta. This means that

    • for each B \in \mathcal{B}, the map \theta \mapsto P_\theta(B) from \Omega into [0, 1] is measurable,
    • P_\theta is a probability measure on (\mathcal{X}, \mathcal{B}) for each \theta \in \Omega, and
    • for all A \in \tau and B \in \mathcal{B},

      \Pr(\Theta \in A, X \in B) = \int_A P_\theta(B) \, d\mu_\Theta(\theta).

    This is the parametric family of distributions of X given \Theta.

  7. We assume that there exists a measure \nu on (\mathcal{X}, \mathcal{B}) such that P_\theta \ll \nu for all \theta \in \Omega, and we choose a version f_{X\mid\Theta}(\cdot\mid\theta) of the Radon-Nikodym derivative d P_\theta / d \nu (strictly speaking, the guaranteed existence of this Radon-Nikodym derivative might require \nu to be \sigma-finite).
    This means that

    P_\theta(B) = \int_B f_{X\mid\Theta}(x \mid \theta) \, d\nu(x)

    for all B \in \mathcal{B}.
    It follows that

    \Pr(\Theta \in A, X \in B)
    = \int_A \int_B f_{X \mid \Theta}(x \mid \theta) \, d\nu(x) \, d\mu_\Theta(\theta)

    for all A \in \tau and B \in \mathcal{B}. We may assume without loss of generality (e.g., see exercise 9 in Chapter 1 of Schervish’s book) that the map (x, \theta) \mapsto f_{X\mid \Theta}(x\mid\theta) of \mathcal{X}\times\Omega into [0, \infty] is measurable. Then by Tonelli’s theorem we can change the order of integration:

    \Pr(\Theta \in A, X \in B)
    = \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x)

    for all A \in \tau and B \in \mathcal{B}.
    In particular, the marginal probability of a set B \in \mathcal{B} is

    \mu_X(B) = \Pr(X \in B)
    = \int_B \int_\Omega f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x),

    which shows that \mu_X \ll \nu, with Radon-Nikodym derivative

    \frac{d\mu_X}{d\nu}(x)
    = \int_\Omega f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta).
  8. There exists a probability kernel \mu_{\Theta \mid X} : \mathcal{X} \times \tau \to [0, 1], denoted (x, A) \mapsto \mu_{\Theta \mid X}(A \mid x), which represents the conditional distribution of \Theta given X (i.e., the posterior distribution).
    This means that

    • for each A \in \tau, the map x \mapsto \mu_{\Theta \mid X}(A \mid x) from \mathcal{X} into [0, 1] is measurable,
    • \mu_{\Theta \mid X}(\cdot \mid x) is a probability measure on (\Omega, \tau) for each x \in \mathcal{X}, and
    • for all A \in \tau and B \in \mathcal{B},

      \Pr(\Theta \in A, X \in B) = \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x).
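
To make this setup concrete, here is one standard instantiation (the specific choices are mine, not Schervish’s): take \Omega = \mathcal{X} = \mathbb{R} with their Borel \sigma-algebras, prior \mu_\Theta = N(0, 1), kernel P_\theta = N(\theta, 1), and \nu = Lebesgue measure. Then

f_{X\mid\Theta}(x\mid\theta) = \frac{1}{\sqrt{2\pi}} e^{-(x-\theta)^2/2},
\qquad
\frac{d\mu_X}{d\nu}(x) = \int_{\mathbb{R}} f_{X\mid\Theta}(x\mid\theta) \, d\mu_\Theta(\theta) = \frac{1}{\sqrt{4\pi}} e^{-x^2/4},

so \mu_X = N(0, 2), and the posterior kernel works out to \mu_{\Theta\mid X}(\cdot \mid x) = N(x/2, 1/2), which the reader can verify against \eqref{1}.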

Edit 2.
Given the setup above, the proof of Bayes’ theorem is relatively straightforward.

Proof.
Following Schervish, let

C_0 = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = 0\right\}

and

C_\infty = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = \infty\right\}

(these are the sets of potentially problematic x values for the denominator of the right-hand-side of \eqref{1}).
We have

\mu_X(C_0)
= \Pr(X \in C_0)
= \int_{C_0} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x) = 0,

and

\mu_X(C_\infty)
= \int_{C_\infty} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x)
= \begin{cases}
\infty, & \text{if $\nu(C_\infty) > 0$,} \\
0, & \text{if $\nu(C_\infty) = 0$.}
\end{cases}

Since \mu_X(C_\infty) = \infty is impossible (\mu_X is a probability measure), it follows that \nu(C_\infty) = 0, whence \mu_X(C_\infty) = 0 as well.
Thus, \mu_X(C_0 \cup C_\infty) = 0, so the set of all x \in \mathcal{X} such that the denominator of the right-hand-side of \eqref{1} is zero or infinite has zero marginal probability.

Next, consider that, if A \in \tau and B \in \mathcal{B}, then

\Pr(\Theta \in A, X \in B)
= \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x)

and simultaneously

\begin{aligned}
\Pr(\Theta \in A, X \in B)
&= \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x) \\
&= \int_B \left( \mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \right) \, d\nu(x).
\end{aligned}

Since these two expressions are \nu-integrals over B of nonnegative measurable functions of x, and they agree for every B \in \mathcal{B}, the integrands must agree \nu-almost everywhere. It follows that

\mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)
= \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta)

for all A \in \tau and \nu-a.e. x \in \mathcal{X}, and hence

\mu_{\Theta \mid X}(A \mid x)
= \int_A \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)} \, d\mu_\Theta(\theta)

for all A \in \tau and \mu_X-a.e. x \in \mathcal{X}.
Thus, for \mu_X-a.e. x \in \mathcal{X}, \mu_{\Theta\mid X}(\cdot \mid x) \ll \mu_\Theta, and the Radon-Nikodym derivative is

\frac{d\mu_{\Theta \mid X}}{d \mu_\Theta}(\theta \mid x)
= \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)},

as claimed, completing the proof.
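
(Not part of Schervish’s proof, but a useful numerical sanity check of \eqref{1}. The following Python sketch assumes a Beta(2, 3) prior and a Binomial(10, \theta) likelihood, so the posterior is known in closed form; the posterior probability of A = (0, 0.5] computed from \eqref{1} by quadrature should then match the conjugate Beta posterior.)

from scipy import integrate, stats

a, b, n, x = 2.0, 3.0, 10, 7  # prior shape parameters and an observed count (arbitrary choices)
prior = stats.beta(a, b)      # plays the role of mu_Theta

def likelihood(t):
    # f_{X|Theta}(x | theta), with nu = counting measure on {0, ..., n}
    return stats.binom.pmf(x, n, t)

# Denominator of (1): the prior predictive value of x.
evidence, _ = integrate.quad(lambda t: likelihood(t) * prior.pdf(t), 0.0, 1.0)

# Posterior probability of A = (0, 0.5] via (1):
# mu_{Theta|X}(A | x) = integral over A of [f(x|t) / evidence] d mu_Theta(t).
posterior_A, _ = integrate.quad(lambda t: likelihood(t) * prior.pdf(t) / evidence, 0.0, 0.5)

# The conjugate posterior is Beta(a + x, b + n - x); the two numbers should agree
# up to quadrature error.
print(posterior_A, stats.beta(a + x, b + n - x).cdf(0.5))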


Lastly, how do we reconcile the colloquial version of Bayes’ theorem found so commonly in statistics/machine learning literature, namely,

\tag{2}
\label{2}
p(\theta \mid x)
= \frac{p(\theta) p(x \mid \theta)}{p(x)},

with \eqref{1}?

On the one hand, the left-hand-side of \eqref{2} is supposed to represent a density of the conditional distribution of \Theta given X with respect to some unspecified dominating measure on the parameter space.
In fact, none of the dominating measures for the four different densities in \eqref{2} (all named p) are explicitly mentioned.

On the other hand, the left-hand-side of \eqref{1} is the density of the conditional distribution of \Theta given X with respect to the prior distribution.

If, in addition, the prior distribution \mu_\Theta has a density f_\Theta with respect to some (let’s say \sigma-finite) measure \lambda on the parameter space \Omega, then \mu_{\Theta \mid X}(\cdot\mid x) is also absolutely continuous with respect to \lambda for \mu_X-a.e. x \in \mathcal{X}, and if f_{\Theta \mid X} represents a version of the Radon-Nikodym derivative d\mu_{\Theta\mid X}/d\lambda, then \eqref{1} yields

\begin{aligned}
f_{\Theta \mid X}(\theta \mid x)
&= \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x) \\
&= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) \frac{d \mu_{\Theta}}{d\lambda}(\theta) \\
&= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) f_\Theta(\theta) \\
&= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)} \\
&= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t)}.
\end{aligned}

The translation between this new form and \eqref{2} is

\begin{aligned}
p(\theta \mid x) &= f_{\Theta \mid X}(\theta \mid x) = \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x), &&\text{(posterior)}\\
p(\theta) &= f_\Theta(\theta) = \frac{d \mu_\Theta}{d\lambda}(\theta), &&\text{(prior)} \\
p(x \mid \theta) &= f_{X\mid\Theta}(x\mid\theta) = \frac{d P_\theta}{d\nu}(x), &&\text{(likelihood)} \\
p(x) &= \int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t). &&\text{(evidence)}
\end{aligned}
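
To see this dictionary in action (my example, not the answer’s): take \lambda = \nu = Lebesgue measure on \mathbb{R}, prior \mu_\Theta = N(0, 1), and P_\theta = N(\theta, 1). Then

p(\theta) = \frac{1}{\sqrt{2\pi}} e^{-\theta^2/2},
\qquad
p(x\mid\theta) = \frac{1}{\sqrt{2\pi}} e^{-(x-\theta)^2/2},
\qquad
p(x) = \frac{1}{\sqrt{4\pi}} e^{-x^2/4},

and \eqref{2} gives

p(\theta\mid x) = \frac{p(\theta) \, p(x\mid\theta)}{p(x)} = \frac{1}{\sqrt{\pi}} e^{-(\theta - x/2)^2},

the density of N(x/2, 1/2), in agreement with the posterior obtained from \eqref{1} in the Gaussian instantiation sketched earlier.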

Attribution
Source: Link, Question Author: Blg Khalil, Answer Author: Artem Mavrin
