I am trying to find a measure theoretic formulation of Bayes’ theorem, when used in statistical inference, Bayes’ theorem is usually defined as:

p\left(\theta|x\right) = \frac{p\left(x|\theta\right) \cdot p\left(\theta\right)}{p\left(x\right)}

where:

- p\left(\theta|x\right): the
posterior densityof the parameter.

- p\left(x|\theta\right): the
statistical model(orlikelihood).

- p\left(\theta\right): the
prior densityof the parameter.

- p\left(x\right): the
evidence.Now how would we define Bayes’ theorem in a measure theoretic way?

So, I started by defining a probability space:\left(\Theta, \mathcal{F}_\Theta, \mathbb{P}_\Theta\right)

such that \theta \in \Theta.

I then defined another probability space:\left(X, \mathcal{F}_X, \mathbb{P}_X\right)

such that x \in X.

From here now on I don’t know what to do, the joint probability space would be:\left(\Theta \times X, \mathcal{F}_\Theta \otimes \mathcal{F}_X, ?\right)

but I don’t know what the measure should be.

Bayes’ theorem should be written as follow:? = \frac{? \cdot \mathbb{P}_\Theta}{\mathbb{P}_X}

where:

\mathbb{P}_X = \int_{\theta \in \Theta} ? \space \mathrm{d}\mathbb{P}_\Theta

but as you can see I don’t know the other measures and in which probability space they reside.

I stumbled upon this thread but it was of little help and I don’t know how was the following measure-theoretic generalization of Bayes’ rule reached:{P_{\Theta |y}}(A) = \int\limits_{x \in A} {\frac{{\mathrm d{P_{\Omega |x}}}}{{\mathrm d{P_\Omega }}}(y)\mathrm d{P_\Theta }}

I’m self-studying measure theoretic probability and lack guidance so excuse my ignorance.

**Answer**

One precise formulation of Bayes’ Theorem is the following, taken verbatim from Schervish’s *Theory of Statistics* (1995).

The conditional distribution of \Theta given X=x is called the

posterior distributionof \Theta.

The next theorem shows us how to calculate the posterior distribution of a parameter in the case in which there is a measure \nu such that each P_\theta \ll \nu.

Theorem 1.31 (Bayes’ theorem).

Suppose that X has a parametric family \mathcal{P}_0 of distributions with parameter space \Omega.

Suppose that P_\theta \ll \nu for all \theta \in \Omega, and let f_{X\mid\Theta}(x\mid\theta) be the conditional density (with respect to \nu) of X given \Theta = \theta.

Let \mu_\Theta be the prior distribution of \Theta.

Let \mu_{\Theta\mid X}(\cdot \mid x) denote the conditional distribution of \Theta given X = x.

Then \mu_{\Theta\mid X} \ll \mu_\Theta, a.s. with respect to the marginal of X, and the Radon-Nikodym derivative is

\tag{1}

\label{1}

\frac{d\mu_{\Theta\mid X}}{d\mu_\Theta}(\theta \mid x)

= \frac{f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)}

for those x such that the denominator is neither 0 nor infinite.

The prior predictive probability of the set of x values such that the denominator is 0 or infinite is 0, hence the posterior can be defined arbitrarily for such x values.

**Edit 1.**

The setup for this theorem is as follows:

- There is some underlying probability space (S, \mathcal{S}, \Pr) with respect to which all probabilities are computed.
- There is a standard Borel space (\mathcal{X}, \mathcal{B}) (the
*sample space*) and a measurable map X : S \to \mathcal{X} (the*sample*or*data*). - There is a standard Borel space (\Omega, \tau) (the
*parameter space*) and a measurable map \Theta : S \to \Omega (the*parameter*). - The distribution of \Theta is \mu_\Theta (the
*prior distribution*); this is the probability measure on (\Omega, \tau) given by \mu_\Theta(A) = \Pr(\Theta \in A) for all A \in \tau. - The distribution of X is \mu_X (the
*marginal distribution*mentioned in the theorem); this is the probability measure on (\mathcal{X}, \mathcal{B}) given by \mu_X(B) = \Pr(X \in B) for all B \in \mathcal{B}. -
There is a probability kernel P : \Omega \times \mathcal{B} \to [0, 1], denoted (\theta, B) \mapsto P_\theta(B) which represents the conditional distribution of X given \Theta. This means that

- for each B \in \mathcal{B}, the map \theta \mapsto P_\theta(B) from \Omega into [0, 1] is measurable,
- P_\theta is a probability measure on (\mathcal{X}, \mathcal{B}) for each \theta \in \Omega, and
- for all A \in \tau and B \in \mathcal{B},

\Pr(\Theta \in A, X \in B) = \int_A P_\theta(B) \, d\mu_\Theta(\theta).

This is the

*parametric family of distributions of X given \Theta*. - We assume that there exists a measure \nu on (\mathcal{X}, \mathcal{B}) such that P_\theta \ll \nu for all \theta \in \Omega, and we choose a version f_{X\mid\Theta}(\cdot\mid\theta) of the Radon-Nikodym derivative d P_\theta / d \nu (strictly speaking, the guaranteed existence of this Radon-Nikodym derivative might require \nu to be \sigma-finite).

This means that

P_\theta(B) = \int_B f_{X\mid\Theta}(x \mid \theta) \, d\nu(x)

for all B \in \mathcal{B}.

It follows that

\Pr(\Theta \in A, X \in B)

= \int_A \int_B f_{X \mid \Theta}(x \mid \theta) \, d\nu(x) \, d\mu_\Theta(\theta)

for all A \in \tau and B \in \mathcal{B}. We may assume without loss of generality (e.g., see exercise 9 in Chapter 1 of Schervish’s book) that the map (x, \theta) \mapsto f_{X\mid \Theta}(x\mid\theta) of \mathcal{X}\times\Omega into [0, \infty] is measurable. Then by Tonelli’s theorem we can change the order of integration:

\Pr(\Theta \in A, X \in B)

= \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x)

for all A \in \tau and B \in \mathcal{B}.

In particular, the marginal probability of a set B \in \mathcal{B} is

\mu_X(B) = \Pr(X \in B)

= \int_B \int_\Omega f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x),

which shows that \mu_X \ll \nu, with Radon-Nikodym derivative

\frac{d\mu_X}{d\nu}

= \int_\Omega f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta).

- There exists a probability kernel \mu_{\Theta \mid X} : \mathcal{X} \times \tau \to [0, 1], denoted (x, A) \mapsto \mu_{\Theta \mid X}(A \mid x), which represents the conditional distribution of \Theta given X (i.e., the
*posterior distribution*).

This means that- for each A \in \tau, the map x \mapsto \mu_{\Theta \mid X}(A \mid x) from \mathcal{X} into [0, 1] is measurable,
- \mu_{\Theta \mid X}(\cdot \mid x) is a probability measure on (\Omega, \tau) for each x \in \mathcal{X}, and
- for all A \in \tau and B \in \mathcal{B},

\Pr(\Theta \in A, X \in B) = \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x)

**Edit 2.**

Given the setup above, the proof of Bayes’ theorem is relatively straightforward.

*Proof.*

Following Schervish, let

C_0 = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = 0\right\}

and

C_\infty = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = \infty\right\}

(these are the sets of potentially problematic x values for the denominator of the right-hand-side of \eqref{1}).

We have

\mu_X(C_0)

= \Pr(X \in C_0)

= \int_{C_0} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x) = 0,

and

\mu_X(C_\infty)

= \int_{C_\infty} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x)

= \begin{cases}

\infty, & \text{if $\nu(C_\infty) > 0$,} \\

0, & \text{if $\nu(C_\infty) = 0$.}

\end{cases}

Since \mu_X(C_\infty) = \infty is impossible (\mu_X is a probability measure), it follows that \nu(C_\infty) = 0, whence \mu_X(C_\infty) = 0 as well.

Thus, \mu_X(C_0 \cup C_\infty) = 0, so the set of all x \in \mathcal{X} such that the denominator of the right-hand-side of \eqref{1} is zero or infinite has zero marginal probability.

Next, consider that, if A \in \tau and B \in \mathcal{B}, then

\Pr(\Theta \in A, X \in B)

= \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x)

and simultaneously

\begin{aligned}

\Pr(\Theta \in A, X \in B)

&= \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x) \\

&= \int_B \left( \mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \right) \, d\nu(x).

\end{aligned}

It follows that

\mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)

= \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta)

for all A \in \tau and \nu-a.e. x \in \mathcal{X}, and hence

\mu_{\Theta \mid X}(A \mid x)

= \int_A \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)} \, d\mu_\Theta(\theta)

for all A \in \tau and \mu_X-a.e. x \in \mathcal{X}.

Thus, for \mu_X-a.e. x \in \mathcal{X}, \mu_{\Theta\mid X}(\cdot \mid x) \ll \mu_\Theta, and the Radon-Nikodym derivative is

\frac{d\mu_{\Theta \mid X}}{d \mu_\Theta}(\theta \mid x)

= \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)},

as claimed, completing the proof.

Lastly, how do we reconcile the colloquial version of Bayes’ theorem found so commonly in statistics/machine learning literature, namely,

\tag{2}

\label{2}

p(\theta \mid x)

= \frac{p(\theta) p(x \mid \theta)}{p(x)},

with \eqref{1}?

On the one hand, the left-hand-side of \eqref{2} is supposed to represent a density of the conditional distribution of \Theta given X *with respect to some unspecified dominating measure on the parameter space*.

In fact, none of the dominating measures for the four different densities in \eqref{2} (all named p) are explicitly mentioned.

On the other hand, the left-hand-side of \eqref{1} is the density of the conditional distribution of \Theta given X *with respect to the prior distribution*.

If, in addition, the prior distribution \mu_\Theta has a density f_\Theta with respect to some (let’s say \sigma-finite) measure \lambda on the parameter space \Omega, then \mu_{\Theta \mid X}(\cdot\mid x) is also absolutely continuous with respect to \lambda for \mu_X-a.e. x \in \mathcal{X}, and if f_{\Theta \mid X} represents a version of the Radon-Nikodym derivative d\mu_{\Theta\mid X}/d\lambda, then \eqref{1} yields

\begin{aligned}

f_{\Theta \mid X}(\theta \mid x)

&= \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x) \\

&= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) \frac{d \mu_{\Theta}}{d\lambda}(\theta) \\

&= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) f_\Theta(\theta) \\

&= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)} \\

&= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t)}.

\end{aligned}

The translation between this new form and \eqref{2} is

\begin{aligned}

p(\theta \mid x) &= f_{\Theta \mid X}(\theta \mid x) = \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x), &&\text{(posterior)}\\

p(\theta) &= f_\Theta(\theta) = \frac{d \mu_\Theta}{d\lambda}(\theta), &&\text{(prior)} \\

p(x \mid \theta) &= f_{X\mid\Theta}(x\mid\theta) = \frac{d P_\theta}{d\nu}(x), &&\text{(likelihood)} \\

p(x) &= \int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t). &&\text{(evidence)}

\end{aligned}

**Attribution***Source : Link , Question Author : Blg Khalil , Answer Author : Artem Mavrin*