# Differences between prior distribution and prior predictive distribution?

While studying Bayesian statistics, I am having trouble understanding the difference between the prior distribution and the prior predictive distribution. The prior distribution is more or less fine to understand, but I find the use of the prior predictive distribution vague, and I don't see why it is different from the prior distribution.

Predictive here means predictive for observations. The prior distribution is a distribution for the parameters whereas the prior predictive distribution is a distribution for the observations.

If $$X$$ denotes the observations and we use the model (or likelihood) $$p(x \mid \theta)$$ for $$\theta \in \Theta$$, then a prior distribution is a distribution for $$\theta$$, for example $$p_\beta(\theta)$$ where $$\beta$$ is a set of hyperparameters. Note that there's no conditioning on $$\beta$$, and therefore the hyperparameters are considered fixed; this is not the case in hierarchical models, but that is not the point here.

The prior predictive distribution is the distribution of $$X$$ “averaged” over all possible values of $$\theta$$:

\begin{align*} p_\beta(x) &= \int_\Theta p(x, \theta)\, d\theta \\ &= \int_\Theta p(x \mid \theta)\, p_\beta(\theta)\, d\theta \end{align*}

This distribution is called *prior* because it does not rely on any observations.
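A quick way to build intuition is Monte Carlo: draw $$\theta$$ from the prior, then draw an observation from the likelihood given that $$\theta$$; the resulting draws follow the prior predictive. A minimal sketch in Python, where the Beta prior, binomial likelihood, and all numbers are illustrative choices (anticipating the beta-binomial example below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) choices: Beta(2, 3) prior on theta,
# Binomial(10, theta) likelihood for the observation x.
a, b, n = 2, 3, 10

theta = rng.beta(a, b, size=100_000)  # theta ~ p_beta(theta)
x = rng.binomial(n, theta)            # x ~ p(x | theta), one x per theta draw

# The empirical distribution of x approximates the prior predictive p_beta(x):
# a distribution over observations, not over parameters.
prior_pred = np.bincount(x, minlength=n + 1) / len(x)
print(prior_pred.round(3))
```

Note that no single fixed $$\theta$$ is used: each simulated observation uses its own draw of $$\theta$$, which is exactly the "averaging" in the integral above.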

We can also define the posterior predictive distribution in the same way: if we have a sample $$X = (X_1, \dots, X_n)$$, the posterior predictive distribution is:

\begin{align*} p_\beta(x \mid X) &= \int_\Theta p(x, \theta \mid X)\, d\theta \\ &= \int_\Theta p(x \mid \theta, X)\, p_\beta(\theta \mid X)\, d\theta \\ &= \int_\Theta p(x \mid \theta)\, p_\beta(\theta \mid X)\, d\theta. \end{align*}
The last line is based on the assumption that the upcoming observation is independent of $$X$$ given $$\theta$$.

Thus the posterior predictive distribution is constructed in the same way as the prior predictive distribution, but whereas in the latter we weight by $$p_\beta(\theta)$$, in the former we weight by $$p_\beta(\theta \mid X)$$, that is, by our “updated” knowledge about $$\theta$$.

**Example: Beta-Binomial**

Suppose our model is $$X \mid \theta \sim {\rm Bin}(n,\theta)$$, i.e. $$P(X = x \mid \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$$.
Here $$\Theta = [0,1]$$.

We also assume a beta prior distribution for $$\theta$$, $$\beta(a,b)$$, where $$(a,b)$$ is the set of hyperparameters.

The prior predictive distribution, $$p_{a,b}(x)$$, is the beta-binomial distribution with parameters $$(n,a,b)$$.

This discrete distribution gives the probability of getting $$k$$ successes out of $$n$$ trials given the hyperparameters $$(a,b)$$ on the probability of success.
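One can verify this claim numerically: integrating $$p(x \mid \theta)\,p_{a,b}(\theta)$$ over $$\Theta = [0,1]$$ should reproduce the beta-binomial pmf. A sketch assuming SciPy is available, with arbitrary values of $$a$$, $$b$$, $$n$$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta, binom, betabinom

a, b, n = 2.0, 3.0, 10  # arbitrary hyperparameters and trial count

for x in range(n + 1):
    # Numerically integrate p(x | theta) * p_{a,b}(theta) over theta in [0, 1] ...
    marginal, _ = quad(lambda t: binom.pmf(x, n, t) * beta.pdf(t, a, b), 0, 1)
    # ... and compare with the Beta-Binomial(n, a, b) pmf.
    assert abs(marginal - betabinom.pmf(x, n, a, b)) < 1e-8

print("prior predictive matches Beta-Binomial(n, a, b)")
```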

Now suppose we observe $$n_1$$ draws $$(x_1, \dots, x_{n_1})$$ with $$m$$ successes.

Since the binomial and beta distributions are conjugate distributions we have:
\begin{align*} p(\theta \mid X = m) &\propto \theta^m (1 - \theta)^{n_1-m} \times \theta^{a-1}(1-\theta)^{b-1}\\ &\propto \theta^{a+m-1}(1-\theta)^{n_1+b-m-1} \\ &\propto \beta(a+m, n_1+b-m) \end{align*}

Thus $$\theta \mid X$$ follows a beta distribution with parameters $$(a+m, n_1+b-m)$$.
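The conjugate update can be checked against a brute-force computation of likelihood × prior on a grid (a sketch with made-up numbers for $$a$$, $$b$$, $$n_1$$, $$m$$; SciPy assumed):

```python
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 3.0   # prior hyperparameters (illustrative)
n1, m = 20, 7     # observed: m successes in n1 trials (illustrative)

# Closed-form posterior from conjugacy: Beta(a + m, b + n1 - m).
posterior = beta(a + m, b + n1 - m)

# Brute force: evaluate likelihood * prior on a fine grid and normalise.
thetas = np.linspace(1e-6, 1 - 1e-6, 20_001)
unnorm = binom.pmf(m, n1, thetas) * beta.pdf(thetas, a, b)
grid_posterior = unnorm / (unnorm.sum() * (thetas[1] - thetas[0]))

# The grid posterior should agree with the conjugate Beta pointwise.
print(np.max(np.abs(grid_posterior - posterior.pdf(thetas))))
```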

Then $$p_{a,b}(x \mid X = m)$$, the distribution of the number of successes in $$n_2$$ future trials, is also a beta-binomial distribution, but this time with parameters $$(n_2, a+m, b+n_1-m)$$ rather than $$(n_2, a, b)$$.

In summary: with a $$\beta(a,b)$$ prior distribution and a $${\rm Bin}(n,\theta)$$ likelihood, if we observe $$m$$ successes out of $$n_1$$ trials, the posterior predictive distribution is a beta-binomial with parameters $$(n_2, a+m, b+n_1-m)$$. Note that $$n_2$$ and $$n_1$$ play different roles here, since the posterior predictive distribution answers the question:

> Given my current knowledge about $$\theta$$ after observing $$m$$ successes out of $$n_1$$ trials, i.e. the $$\beta(a+m, b+n_1-m)$$ posterior, what is the probability of observing $$k$$ successes out of $$n_2$$ additional trials?
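To close the loop, the closed-form posterior predictive can be compared with simulation: draw $$\theta$$ from the posterior, then draw $$k$$ from $${\rm Bin}(n_2, \theta)$$. A sketch with the same illustrative numbers as before, assuming SciPy:

```python
import numpy as np
from scipy.stats import betabinom

a, b = 2.0, 3.0        # prior hyperparameters (illustrative)
n1, m, n2 = 20, 7, 5   # m successes in n1 past trials; n2 future trials

# Closed-form posterior predictive: Beta-Binomial(n2, a + m, b + n1 - m).
post_pred = betabinom(n2, a + m, b + n1 - m)

# Simulation: theta from the Beta(a + m, b + n1 - m) posterior,
# then k successes out of n2 new trials given that theta.
rng = np.random.default_rng(1)
theta = rng.beta(a + m, b + n1 - m, size=200_000)
k = rng.binomial(n2, theta)
empirical = np.bincount(k, minlength=n2 + 1) / len(k)

print(np.round(empirical, 3))
print(np.round(post_pred.pmf(np.arange(n2 + 1)), 3))
```

The two printed rows should agree up to Monte Carlo error, illustrating that the posterior predictive is the likelihood averaged over the *posterior*, just as the prior predictive is the likelihood averaged over the prior.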

I hope this is useful and clear.