While studying Bayesian statistics, I am having trouble understanding the difference between the prior distribution and the prior predictive distribution. The prior distribution is reasonably easy to understand, but I find it unclear what the prior predictive distribution is used for and how it differs from the prior distribution.

**Answer**

Predictive here means predictive for observations. The prior distribution is a distribution for the parameters whereas the prior predictive distribution is a distribution for the observations.

If X denotes the observations and we use the model (or likelihood) p(x \mid \theta) for \theta \in \Theta, then a prior distribution is a distribution for \theta, for example p_\beta(\theta), where \beta is a set of hyperparameters. Note that there is no conditioning on \beta, so the hyperparameters are considered fixed; this is not the case in hierarchical models, but that is not the point here.

The *prior predictive* distribution is the distribution of X “averaged” over all possible values of \theta:

\begin{align*}
p_\beta(x) &= \int_\Theta p(x, \theta) \, d\theta \\
&= \int_\Theta p(x \mid \theta) \, p_\beta(\theta) \, d\theta
\end{align*}

This distribution is *prior* as it does not rely on any observation.
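This "averaging over \theta" has a direct sampling interpretation: draw \theta from the prior, then draw x from the likelihood given that \theta. A minimal sketch, using hypothetical numbers for illustration (a Beta(2, 3) prior on \theta and a Binomial(10, \theta) likelihood):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical choices for illustration: a Beta(2, 3) prior on theta
# and a Binomial(10, theta) likelihood for the observation x.
a, b, n = 2.0, 3.0, 10

# Sampling from the prior predictive: first draw theta from the prior,
# then draw x from the likelihood given that theta.
theta = rng.beta(a, b, size=100_000)
x = rng.binomial(n, theta)

# By iterated expectation, the marginal mean of x is
# n * E[theta] = n * a / (a + b) = 4.0 here.
print(x.mean())
```

Note that no data enter anywhere: every quantity used is fixed before any observation is made.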

We can also define in the same way the *posterior predictive distribution*, that is if we have a sample X = (X_1, \dots, X_n), the posterior predictive distribution is:

\begin{align*}
p_\beta(x \mid X) &= \int_\Theta p(x, \theta \mid X) \, d\theta \\
&= \int_\Theta p(x \mid \theta, X) \, p_\beta(\theta \mid X) \, d\theta \\
&= \int_\Theta p(x \mid \theta) \, p_\beta(\theta \mid X) \, d\theta.
\end{align*}

The last line is based on the assumption that the upcoming observation is independent of X given \theta.

Thus the posterior predictive distribution is constructed in the same way as the prior predictive distribution, except that we weight with p_\beta(\theta \mid X), that is, with our "updated" knowledge about \theta, rather than with p_\beta(\theta).
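The sampling recipe changes in exactly one place: \theta is drawn from the posterior instead of the prior. A sketch with hypothetical numbers (suppose MCMC or conjugacy has already given us a Beta(5, 7) posterior, and the next observation is Binomial(10, \theta)):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup for illustration: pretend the posterior
# p(theta | X) is Beta(5, 7); in practice these draws would come
# from conjugate updating or from MCMC.
posterior_theta = rng.beta(5.0, 7.0, size=100_000)

# Posterior predictive sampling: the same recipe as for the prior
# predictive, except theta is drawn from the posterior.
x_new = rng.binomial(10, posterior_theta)

# The posterior predictive mean is 10 * E[theta | X] = 10 * 5/12.
print(x_new.mean())
```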

**Example : Beta-Binomial**

Suppose our model is X \mid \theta \sim {\rm Bin}(n,\theta), i.e. P(X = x \mid \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}.

Here \Theta = [0,1].

We also assume a beta prior distribution for \theta, \beta(a,b), where (a,b) is the set of hyperparameters.

The *prior predictive distribution*, p_{a,b}(x), is the beta-binomial distribution with parameters (n,a,b).
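Carrying out the integral from the definition above makes this explicit; writing B(\cdot,\cdot) for the beta function that normalises the prior:

\begin{align*}
p_{a,b}(x) &= \int_0^1 \binom{n}{x}\theta^x(1-\theta)^{n-x} \, \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)} \, d\theta \\
&= \binom{n}{x} \frac{B(a+x,\, b+n-x)}{B(a,b)},
\end{align*}

which is exactly the beta-binomial pmf.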

This discrete distribution gives the probability of getting k successes out of n trials given the hyper-parameters (a,b) on the probability of success.

Now suppose we observe n_1 draws (x_1, \dots, x_{n_1}) with m successes.

Since the binomial and beta distributions are conjugate distributions we have:

\begin{align*}
p(\theta \mid X = m) &\propto \theta^m (1-\theta)^{n_1-m} \times \theta^{a-1}(1-\theta)^{b-1} \\
&\propto \theta^{a+m-1}(1-\theta)^{n_1+b-m-1} \\
&\propto \beta(a+m,\, n_1+b-m)
\end{align*}

Thus \theta \mid X follows a beta distribution with parameters (a+m,n_1+b-m).

Then, p_{a,b}(x \mid X = m) is also a beta-binomial distribution but this time with parameters (n_2,a+m,b+n_1-m) rather than (n_2,a,b).

With a \beta(a,b) prior distribution and a {\rm Bin}(n,\theta) likelihood, if we observe m successes out of n_1 trials, the posterior predictive distribution is a beta-binomial with parameters (n_2,a+m,b+n_1-m). Note that n_1 and n_2 play different roles here, since the posterior predictive distribution answers the question:

Given my current knowledge of \theta after observing m successes out of n_1 trials, i.e. the \beta(a+m,b+n_1-m) posterior, what probability do I have of observing k successes out of n_2 additional trials?
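The two roles of n_1 and n_2 show up clearly in code. A sketch with hypothetical numbers (Beta(2, 3) prior, m = 7 successes in n_1 = 20 observed trials, n_2 = 10 future trials), checking the conjugate update against Monte Carlo:

```python
import numpy as np
from scipy.stats import betabinom

# Hypothetical numbers for illustration.
a, b = 2.0, 3.0     # prior hyperparameters
n1, m = 20, 7       # observed trials and successes
n2 = 10             # number of future trials

# Posterior predictive: beta-binomial with the updated parameters.
post_pred = betabinom(n2, a + m, b + n1 - m)

# Monte Carlo check: draw theta from the Beta(a + m, b + n1 - m)
# posterior, then future counts from Bin(n2, theta).
rng = np.random.default_rng(2)
theta = rng.beta(a + m, b + n1 - m, size=200_000)
x_future = rng.binomial(n2, theta)

print(post_pred.mean())   # n2 * (a + m) / (a + b + n1)
print(x_future.mean())    # should be close to the line above
```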

I hope this is useful and clear.

**Attribution**
*Source : Link , Question Author : oceanus , Answer Author : periwinkle*