# Using pseudo-priors properly in Bayesian model selection

One approach to model comparison in a Bayesian framework uses a Bernoulli indicator variable to determine which of two models is likely to be the “true model”. When applying MCMC-based tools for fitting such a model, it is common to use pseudo-priors to improve mixing in the chains. See here for a very accessible treatment of why pseudo-priors are useful.

In their seminal paper on the topic, Carlin & Chib (p. 475) state that “the form of [the pseudo-prior] is irrelevant,” which I take to mean that it should not affect posterior inference based on the model (though it might affect MCMC mixing during model fitting). However, my intuition is that the form of the pseudo-prior DOES matter. I asked about this previously in this question. @Xi’an commented (4th comment): “inference about which model is correct does not depend on the pseudo-priors”.

Recently I read comments from Martyn Plummer that contradict my understanding of Carlin & Chib. Martyn says: “In order for the Carlin-Chib method to work, the pseudo-prior must match the posterior when the model is true.”

(I am NOT saying that Plummer contradicts Carlin & Chib; only that he contradicts my understanding of Carlin & Chib’s claim).

All of this leaves me with five questions:

1. What is going on here? Provided that the model converges and yields a good effective sample size from the posterior, will my inference about which variables to include in a model depend on my pseudo-prior?
2. If not, how do we square this with my intuition and Plummer’s comment? If so, how do we square this with Carlin & Chib’s paper and Xi’an’s comment (4th comment)?
3. If my understanding of Plummer’s comment is correct, and the pseudo-priors must correspond to the posterior when the variable is included… does this mean it’s impermissible to use pseudo-priors corresponding exactly to the true priors? This would mean that pseudo-priors are much more than a convenient technique to improve the mixing in the MCMC!!
4. What if the indicator variable turns on and off a part of the model with several parameters (for example, a random effect with a grand mean, a variance, and n group effects)? Which of the following are permissible (in order of how confident I am that the approach is permissible)? Is there a better approach that I do not list?

i. Use a pseudo-prior that approximates the full joint posterior distribution of all of the parameters.

ii. If mixing is acceptably non-atrocious, don’t use pseudo-priors at all (i.e. use pseudo-priors equivalent to the true priors).

iii. Use a pseudo-prior based on the univariate posterior distributions for each parameter, but don’t worry about how they are jointly distributed.

iv. Following the apparently plain language of Carlin & Chib, use any pseudo-prior that gives computationally efficient mixing in the MCMC chains, as “the form of [the pseudo-prior] is irrelevant”.

5. What does @Xi’an mean in the first comment on my question in saying “the pseudo-priors need correction in an importance sampling type of correction”?

---

> 1. What is going on here?

This is a very generic question, with the obvious answer being to study Carlin & Chib (1995) in detail. The essential idea is to consider the joint parameter $(m,\theta_1,\theta_2)$ where $m$ denotes the model index ($m=1,2$) and $\theta_1,\theta_2$ the parameters of both models, in the sense that the data comes from the density
$$f(x|m,\theta_1,\theta_2)=f_m(x|\theta_m)$$
i.e. one of the two parameters $\theta_{3-m}$ is superfluous once the model index $m$ is set.

Once this completion is done, a prior has to be chosen on the triplet $(m,\theta_1,\theta_2)$, which is
$$\pi(m,\theta_1,\theta_2)=\pi(m)\pi_m(\theta_m)\tilde\pi_m(\theta_{3-m})$$
where I denote by $\pi(m)$ and $\pi_m(\theta_m)$ the true priors on the model index and on the parameter of each model. The additional $\tilde\pi_m(\theta_{3-m})$ is free because the posterior on $\theta_{3-m}$ is equal to the prior:
$$\pi(m,\theta_1,\theta_2|x)=\pi(m|x)\pi_m(\theta_m|x)\tilde\pi_m(\theta_{3-m})$$
The data do not impact a parameter on which they do not depend. In particular, integrating out $\theta_{3-m}$ leaves $\pi(m|x)\propto\pi(m)\int f_m(x|\theta_m)\pi_m(\theta_m)\,\text{d}\theta_m$, since the pseudo-prior integrates to one. Thus inference about $m$ and $\theta_m$ is not impacted by the choice of $\tilde\pi_m(\cdot)$. In practice, this means that simulating from the augmented model produces

1. a frequency for each model, approximating the posterior probability of that model;
2. a sequence of parameters $\theta_m$ when $m$ is the current model index, to be used for inference on that parameter;
3. a sequence of parameters $\theta_{3-m}$ when $m$ is the current model index, to be ignored.
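As an illustration, the whole scheme can be sketched on a toy conjugate example (the models, priors, and numbers below are invented for the demonstration): two normal-mean models differing only in their prior, with the pseudo-priors set to each model's closed-form posterior, as Carlin & Chib recommend.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: unit-variance normal observations with unknown mean
x = rng.normal(0.5, 1.0, size=20)
n, xbar = len(x), x.mean()

# Two competing models for the mean, differing only in their priors:
#   Model 1: theta_1 ~ N(0, 1)    Model 2: theta_2 ~ N(3, 1)
priors = {1: (0.0, 1.0), 2: (3.0, 1.0)}  # (mean, sd) of each true prior
prior_m = {1: 0.5, 2: 0.5}               # prior model probabilities pi(m)

def conj_post(mu0, sd0):
    """Conjugate normal posterior for a normal mean with known unit variance."""
    prec = 1.0 / sd0**2 + n
    return (mu0 / sd0**2 + n * xbar) / prec, 1.0 / np.sqrt(prec)

# Pseudo-priors: match each model's own posterior (closed-form here)
pseudo = {k: conj_post(*priors[k]) for k in (1, 2)}

def log_norm(y, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((y - mu) / sd)**2

def log_joint(m, th):
    """log pi(m) + log pi_m(theta_m) + log pseudo(theta_{3-m}) + log f_m(x|theta_m)."""
    other = 3 - m
    return (np.log(prior_m[m])
            + log_norm(th[m], *priors[m])
            + log_norm(th[other], *pseudo[other])
            + np.sum(log_norm(x, th[m], 1.0)))

# Gibbs sweeps over (theta_1, theta_2, m)
m, th, visits = 1, {1: 0.0, 2: 3.0}, {1: 0, 2: 0}
for _ in range(20000):
    th[m] = rng.normal(*conj_post(*priors[m]))  # active parameter: conditional posterior
    th[3 - m] = rng.normal(*pseudo[3 - m])      # inactive parameter: its pseudo-prior
    l1, l2 = log_joint(1, th), log_joint(2, th)
    m = 1 if rng.random() < 1.0 / (1.0 + np.exp(l2 - l1)) else 2
    visits[m] += 1

print("estimated P(M = 1 | x):", visits[1] / 20000)
```

Since the data are generated near the first prior's centre, the visit frequency for Model 1 should dominate; changing the pseudo-prior (while keeping the chain long enough to mix) leaves this frequency unchanged, which is exactly the point of items 1–3 above.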
> 2. How do we square this with my intuition and Plummer’s comment?

What Martyn Plummer means in his comment is that the choice of pseudo-prior on the superfluous parameter $\theta_{3-m}$ does not matter for inference, but the prior on the current model’s parameter $\theta_m$ must be the true prior $\pi_m$. This is 100% coherent with the Carlin & Chib (1995) paper.

> 3. Does this mean it’s impermissible to use pseudo-priors corresponding exactly to the true priors?

Pseudo-priors can be taken as the true priors, provided these are proper. But as Carlin & Chib (1995) indicate, it is much more efficient to take an approximation of the true posterior, $\pi_{3-m}(\theta_{3-m}|x)$, an approximation that can be obtained from a preliminary MCMC run of each model.
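When no closed-form posterior is available, the preliminary-run idea can be implemented by moment-matching a parametric pseudo-prior to the pilot draws. A minimal sketch, where the pilot chain is simulated as a stand-in for a real MCMC run:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pilot MCMC chain from model 2's posterior pi_2(theta_2 | x);
# in practice, fit model 2 on its own and collect the draws of theta_2.
pilot_draws = rng.normal(2.1, 0.4, size=5000)

# Moment-matched normal pseudo-prior for theta_2
mu_tilde = pilot_draws.mean()
sd_tilde = pilot_draws.std(ddof=1)
print(f"pseudo-prior for theta_2: N({mu_tilde:.2f}, {sd_tilde:.2f}^2)")
```

A normal family is only one convenient choice; any proper density that roughly tracks the pilot posterior will do, with closer matches yielding better mixing.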

> 4. What if the indicator variable turns on and off a part of the model with several parameters?

The resolution to this conundrum is to consider different sets of parameters for all different models, i.e., to have no common parameters between any two models. If you are in a variable selection problem, this means using a different parameter and a different notation for the coefficient of variable $X_1$ when $X_2$ is part of the regression and when $X_2$ is not part of the regression. From this point, use any pseudo-prior you want on the superfluous parameters.

> 5. What does @Xi’an mean in the first comment?

I mean that if the sampler is run with model probabilities that differ from the prior probabilities $\pi(m)$ (for instance, to force more balanced visits between the models), the posterior probability of each model estimated by the simulated frequency must be corrected by the corresponding importance weight.
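A sketch of that correction in one common setting (the numbers below are illustrative): the sampler is run with a balanced working prior $\rho(m)$ so both models are visited often, and the raw visit frequencies are then reweighted by $\pi(m)/\rho(m)$ and renormalised.

```python
# Working prior rho(m) used inside the sampler vs. target prior pi(m)
rho = {1: 0.5, 2: 0.5}     # chosen to balance visits between the models
pi = {1: 0.9, 2: 0.1}      # the prior actually intended (illustrative values)
freq = {1: 0.30, 2: 0.70}  # raw visit frequencies from the chain (illustrative)

# Importance-type correction: reweight the frequencies, then renormalise
w = {m: freq[m] * pi[m] / rho[m] for m in freq}
total = sum(w.values())
posterior = {m: w[m] / total for m in w}
print(posterior)  # model 1's probability rises once pi(1) = 0.9 is restored
```

The same reweighting applies with any proper working prior; the only requirement is that $\rho(m) > 0$ whenever $\pi(m) > 0$.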