# What MCMC algorithms/techniques are used for discrete parameters?

I know a fair amount about fitting continuous parameters particularly gradient-based methods, but not much about fitting discrete parameters.

What are commonly used MCMC algorithms/techniques for fitting discrete parameters? Are there algorithms which are both fairly general and fairly powerful? Are there algorithms which deal with the curse of dimensionality well? For example, I would say Hamiltonian MCMC is general, powerful and scales well.

Sampling from an arbitrary discrete distribution seems more difficult than sampling from a continuous distribution, but I am curious what the state of the art is.

Edit: JMS asked me to elaborate.

I don’t have specific applications in mind, but here are some kinds of models I am imagining:

• Model selection between several kinds of continuous regression models. You have a discrete single ‘model’ parameter
• A continuous model where each observation has a possibility of being an ‘outlier’ and drawn from a much more dispersed distribution. I suppose this is a mixture model.

I would expect many models to include both continuous and discrete parameters.

So the simple answer is yes: Metropolis-Hastings and its special case Gibbs sampling 🙂 General and powerful; whether or not it scales depends on the problem at hand.

I’m not sure why you think sampling an arbitrary discrete distribution is more difficult than an arbitrary continuous distribution. If you can calculate the discrete distribution and the sample space isn’t huge then it’s much, much easier (unless the continuous distribution is standard, perhaps). Calculate the likelihood $f(k)$ for each category, then normalise to get the probabilities $P(\tilde k = k) = f(k)/\sum f(k)$ and use inverse transform sampling (imposing an arbitrary order on $k$).

Have you got a particular model in mind? There are all sorts of MCMC approaches to fitting mixture models, for example, where the latent component assignments are discrete parameters. These range from very simple (Gibbs) to quite complex.

How big is the parameter space? Is it potentially enormous (eg in the mixture model case, it’s N by the number of mixture components)? You might not need anything more than a Gibbs sampler, since conjugacy is no longer an issue (you can get the normalizing constant directly so you can compute the full conditionals). In fact griddy Gibbs used to be popular for these cases, where a continuous prior is discretized to ease computation.

I don’t think there is a particular “best” for all problems having a discrete parameter space any more than there is for the continuous case. But if you tell us more about the models you’re interested in perhaps we can make some recommendations.

Your first example has pretty long history, as you might imagine. A recent-ish review is in [1], see also [2]. I’ll try to give some details here: A relevant example is stochastic search variable selection. The initial formulation was to use absolutely continuous priors like $p(\beta)\sim \pi N(\beta; 0, \tau) + (1-\pi) N(\beta, 0, 1000\tau)$. That actually turns out to work poorly compared to priors like $p(\beta)\sim \pi \delta_0 (\beta) + (1-\pi) N(\beta, 0, \tau)$ where $\delta_0$ is a point mass at 0. Note that both fit into your original formulation; an MCMC approach would usually proceed by augmenting $\beta$ with a (discrete) model indicator (say $Z$). This is equivalent to a model index; if you have $Z_1\dots, Z_p$ then obviously you can remap the $2^p$ possible configurations to numbers in $1:2^p$.

So how can you improve the MCMC? In a lot of these models you can sample from $p(Z, \beta|y)$ by composition, ie using that $p(Z, \beta|y) = p(\beta | Y, Z)p(Z|Y)$. Block updates like this can tremendously improve mixing since the correlation between $Z$ and $\beta$ is now irrelevant to the sampler

SSVS embeds the whole model space in one big model. Often this is easy to implement but gives works poorly. Reversible jump MCMC is a different kind of approach which lets the dimension of the parameter space vary explicitly; see [3] for a review and some practical notes. You can find more detailed notes on implementation in different models in the literature, I’m sure.

Oftentimes a complete MCMC approach is infeasible; say you have a linear regression with $p=1000$ variables and you’re using an approach like SSVS. You can’t hope for your sampler to converge; there’s not enough time or computing power to visit all those model configurations, and you’re especially hosed if some of your variables are even moderately correlated. You should be especially skeptical of people trying to estimate things like variable inclusion probabilities in this way. Various stochastic search algorithms used in conjunction with MCMC have been proposed for such cases. One example is BAS [4], another is in [5] (Sylvia Richardson has other relevant work too); most of the others I’m aware of are geared toward a particular model.

A different approach which is gaining in popularity is to use absolutely continuous shrinkage priors that mimic model averaged results. Typically these are formulated as scale mixtures of normals. The Bayesian lasso is one example, which is a special case of normal-gamma priors and a limiting case of normal-exponential-gamma priors. Other choices include the horseshoe and the general class of normal distributions with inverted beta priors on their variance. For more on these, I’d suggest starting with [6] and walking back through the references (too many for me to replicate here 🙂 )

I’ll add more about outlier models later if I get a chance; the classic reference is [7]. They’re very similar in spirit to shrinkage priors. Usually they’re pretty easy to do with Gibbs sampling.

Perhaps not as practical as you were hoping for; model selection in particular is a hard problem and the more elaborate the model the worse it gets. Block update wherever possible is the only piece of general advice I have. Sampling from a mixture of distributions you will often have the problem that membership indicators and component parameters are highly correlated. I also haven’t touched on label switching issues (or lack of label switching); there is quite a bit of literature there but it’s a little out of my wheelhouse.

Anyway, I think it’s useful to start with some of the references here, to get a feeling for the different ways that others are approaching similar problems.

[1] Merlise Clyde and E. I. George. Model Uncertainty Statistical Science 19 (2004): 81–94.
http://www.isds.duke.edu/~clyde/papers/statsci.pdf

[2]http://www-personal.umich.edu/~bnyhan/montgomery-nyhan-bma.pdf

[3] Green & Hastie Reversible jump MCMC (2009)
http://www.stats.bris.ac.uk/~mapjg/papers/rjmcmc_20090613.pdf

[7] Mike West Outlier models and prior distributions in Bayesian linear regression (1984) JRSS-B