Why does thinning work in Bayesian inference?

In Bayesian inference, one needs to determine the posterior distribution of the parameters from the prior distribution and the likelihood of the data. As this computation might not be possible analytically, simulation methods may be required.

In MCMC (Markov chain Monte Carlo) algorithms, a Markov chain is generated whose limiting distribution is the desired posterior distribution. In practice, it may be difficult to assess whether convergence has been achieved. When you stop a Markov chain at a finite step, you do not have independent realizations, since each generated point depends on the previous ones. My understanding is that this dependence weakens as the chain advances, and that in the limit you would obtain independent realizations from the posterior.

So suppose that we have stopped the Markov chain at a finite step and that the resulting sample still has significant autocorrelation, so that we do not have independent draws from the posterior distribution. Thinning consists of keeping only every $k$-th point of the sample. Since the retained points are further apart in the chain, the dependence between them is smaller and we obtain an approximately independent sample. What I do not understand about this procedure is that, although we have an (approximately) independent sample, we are still not simulating from the posterior distribution; otherwise the whole sample would already be independent.

So in my view, thinning gives more independence, which is certainly needed to approximate statistics via Monte Carlo simulation and the law of large numbers. But it does not accelerate convergence to the posterior distribution. At least, I do not know of any mathematical evidence that it does. So we have actually gained nothing (apart from lower storage and memory demands). Any insight on this issue would be appreciated.

Answer

Thinning has nothing to do with Bayesian inference, but everything to do with computer-based pseudo-random simulation.

The whole point of generating a Markov chain $(\theta_t)$ via MCMC algorithms is to obtain simulations from the posterior distribution, $\pi(\cdot)$, more easily. The penalty for doing so is correlation between the simulations. (Regarding the question, this correlation persists even asymptotically in $t$.) By subsampling or thinning out the Markov chain $(\theta_t)$, this correlation is usually (but not always) reduced as the thinning interval grows.
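To see the effect of the thinning interval on correlation, here is a minimal sketch. It does not use a real MCMC sampler: as an assumption for illustration, a stationary AR(1) chain with $N(0,1)$ marginal stands in for the MCMC output, and the names `ar1_chain` and `lag1_autocorr` are hypothetical helpers. For this toy chain the lag-1 autocorrelation after keeping every $k$-th draw is $\rho^k$.

```python
import numpy as np

def ar1_chain(n, rho, rng):
    """Simulate a stationary AR(1) chain with N(0, 1) marginal distribution.

    This is a stand-in for MCMC output (an assumption for illustration).
    """
    x = np.empty(n)
    x[0] = rng.standard_normal()  # start in the stationary distribution
    innov = rng.standard_normal(n) * np.sqrt(1.0 - rho**2)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + innov[t]
    return x

def lag1_autocorr(x):
    """Empirical lag-1 autocorrelation of a one-dimensional sequence."""
    xc = x - x.mean()
    return float(np.dot(xc[:-1], xc[1:]) / np.dot(xc, xc))

rng = np.random.default_rng(0)
rho = 0.9
chain = ar1_chain(200_000, rho=rho, rng=rng)

for k in (1, 5, 20):
    thinned = chain[::k]  # keep every k-th draw
    print(f"k={k:2d}  empirical={lag1_autocorr(thinned):.3f}  theory={rho**k:.3f}")
```

The correlation between retained draws shrinks with $k$, but every draw, kept or discarded, has the same marginal distribution, which is why thinning changes dependence and not the target.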

Thinning, however, has nothing to do with convergence of the Markov chain to the stationary distribution $\pi(\cdot)$, since it is a post-processing of the simulated Markov chain $(\theta_t)$. Thinning only makes sense once the chain is (approximately) stationary. Removing early values of the Markov chain to eliminate the impact of the starting value is called burn-in or warmup.
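The distinction between the two operations can be sketched on a toy example. As an assumption for illustration (not a real posterior), the chain below is an AR(1) process with stationary distribution $N(0,1)$, started deliberately far away at a hypothetical value `x0 = 500`; `burn_in` and `k` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n, x0 = 0.9, 5000, 500.0   # x0 is deliberately far from the stationary mean 0
burn_in, k = 200, 10            # arbitrary illustrative choices

# Toy AR(1) chain standing in for MCMC output (an assumption, not a real sampler).
chain = np.empty(n)
chain[0] = x0
noise = rng.standard_normal(n) * np.sqrt(1.0 - rho**2)
for t in range(1, n):
    chain[t] = rho * chain[t - 1] + noise[t]

mean_thinned_only = chain[::k].mean()         # thinning alone keeps the early transient
mean_after_burn_in = chain[burn_in:].mean()   # burn-in discards the early transient
mean_both = chain[burn_in::k].mean()          # burn-in first, then thinning

print("thinned only:      ", mean_thinned_only)
print("burn-in only:      ", mean_after_burn_in)
print("burn-in + thinning:", mean_both)
```

In this sketch the thinned-only average remains pulled toward the starting value, while both averages computed after burn-in are close to the stationary mean of 0: discarding early draws addresses the starting value, and thinning does not.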

Note furthermore that thinning is rarely helpful when considering approximations of posterior expectations (by the Ergodic Theorem)
$$\frac{1}{T}\sum_{t=1}^T h(\theta_t) \longrightarrow \int h(\theta)\pi(\theta)\,\text{d}\theta$$
since using the entire (unthinned) chain most often reduces the variance of the approximation. If specific needs call for an almost iid sample from $\pi(\cdot)$, thinning may appeal, but except for specific situations where renewal can be implemented, there is no guarantee that the sample will be either “i” or “id”… The alternative solution of running several chains independently in parallel produces independent samples but again with rarely a guarantee that the points are exactly distributed from $\pi(\cdot)$.
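The variance claim for ergodic averages can be checked numerically. The following is a sketch under stated assumptions: a stationary AR(1) chain with $N(0,1)$ marginal replaces real MCMC output, $h(\theta)=\theta$ so that the target expectation is 0, and the values of `rho`, `T`, `k` and `n_rep` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, T, k, n_rep = 0.5, 1000, 10, 1000

# Simulate n_rep independent stationary AR(1) chains in parallel (one per row).
# These toy chains are an assumption standing in for MCMC output.
chains = np.empty((n_rep, T))
chains[:, 0] = rng.standard_normal(n_rep)
scale = np.sqrt(1.0 - rho**2)
for t in range(1, T):
    chains[:, t] = rho * chains[:, t - 1] + scale * rng.standard_normal(n_rep)

full_means = chains.mean(axis=1)           # ergodic average over all T draws
thin_means = chains[:, ::k].mean(axis=1)   # average over every k-th draw only

var_full = full_means.var()   # spread of the estimator across replications
var_thin = thin_means.var()
print("variance of estimator, full chain:   ", var_full)
print("variance of estimator, thinned chain:", var_thin)
```

For this toy chain the thinned draws are nearly uncorrelated, yet the average over them is more variable than the average over the full chain, because thinning throws away nine tenths of the draws and the discarded draws, although correlated, still carry information.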

Attribution
Source: Link, Question Author: user269666, Answer Author: Xi’an