Intuitively, what does the warning “There were 214 divergent transitions after warmup.” mean?
I understand that the samples obtained are unreliable, and that increasing adapt_delta and max_treedepth, or lowering the step size, can help. Reparameterizing the model can also help.
I would like to know what is actually happening when a divergent transition occurs.
In Section 14.5, Divergent Transitions, of the Stan Reference Manual it states: “The positions along the simulated trajectory after the Hamiltonian diverges will never be selected as the next draw of the MCMC algorithm”. Does that mean that after a divergent transition, all subsequent transitions are rejected? If so, the last accepted proposal (the current state) would be returned many times in the posterior sample.
I have noticed that I get many divergent transitions when working with hierarchical models (this is well documented in blogs on the internet), but some of my parameters seem to be converging while others are not. However, in a joint model each proposal is a parameter vector, so if a proposal is rejected after a divergent transition (along with all subsequent ones, as described above), shouldn’t all of the marginal distributions be divergent?
Diagnostic plots seem to show that, say, $\mu$ has converged while $\sigma$ has not. In one example, $\sigma$ had zero standard deviation. I assumed this meant there was a divergent transition after the first proposal, so all subsequent proposals for $\sigma$ were rejected and $\sigma$ had only one unique value in the returned sample. But my sample for $\mu$ was (or seemed) fine. Shouldn’t the joint proposal $(\mu, \sigma)$ have been rejected after the first proposal, so that the $\mu$ sample would also contain only one unique value?
Now consider a model with just one parameter. Imagine that my initial proposal is far from the region(s) of high posterior density, and suppose the chain never reaches those regions. Is this when Stan returns the message “There were x divergent transitions after warmup.”, because it can see that the MCMC algorithm has not sampled “enough” from the posterior distribution of interest?
One reason I think this intuition is wrong is that I have run MCMC code in the past that did not converge (simply because it was not run for enough iterations), and I did not receive this warning message.
Can anyone give an intuitive interpretation of what is going on when a transition diverges? What happens if the trajectory eventually reaches the region of high posterior density? Is it possible to find where the MCMC sampler diverged, and then discard the samples up to the point where it is behaving well again?
If the Hamiltonian diverges, the positions along the simulated trajectory will never be accepted. If I have 214 divergent transitions, this must mean that the sampler entered a divergent phase, switched back to a non-divergent phase, and then diverged again; otherwise I should only have one divergent transition. This is difficult to understand, since according to the manual no proposals are accepted after a trajectory diverges, so once a trajectory diverges, shouldn’t it stay divergent until the end? Also, why does the MCMC code not just terminate once a divergent trajectory is found? How do x divergent transitions accumulate?
A divergent transition in Stan tells you that the region of the posterior distribution around that divergent transition is geometrically difficult to explore.
For example here is a quote from the manual:
The primary cause of divergent transitions in Euclidean HMC (other than bugs in the code) is highly varying posterior curvature, for which small step sizes are too inefficient in some regions and diverge in other regions. If the step size is too small, the sampler becomes inefficient and halts before making a U-turn (hits the maximum tree depth in NUTS); if the step size is too large, the Hamiltonian simulation diverges.
Basically it means that the Hamiltonian trajectory that Stan simulated has departed from the true trajectory it should be following. So the value of, say, the log density that it predicted it should have at a point in the parameter space differs from what it actually is at that point. When Stan detects this problem it knows something has gone wrong, rejects that transition, and basically “tries again”. This is demonstrated graphically here: https://dev.to/martinmodrak/taming-divergences-in-stan-models-5762
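To make this concrete, here is a minimal sketch (not Stan’s actual implementation) of how such a divergence can be detected: simulate leapfrog dynamics for a standard normal target and watch the error in the Hamiltonian. With a small step size the energy error stays tiny; past the integrator’s stability limit it explodes. The threshold used here is purely illustrative, not Stan’s actual cutoff.

```python
def leapfrog(q, p, eps, n_steps):
    # Target: standard normal, so U(q) = q**2 / 2 and grad U(q) = q.
    for _ in range(n_steps):
        p -= 0.5 * eps * q   # half step for momentum
        q += eps * p         # full step for position
        p -= 0.5 * eps * q   # half step for momentum
    return q, p

def hamiltonian(q, p):
    # Total energy: potential U(q) plus kinetic energy p**2 / 2.
    return 0.5 * q * q + 0.5 * p * p

q0, p0 = 1.0, 1.0
h0 = hamiltonian(q0, p0)

# Stable step size: the simulated trajectory tracks the true dynamics,
# so the energy error along it stays small.
q, p = leapfrog(q0, p0, eps=0.1, n_steps=50)
small_error = abs(hamiltonian(q, p) - h0)

# Step size beyond the stability limit (2 for this target): the
# trajectory blows up and the energy error grows geometrically.
q, p = leapfrog(q0, p0, eps=2.2, n_steps=50)
big_error = abs(hamiltonian(q, p) - h0)

THRESHOLD = 1000.0  # illustrative divergence cutoff, not Stan's
print(small_error, big_error > THRESHOLD)
```

When the energy error crosses such a threshold, the sampler flags the transition as divergent and falls back rather than trusting any point on that trajectory.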
The reason you see multiple divergent transitions is that once Stan has rejected a particular transition, it tries new ones, and those may or may not also diverge. And the reason the sampler can’t just stop at the first divergence is that divergences are not always a problem.
For example, if you fit a model and, say, 10 out of 10,000 transitions diverge, and those divergences are randomly scattered across the parameter space, then there is likely no problem. If, however, those 10 divergent transitions are concentrated in one part of the parameter space (or you have many more of them), then it is likely that Stan is not estimating your model parameters accurately, and you should consider reformulating your model. Basically, divergences are a guide to help you make your model better; the existence of a single one does not have to be fatal.
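As a sketch of that “scattered vs. concentrated” check, here is a toy example on synthetic draws. The draws and the funnel-like divergence pattern are fabricated for illustration; in practice both the draws and the per-iteration divergence flags come from the sampler’s diagnostic output.

```python
import random

random.seed(1)

# Synthetic stand-in for MCMC output: draws of a scale-like parameter
# (think log tau in a hierarchical model) plus a divergence flag per draw.
draws = [random.gauss(0.0, 1.5) for _ in range(10_000)]

# Mimic a funnel: divergences occur mostly where the parameter is small.
divergent = [d < -2.5 and random.random() < 0.5 for d in draws]

n_div = sum(divergent)
mean_all = sum(draws) / len(draws)
mean_div = sum(d for d, f in zip(draws, divergent) if f) / max(n_div, 1)

# If the divergent draws sit far from the bulk of the posterior, they are
# concentrated in one region -- a sign that the geometry there is
# pathological and the sampler may be missing mass in that region.
print(n_div, mean_all, mean_div)
```

If instead the divergent draws had roughly the same distribution as the rest of the sample, that would be consistent with the benign, scattered case described above.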
For example, page 46 of Betancourt’s Conceptual Introduction to Hamiltonian Monte Carlo (https://arxiv.org/pdf/1701.02434.pdf) shows how divergences can be localized to one part of the parameter space, so ignoring them, or stopping when you reach them, would at best bias your inference (because you would be excluding that challenging region).