I was going through the Stan documentation which can be downloaded from here. I was particularly interested in their implementation of the Gelman-Rubin diagnostic. The original paper Gelman & Rubin (1992) define the the potential scale reduction factor (PSRF) as follows:
Let Xi,1,…,Xi,N be the ith Markov chain sampled, and let there be overall M independent chains sampled. Let ˉXi⋅ be the mean from the ith chain, and ˉX⋅⋅ be the overall mean. Define,
And define B
The PSRF is estimate with √ˆR where
The Stan documentation on page 349 ignores the term with df and also removes the (M+1)/M multiplicative term. This is their formula,
The variance estimator is
Finally, the potential scale reduction statistic is defined by
From what I could see, they don’t provide a reference for this change of formula, and neither do they discuss it. Usually M is not too large, and can often be as low so as 2, so (M+1)/M should not be ignored, even if the df term can be approximated with 1.
So where does this formula come from?
I have found a partial answer to the question “where does this formula come from?“, in that the Bayesian Data Analysis book by Gelman, Carlin, Stern, and Rubin (Second edition) has exactly the same formula. However, the book does not explain how/why it is justifiable to ignore those terms?
I followed the specific link given for Gelman & Rubin (1992) and it has
as in the later versions, although ˆσ replaced with ˆσ+ in Brooks & Gelman (1998) and with ^var+ in BDA2 (Gelman et al, 2003) and BDA3 (Gelman et al, 2013).
BDA2 and BDA3 (couldn’t check now BDA1) have an exercise with hints to show that ^var+ is unbiased estimate of the desired quantity.
Gelman & Brooks (1998) has equation 1.1
which can be rearranged as
We can see that the effect of second and third term are negligible for decision making when n is large. See also the discussion in the paragraph before Section 3.1 in Brooks & Gelman (1998).
Gelman & Rubin (1992) also had the term with df as df/(df-2). Brooks & Gelman (1998) have a section describing why this df corretion is incorrect and define (df+3)/(df+1). The paragraph before Section 3.1 in Brooks & Gelman (1998) explains why (d+3)/(d+1) can be dropped.
It seems your source for the equations was something post Brooks & Gelman (1998) as you had (d+3)/(d+1) there and Gelman & Rubin (1992) had df/df(-2). Otherwise Gelman & Rubin (1992) and Brooks & Gelman (1998) have equivalent equations (with slightly different notations and some terms are arranged differently). BDA2 (Gelman, et al., 2003) doesn’t have anymore terms ˆσ+Wm−n−1mn. BDA3 (Gelman et al., 2003) and Stan introduced split chains version.
My interpretation of the papers and experiences using different versions of ˆR is that the terms which have been eventually dropped can be ignored when n is large, even when m is not. I also vaguely remember discussing this with Andrew Gelman years ago, but if you want to be certain of the history, you should ask him.
Usually M is not too large, and can often be as low so as 2
I really do hope that this is not often the case. In cases where you want to use split-ˆR convergence diagnostic, you should use at least 4 chains split and thus have M=8. You may use less chains, if you already know that in your specific cases the convergence and mixing is fast.
- Brooks and Gelman (1998). Journal of Computational and Graphical Statistics, 7(4)434-455.