Rao-Blackwellization in variational inference

Section 3.1 of the Black Box VI paper introduces Rao-Blackwellization as a way to reduce the variance of the score-function gradient estimator.

However, I don't quite get the basic idea behind those formulas. Any hints or help would be appreciated!


To make this question more self-contained, I’ll try to put in more details (also some thoughts of my own).

Suppose I have a 2-D Gaussian dataset $X \sim N(\mu, P^{-1})$, where the mean is known to be $\mu = (0,0)$ but the precision matrix $P$ is unknown. I want to estimate $P$ using variational inference, which means finding a variational distribution $q(P)$ that approximates the true (unknown) posterior $p(P|X)$ by minimizing the KL divergence $kl(q\|p)$. Minimizing this KL divergence is equivalent to maximizing a proxy objective, the ELBO, which is
$$L_{ELBO} = E_{q(P)}[\log p(X,P) - \log q(P)]$$
and in my problem we have

$$\begin{aligned}
p(X|P) &\sim N(0,P^{-1}) && \text{likelihood as Gaussian} \\
p(P) &\sim W(d_0,S_0) && \text{prior for } P \text{ as Wishart} \\
q(P) &\sim W(d,S) && \text{variational distribution for } P \text{ as Wishart}
\end{aligned}$$

Now the problem comes down to maximizing $L_{ELBO}$ to find the best variational parameters of $q(P)$, i.e. $d, S$.

We compute the gradient of $L_{ELBO}$ w.r.t. $d$ and $S$ so that we can run gradient ascent on it. The general formula for the gradient of the ELBO w.r.t. the variational parameters is
$$\nabla_{\lambda}L = E_{q}[\nabla_{\lambda}\log q(P|\lambda)\cdot(\log p(X,P)-\log q(P|\lambda))]$$
where $\lambda$ is shorthand for the variational parameters (here $d$ and $S$).
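
For completeness, this identity can be derived in a few lines, assuming the gradient and the integral over $P$ can be interchanged and using $\nabla_{\lambda} q = q\,\nabla_{\lambda}\log q$:
$$\begin{aligned}
\nabla_{\lambda}L &= \nabla_{\lambda}\int q(P|\lambda)\,[\log p(X,P)-\log q(P|\lambda)]\,dP \\
&= \int \nabla_{\lambda}q(P|\lambda)\,[\log p(X,P)-\log q(P|\lambda)]\,dP - \int q(P|\lambda)\,\nabla_{\lambda}\log q(P|\lambda)\,dP \\
&= E_{q}[\nabla_{\lambda}\log q(P|\lambda)\cdot(\log p(X,P)-\log q(P|\lambda))] - \nabla_{\lambda}\int q(P|\lambda)\,dP
\end{aligned}$$
The last term vanishes because $\int q(P|\lambda)\,dP = 1$ for every $\lambda$, which leaves the score-function form of the gradient.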

Given this gradient formula, we iteratively draw samples $P_i$ from $q(P|\lambda)$, evaluate the term inside the expectation for each sample, and average the results as a noisy estimate of the true gradient. We then apply a gradient ascent step to the variational parameters and repeat until convergence. That is,
$$\nabla_{\lambda}L \approx \frac{1}{n\_sample} \sum_{i=1}^{n\_sample} [\nabla_{\lambda}\log q(P_i|\lambda)\cdot(\log p(X,P_i)-\log q(P_i|\lambda))]$$
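
To make the estimator concrete, here is a minimal Python sketch for the Gaussian/Wishart setup described in this question. The function names (`log_joint`, `score`, `elbo_grad_estimate`) are hypothetical, and the sketch assumes NumPy and SciPy are available. The score with respect to $S$ treats $S$ as an unconstrained matrix, which is a simplification.

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import wishart


def log_joint(X, P, d0, S0):
    """log p(X, P) = log N(X | 0, P^{-1}) + log W(P | d0, S0)."""
    n, p = X.shape
    _, logdet_P = np.linalg.slogdet(P)
    quad = np.einsum("ni,ij,nj->", X, P, X)  # sum over rows of x^T P x
    log_lik = 0.5 * n * logdet_P - 0.5 * quad - 0.5 * n * p * np.log(2 * np.pi)
    return log_lik + wishart.logpdf(P, df=d0, scale=S0)


def score(P, d, S):
    """Gradient of log W(P | d, S) w.r.t. d and S.

    Assumption: S is treated as an unconstrained matrix (symmetry ignored).
    """
    p = S.shape[0]
    S_inv = np.linalg.inv(S)
    _, logdet_P = np.linalg.slogdet(P)
    _, logdet_S = np.linalg.slogdet(S)
    # d/dd log Gamma_p(d/2) = 0.5 * sum_{i=0}^{p-1} digamma((d - i) / 2)
    mv_digamma = digamma(0.5 * (d - np.arange(p))).sum()
    grad_d = 0.5 * (logdet_P - p * np.log(2.0) - logdet_S - mv_digamma)
    grad_S = 0.5 * (S_inv @ P @ S_inv - d * S_inv)
    return grad_d, grad_S


def elbo_grad_estimate(X, d, S, d0, S0, n_sample=100, seed=None):
    """Monte Carlo score-function estimate of the ELBO gradient w.r.t. (d, S)."""
    samples = wishart.rvs(df=d, scale=S, size=n_sample, random_state=seed)
    samples = np.reshape(samples, (n_sample,) + S.shape)
    grad_d, grad_S = 0.0, np.zeros(S.shape)
    for P in samples:
        # log p(X, P_i) - log q(P_i | lambda)
        weight = log_joint(X, P, d0, S0) - wishart.logpdf(P, df=d, scale=S)
        g_d, g_S = score(P, d, S)
        grad_d += weight * g_d / n_sample
        grad_S += weight * g_S / n_sample
    return grad_d, grad_S
```

A gradient ascent loop would call `elbo_grad_estimate` repeatedly and step `d` and `S` along the returned gradients. In practice one would probably need a reparameterization (for example a Cholesky factor of `S`) to keep `S` positive definite and `d` larger than $p - 1$; that part is left out here.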

This particular noisy estimate can have high variance, which finally brings me to my question. As I read in the paper, Rao-Blackwellization can be used when we have multiple latent variables, but here I have just one (i.e. $P$). How do we use Rao-Blackwellization to reduce the variance in this case?

Please help, and correct me if anything is wrong!
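
For reference, a rough sketch of the Rao-Blackwellized gradient, assuming a mean-field family $q(z|\lambda) = \prod_i q(z_i|\lambda_i)$ over several latent variables $z_1,\dots,z_m$ (this notation is an assumption for illustration and may differ from the paper's): the gradient for each component involves only the terms of the joint that depend on $z_i$,
$$\nabla_{\lambda_i}L = E_{q_{(i)}}[\nabla_{\lambda_i}\log q(z_i|\lambda_i)\cdot(\log p_i(X,z_{(i)})-\log q(z_i|\lambda_i))]$$
where $z_{(i)}$ denotes $z_i$ together with the variables in its Markov blanket, $q_{(i)}$ is their variational distribution, and $p_i$ collects the factors of $p(X,z)$ that involve $z_i$.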


Source : Link , Question Author : avocado , Answer Author : Community
