Why is the EM algorithm well suited for exponential families?

I’ve been brushing up on the EM algorithm, and while I feel like I understand the basics, I keep seeing the claim made (e.g. here, here, among several others) that EM works particularly well for exponential families, and I can’t quite see why.

For clarity, let me state what I believe I understand, and I’ll return to what I do not understand at the end.

EM algorithm

Assume i.i.d. observed data $$X = X_1, \dots, X_n$$ and latent data $$Z = Z_1, \dots, Z_n$$ from some joint distribution $$p(X, Z \mid \theta)$$ parameterised by $$\theta$$. I’ll restrict this to discrete $$Z$$ to make it easier.

For an arbitrary distribution $$q(Z)$$, using Jensen’s inequality,

$$\begin{align*} \log p(X \mid \theta) &= \log \sum_{Z} p(X, Z \mid \theta) \\ &= \log \sum_{Z} q(Z) \frac{p(X, Z \mid \theta)}{q(Z)} \\ &\geq \sum_{Z} q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)} \\ &= \sum_{Z} q(Z) \log \frac{p(Z \mid X, \theta)}{q(Z)} + \sum_{Z} q(Z) \log p(X \mid \theta) \\ &= \log p(X \mid \theta) - D_{KL}(q(Z) \mid\mid p(Z \mid X, \theta)) \end{align*}$$

where $$D_{KL}$$ is the KL divergence, and so the bound is tight exactly when $$q$$ is equal to the posterior distribution of the latent variables, $$q(Z) = p(Z \mid X, \theta)$$. Therefore, if I use my current guess $$\theta_{\text{old}}$$ to calculate $$q_{\text{old}}(Z) = p(Z \mid X, \theta_{\text{old}})$$, and set

$$\begin{align*} \theta_{\text{new}} &= \text{argmax}_\theta \sum_{Z} q_{\text{old}}(Z) \log \frac{p(X, Z \mid \theta)}{q_{\text{old}}(Z)} \\ &= \text{argmax}_\theta \sum_{Z} q_{\text{old}}(Z) \log p(X, Z \mid \theta) \end{align*}$$

then I am guaranteed a non-decreasing log-likelihood, since

$$\begin{align*} \log p(X \mid \theta_{\text{new}}) &\geq \sum_{Z} q_{\text{old}}(Z) \log \frac{p(X, Z \mid \theta_{\text{new}})}{q_{\text{old}}(Z)} \\ &\geq \sum_{Z} q_{\text{old}}(Z) \log \frac{p(X, Z \mid \theta_{\text{old}})}{q_{\text{old}}(Z)} \\ &= \log p(X \mid \theta_{\text{old}}) \end{align*}$$

and I can repeat this process until some measure of convergence. So far, so good.
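As a sanity check on the monotonicity guarantee, here is a minimal numerical sketch (illustrative only; a two-component Gaussian mixture with known unit variances, all variable names my own):

```python
# Illustrative sketch: EM for a two-component Gaussian mixture with known
# unit variances, checking that the log-likelihood never decreases.
import math
import random

random.seed(0)
# Simulate: latent Z picks mean -2 or +2 with probability 1/2 each,
# then X | Z ~ N(mu_Z, 1).
data = [random.gauss(-2 if random.random() < 0.5 else 2, 1.0)
        for _ in range(200)]

def log_lik(mu1, mu2, w):
    total = 0.0
    for x in data:
        p1 = w * math.exp(-0.5 * (x - mu1) ** 2) / math.sqrt(2 * math.pi)
        p2 = (1 - w) * math.exp(-0.5 * (x - mu2) ** 2) / math.sqrt(2 * math.pi)
        total += math.log(p1 + p2)
    return total

mu1, mu2, w = -1.0, 1.0, 0.5   # deliberately poor initial guess
lls = []
for _ in range(30):
    lls.append(log_lik(mu1, mu2, w))
    # E-step: q_old(Z_i) = p(Z_i | x_i, theta_old), the responsibilities.
    g = []
    for x in data:
        p1 = w * math.exp(-0.5 * (x - mu1) ** 2)
        p2 = (1 - w) * math.exp(-0.5 * (x - mu2) ** 2)
        g.append(p1 / (p1 + p2))
    # M-step: maximise the expected complete-data log-likelihood
    # (closed form: responsibility-weighted averages).
    s1 = sum(g)
    s2 = len(data) - s1
    mu1 = sum(gi * x for gi, x in zip(g, data)) / s1
    mu2 = sum((1 - gi) * x for gi, x in zip(g, data)) / s2
    w = s1 / len(data)

# The bound argument above in action: non-decreasing up to floating point.
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```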

Exponential families

Now, if the joint distribution $$p(X, Z \mid \theta) = \exp(T(X, Z)^T\theta - A(\theta) + B(X, Z))$$ is in canonical exponential family form, I see that we can do as follows:

$$\begin{align*} \theta_{\text{new}} &= \text{argmax}_\theta \sum_{Z} q_{\text{old}}(Z) \log p(X, Z \mid \theta) \\ &= \text{argmax}_\theta \sum_{Z} q_{\text{old}}(Z) (T(X, Z)^T\theta - A(\theta)) \\ &= \text{argmax}_\theta \, \mathrm{E}_{q_{\text{old}}(Z)}[T(X, Z)]^T\theta - A(\theta) \end{align*}$$

Now if we take the derivative with respect to $$\theta$$ and set it to zero, using the standard identity $$\nabla A(\theta) = \mathrm{E}_\theta[T(X, Z)]$$, we get

$$\begin{align*} 0 &= \frac{\partial}{\partial \theta} \left( \mathrm{E}_{q_{\text{old}}(Z)}[T(X, Z)]^T\theta - A(\theta) \right) \\ &= \mathrm{E}_{q_{\text{old}}(Z)}[T(X, Z)] - \mathrm{E}_\theta[T(X, Z)] \end{align*}$$

which we can solve for $$\theta$$. I gather from the reading I’ve been doing that this is important, and I vaguely see that turning the M-step into a function of the conditional expectation of the sufficient statistic is nice. However, I can’t concretely see why this simplifies the update. Moreover, I cannot find any concrete derivations of a specific EM algorithm (Gaussian mixtures, coin flips, etc.) that appear to put this to use.

So my question is: Why is the EM algorithm particularly well suited for exponential family distributions? If possible, I think seeing an example of where this is used might be helpful.

Answer

I’m not sure that the method necessarily works any more effectively for exponential families (though I’m open to being convinced to the contrary). I think more likely what is meant here is that the method is simpler to apply to exponential families since the maximisation step leads to a relatively simple form. You are essentially already seeing the advantage here; you just aren’t comparing it with anything to see how much nicer it is to have this form instead of groping around in the darkness with functions of unspecified form.
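To spell out what that form buys, restating the question’s own derivation in one line: for a canonical exponential family the M-step reduces to moment matching,

$$\nabla A(\theta_{\text{new}}) = \mathrm{E}_{q_{\text{old}}(Z)}[T(X, Z)],$$

that is, "set the model’s expected sufficient statistic equal to the expected sufficient statistic computed in the E-step". Since $$A$$ is convex (and $$\nabla A$$ is injective for a minimal family), this has a unique solution, and for the standard families it can be inverted analytically, so the M-step is a closed-form formula rather than an optimisation problem.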

If you try to apply the method to distributions outside the exponential family you will find that you have to proceed ad hoc for the particular functional form at issue, instead of skipping right to using a sufficient statistic. Depending on the complexity of the distribution you are using you might get a reasonable maximising step or you might get a nasty one. In the worst case scenario, the form of the function to be maximised will be complicated enough that you might need to do difficult numerical computations or even a grid search.

Attribution
Source : Link , Question Author : MSR , Answer Author : Ben