# Why is the EM algorithm well suited for exponential families?

I’ve been brushing up on the EM algorithm, and while I feel like I understand the basics, I keep seeing the claim made (e.g. here, here, among several others) that EM works particularly well for exponential families, and I can’t quite see why.

For clarity, let me state what I believe I understand, and I'll return to what I do not understand at the end.

## EM algorithm

Assume i.i.d. observed data $$X = X_1, \dots, X_n$$ and latent data $$Z = Z_1, \dots, Z_n$$ from some joint distribution $$p(X, Z \mid \theta)$$ parameterised by $$\theta$$. I'll restrict this to discrete $$Z$$ to make it easier.

For an arbitrary distribution $$q(Z)$$, using Jensen's inequality,

$$\begin{align*} \log p(X \mid \theta) &= \log \sum_{Z} p(X, Z \mid \theta) \\ &= \log \sum_{Z} q(Z) \frac{p(X, Z \mid \theta)}{q(Z)} \\ &\geq \sum_{Z} q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)} \\ &= \sum_{Z} q(Z) \log \frac{p(Z \mid X, \theta)}{q(Z)} + \sum_{Z} q(Z) \log p(X \mid \theta) \\ &= \log p(X \mid \theta) - D_{KL}(q(Z) \mid\mid p(Z \mid X, \theta)) \end{align*}$$

where $$D_{KL}$$ is the KL divergence, and so the bound is tight exactly when $$q$$ is equal to the posterior distribution of the latent variables, $$q(Z) = p(Z \mid X, \theta)$$. Therefore, if I use my current guess $$\theta_{\text{old}}$$ to calculate $$q_{\text{old}}(Z) = p(Z \mid X, \theta_{\text{old}})$$, and set

$$\begin{align*} \theta_{\text{new}} &= \text{argmax}_\theta \sum_{Z} q_{\text{old}}(Z) \log \frac{p(X, Z \mid \theta)}{q_{\text{old}}(Z)} \\ &= \text{argmax}_\theta \sum_{Z} q_{\text{old}}(Z) \log p(X, Z \mid \theta) \end{align*}$$

then I am guaranteed a non-decreasing log-likelihood, since

$$\begin{align*} \log p(X \mid \theta_{\text{new}}) &\geq \sum_{Z} q_{\text{old}}(Z) \log \frac{p(X, Z \mid \theta_{\text{new}})}{q_{\text{old}}(Z)} \\ &\geq \sum_{Z} q_{\text{old}}(Z) \log \frac{p(X, Z \mid \theta_{\text{old}})}{q_{\text{old}}(Z)} \\ &= \log p(X \mid \theta_{\text{old}}) \end{align*}$$

and I can repeat this process until some measure of convergence. So far, so good.
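To convince myself that this ascent guarantee actually holds, I wrote a small numerical sketch. The toy model (a two-coin mixture with the mixing weights fixed at 1/2, and all variable names my own invention) is just for illustration, not taken from any reference:

```python
import math
import random

# Toy setup: each of n_groups batches of coin flips comes from one of two
# coins with unknown biases. The coin identity Z_i is latent.
random.seed(0)
true_biases = [0.8, 0.3]
flips_per_group, n_groups = 20, 50
data = []  # observed number of heads in each group
for _ in range(n_groups):
    p = random.choice(true_biases)
    data.append(sum(random.random() < p for _ in range(flips_per_group)))

def log_binom(k, n, p):
    """Binomial log-pmf via lgamma (avoids overflow for large n)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log_likelihood(theta):
    """Observed-data log-likelihood, marginalising the latent coin identity."""
    pA, pB = theta
    total = 0.0
    for k in data:
        la = math.log(0.5) + log_binom(k, flips_per_group, pA)
        lb = math.log(0.5) + log_binom(k, flips_per_group, pB)
        m = max(la, lb)
        total += m + math.log(math.exp(la - m) + math.exp(lb - m))
    return total

theta = (0.6, 0.5)  # initial guess
lls = [log_likelihood(theta)]
for _ in range(30):
    # E-step: q_old(Z_i) = p(Z_i | X_i, theta_old); equal priors cancel.
    resp = []
    for k in data:
        la = log_binom(k, flips_per_group, theta[0])
        lb = log_binom(k, flips_per_group, theta[1])
        resp.append(1.0 / (1.0 + math.exp(lb - la)))  # posterior prob of coin A
    # M-step: weighted binomial MLE, i.e. expected heads / expected flips.
    headsA = sum(r * k for r, k in zip(resp, data))
    totalA = sum(r * flips_per_group for r in resp)
    headsB = sum((1 - r) * k for r, k in zip(resp, data))
    totalB = sum((1 - r) * flips_per_group for r in resp)
    theta = (headsA / totalA, headsB / totalB)
    lls.append(log_likelihood(theta))

# The guaranteed-ascent property derived above:
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

Running this, the observed-data log-likelihood is indeed non-decreasing at every iteration, which matches the derivation.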

## Exponential families

Now, if the joint distribution $$p(X, Z \mid \theta) = \exp(T(X, Z)^T\theta - A(\theta) + B(X, Z))$$ is in canonical exponential family form, I see that we can do as follows:

$$\begin{align*} \theta_{\text{new}} &= \text{argmax}_\theta \sum_{Z} q_{\text{old}}(Z) \log p(X, Z \mid \theta) \\ &= \text{argmax}_\theta \sum_{Z} q_{\text{old}}(Z) \left(T(X, Z)^T\theta - A(\theta)\right) \\ &= \text{argmax}_\theta \, \mathrm{E}_{q_{\text{old}}(Z)}[T(X, Z)]^T\theta - A(\theta) \end{align*}$$

Since $$A$$ is convex, this objective is concave in $$\theta$$, so if we take the derivative with respect to $$\theta$$ and set it to zero, we characterise the maximiser:

$$\begin{align*} 0 &= \frac{\partial}{\partial \theta} \left( \mathrm{E}_{q_{\text{old}}(Z)}[T(X, Z)]^T\theta - A(\theta) \right) \\ &= \mathrm{E}_{q_{\text{old}}(Z)}[T(X, Z)] - \mathrm{E}_{p(X, Z \mid \theta)}[T(X, Z)] \end{align*}$$
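For the derivative of $$A$$ I am using the standard exponential-family identity, obtained by differentiating the definition of the log-partition function $$A(\theta) = \log \sum_{X, Z} \exp(T(X, Z)^T\theta + B(X, Z))$$:

$$\frac{\partial A}{\partial \theta} = \frac{\sum_{X, Z} T(X, Z) \exp(T(X, Z)^T\theta + B(X, Z))}{\sum_{X, Z} \exp(T(X, Z)^T\theta + B(X, Z))} = \mathrm{E}_{p(X, Z \mid \theta)}[T(X, Z)]$$

so the stationarity condition is a moment-matching condition: choose $$\theta$$ so that the model's expected sufficient statistic matches its expectation under $$q_{\text{old}}$$.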

which we can solve for $$\theta$$. I gather from the reading I've been doing that this is important, and I vaguely see that turning the M-step into a function of the conditional expectation of the sufficient statistic is nice. However, I can't concretely see why this simplifies the update. Moreover, I can't find any concrete derivations of a specific EM algorithm (Gaussian mixtures, coin flips, etc.) that appear to put this to use.
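To make the question concrete, here is my own attempt at the one-dimensional Gaussian mixture case, which I believe follows the recipe above (please correct me if this is where I go wrong). The complete-data log-likelihood is linear in the indicators $$\mathbb{1}[Z_i = k]$$, so the E-step replaces them with responsibilities $$r_{ik} = p(Z_i = k \mid x_i, \theta_{\text{old}})$$, and the M-step then only needs three expected sufficient statistics per component,

$$N_k = \sum_i r_{ik}, \qquad S_k = \sum_i r_{ik} x_i, \qquad Q_k = \sum_i r_{ik} x_i^2,$$

giving the closed-form updates

$$\pi_k = \frac{N_k}{n}, \qquad \mu_k = \frac{S_k}{N_k}, \qquad \sigma_k^2 = \frac{Q_k}{N_k} - \mu_k^2.$$

If I understand correctly, the claimed "simplification" is that the argmax is solved in closed form by plugging expected sufficient statistics into the usual complete-data MLE — but is that the whole story?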

So my question is: Why is the EM algorithm particularly well suited for exponential family distributions? If possible, I think seeing an example of where this is used might be helpful.