I’ve been brushing up on the EM algorithm, and while I feel like I understand the basics, I keep seeing the claim made (e.g. here, here, among several others) that EM works particularly well for exponential families, and I can’t quite see why.
For clarity, let me state what I believe I understand, and I’ll return to what I do not understand at the end.
Assume i.i.d. observed $X = (X_1, \dots, X_n)$ and latent data $Z = (Z_1, \dots, Z_n)$ from some joint distribution $p(X, Z \mid \theta)$ parameterised by $\theta$. I’ll restrict this to discrete $Z$ to make it easier.
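For an arbitrary distribution $q(Z)$, using Jensen’s inequality,
$$\log p(X \mid \theta) = \log \sum_Z q(Z) \frac{p(X, Z \mid \theta)}{q(Z)} \ge \sum_Z q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)} = \log p(X \mid \theta) - D_{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big),$$
where $D_{KL}$ is the KL divergence, and so the bound is tight exactly when $q$ is equal to the posterior distribution of the latent variable, $q(Z) = p(Z \mid X, \theta)$. Therefore, if I use my current guess $\theta_{old}$ to calculate $q_{old}(Z) = p(Z \mid X, \theta_{old})$, and set
$$\theta_{new} = \arg\max_{\theta} \sum_Z q_{old}(Z) \log p(X, Z \mid \theta),$$
then I am guaranteed a non-decreasing log-likelihood, since
$$\log p(X \mid \theta_{new}) \ge \sum_Z q_{old}(Z) \log \frac{p(X, Z \mid \theta_{new})}{q_{old}(Z)} \ge \sum_Z q_{old}(Z) \log \frac{p(X, Z \mid \theta_{old})}{q_{old}(Z)} = \log p(X \mid \theta_{old}),$$
and I can repeat this process until some measure of convergence. So far, so good.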
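As a quick numerical sanity check of the bound described above, here is a minimal Python sketch. It assumes a toy model that is purely my own illustration: a single observation from a two-coin mixture with equal, fixed mixing weights, where $x$ is the number of heads in 10 flips. The function names (`log_joint`, `lower_bound`, and so on) are hypothetical. It checks that an arbitrary $q$ gives a strict lower bound and that the posterior closes the gap.

```python
import math

N_FLIPS = 10  # assumed number of flips per observation (toy choice)

def log_joint(x, z, theta):
    """log p(x, z | theta): coin z in {0, 1} is picked with probability 0.5,
    then x heads are observed in N_FLIPS flips with bias theta[z]."""
    p = theta[z]
    log_binom = math.log(math.comb(N_FLIPS, x))
    return (math.log(0.5) + log_binom
            + x * math.log(p) + (N_FLIPS - x) * math.log(1 - p))

def log_marginal(x, theta):
    """log p(x | theta), summing the joint over the discrete latent z."""
    return math.log(sum(math.exp(log_joint(x, z, theta)) for z in (0, 1)))

def posterior(x, theta):
    """p(z | x, theta) as a list [p(z=0 | x), p(z=1 | x)]."""
    w = [math.exp(log_joint(x, z, theta)) for z in (0, 1)]
    total = sum(w)
    return [wi / total for wi in w]

def lower_bound(x, theta, q):
    """Jensen lower bound: sum_z q(z) * log(p(x, z | theta) / q(z))."""
    return sum(q[z] * (log_joint(x, z, theta) - math.log(q[z]))
               for z in (0, 1) if q[z] > 0)

x, theta = 7, (0.4, 0.8)
ll = log_marginal(x, theta)
gap_arbitrary = ll - lower_bound(x, theta, [0.5, 0.5])             # equals KL(q || posterior)
gap_posterior = ll - lower_bound(x, theta, posterior(x, theta))    # should vanish
print(gap_arbitrary, gap_posterior)
```

The first gap is the KL divergence between the uniform $q$ and the posterior, so it is positive here; the second is zero up to floating-point error.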
Now, if the joint distribution $p(X, Z \mid \theta) = \exp\left(T(X, Z)^\top \theta - A(\theta) + B(X, Z)\right)$ is in canonical exponential family form, I see that we can proceed as follows:
$$\sum_Z q_{old}(Z) \log p(X, Z \mid \theta) = \mathbb{E}_{q_{old}}[T(X, Z)]^\top \theta - A(\theta) + \mathbb{E}_{q_{old}}[B(X, Z)].$$
Now if we take the derivative with respect to $\theta$ and set it to zero, we get
$$\nabla_\theta A(\theta) = \mathbb{E}_{q_{old}}[T(X, Z)],$$
which we can solve for $\theta$. I gather from the reading I’ve been doing that this is important, and I vaguely see that turning the M-step into a function of the conditional expectation of the sufficient statistic is nice. However, I can’t concretely see why this simplifies the update. Moreover, I cannot find any concrete derivations of a specific EM algorithm (Gaussian mixtures, coin flips, etc.) that appear to put this to use.
So my question is: Why is the EM algorithm particularly well suited for exponential family distributions? If possible, I think seeing an example of where this is used might be helpful.
I’m not sure that the method necessarily works any more effectively for exponential families (though I’m open to being convinced to the contrary). I think more likely what is meant here is that the method is simpler to apply to exponential families since the maximisation step leads to a relatively simple form. You are essentially already seeing the advantage here; you just aren’t comparing it with anything to see how much nicer it is to have this form instead of groping around in the darkness with functions of unspecified form.
If you try to apply the method to distributions outside the exponential family, you will find that you have to proceed ad hoc for the particular functional form at issue, instead of skipping right to using a sufficient statistic. Depending on the complexity of the distribution you are using, you might get a reasonable maximising step or you might get a nasty one. In the worst-case scenario, the function to be maximised will be complicated enough that you need difficult numerical computations or even a grid search.
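To make the comparison concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture, written so that the role of the sufficient statistics is visible. It is an illustration under my own assumptions rather than a reference implementation: the function name `em_gmm_1d`, the simulated data, and the starting values are all made up for the example. The complete-data model is an exponential family whose sufficient statistics are the indicator of each component, together with that indicator times $x$ and times $x^2$. The E-step replaces each indicator with its posterior expectation (the responsibility), and the M-step is then the same closed-form moment matching you would use if the labels were observed.

```python
import numpy as np

def em_gmm_1d(x, mu, var, pi, n_iter=50):
    """EM for a 1-D Gaussian mixture via expected sufficient statistics.

    Returns the final (mu, var, pi) and the observed-data log-likelihood
    evaluated at the parameters held at the start of each iteration.
    """
    x = np.asarray(x, dtype=float)
    mu, var, pi = (np.asarray(a, dtype=float) for a in (mu, var, pi))
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(z_i = k | x_i, theta_old),
        # computed in log space for numerical stability.
        log_joint = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                     - 0.5 * (x[:, None] - mu) ** 2 / var)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        r = np.exp(log_joint - log_norm[:, None])
        log_liks.append(log_norm.sum())

        # Expected sufficient statistics per component:
        # E[1{z=k}], E[1{z=k} x], E[1{z=k} x^2], each summed over the data.
        n_k = r.sum(axis=0)
        s1 = r.T @ x
        s2 = r.T @ x ** 2

        # M-step: closed-form moment matching, no numerical optimiser needed.
        pi = n_k / x.size
        mu = s1 / n_k
        var = s2 / n_k - mu ** 2
    return mu, var, pi, log_liks

# Simulated data (assumed for illustration): two well-separated clusters.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])
mu_hat, var_hat, pi_hat, log_liks = em_gmm_1d(
    data, mu=[-1.0, 1.0], var=[1.0, 1.0], pi=[0.5, 0.5])
print(mu_hat, var_hat, pi_hat)
```

The whole M-step is three lines of arithmetic on `n_k`, `s1` and `s2`. For a mixture of distributions outside the exponential family there is generally no such fixed-size summary, and the M-step typically needs an inner numerical optimisation.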