The E-M procedure appears, to the uninitiated, as more or less black magic. Estimate parameters of an HMM (for example) using supervised data. Then decode untagged data, using forward-backward to ‘count’ events as if the data were tagged, more or less. Why does this make the model better? I do know something about the math, but I keep wishing for some sort of mental picture of it.

**Answer**

Just to save some typing, call the observed data X, the missing data Z (e.g. the hidden states of the HMM), and the parameter vector we’re trying to find Q (e.g. transition/emission probabilities).

The intuitive explanation is that we basically cheat, pretend for a moment we know Q so we can find a conditional distribution of Z that in turn lets us find the MLE for Q (ignoring for the moment the fact that we’re basically making a circular argument), then admit that we cheated, put in our new, better value for Q, and do it all over again until we don’t have to cheat anymore.

Slightly more technically, by pretending that we know the real value Q, we can pretend we know something about the conditional distribution of Z | {X, Q}, which lets us improve our estimate for Q, which we now pretend is the real value for Q so we can pretend we know something about the conditional distribution of Z | {X, Q}, which lets us improve our estimate for Q, which… and so on.

Even more technically, if we knew Z, we could maximize log f(Q | X, Z) and have the right answer. The problem is that we don’t know Z, and any estimate for Q must depend on it. But if we want to find the best estimate (or distribution) for Z, then we need to know X and Q. We’re stuck in a chicken-and-egg situation if we want the unique maximizer analytically.
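In symbols (my formatting, using the notation above), the complete-data problem we wish we could solve is

```latex
\hat{Q} \;=\; \arg\max_{Q} \,\log f(Q \mid X, Z),
```

but Z is unobserved, so this objective cannot even be evaluated, let alone maximized, without an estimate of Q to stand in for it.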

Our ‘out’ is that — for any estimate of Q (call it Q_n) — we can find the distribution of Z | {Q_n, X}, and so we can maximize our *expected* joint log-likelihood of Q | {X, Z}, with respect to the conditional distribution of Z | {Q_n, X}. This conditional distribution basically tells us how Z depends on the current value of Q given X, and lets us know how to change Q to increase our likelihood for both Q and Z at the same time for a particular value of Q (that we’ve called Q_n). Once we’ve picked out a new Q_{n+1}, we have a different conditional distribution for Z | {Q_{n+1}, X} and so have to re-calculate the expectation.
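The iteration described above — compute the distribution of Z | {Q_n, X}, then pick Q_{n+1} to maximize the expected log-likelihood under it — can be sketched on a toy problem: two coins of unknown bias, where each batch of tosses comes from one coin but we never observe which (my example, not from the answer; the function name, initial biases, and data are assumptions for illustration).

```python
def em_two_coins(trials, n_tosses, theta=(0.6, 0.5), n_iters=50):
    """EM for two coins of unknown biases (hypothetical toy example).

    X = `trials`, a list of head-counts, each from `n_tosses` flips of
    one of two coins; Z = which coin produced each trial (unobserved);
    Q = the pair of coin biases, initialised at `theta` (our Q_0).
    """
    theta_a, theta_b = theta
    for _ in range(n_iters):
        heads_a = heads_b = tosses_a = tosses_b = 0.0
        for h in trials:
            # E-step: conditional distribution of Z | {Q_n, X} via
            # Bernoulli likelihoods (the binomial coefficient cancels).
            lik_a = theta_a ** h * (1 - theta_a) ** (n_tosses - h)
            lik_b = theta_b ** h * (1 - theta_b) ** (n_tosses - h)
            w_a = lik_a / (lik_a + lik_b)  # P(coin A | this trial, Q_n)
            # Accumulate *expected* counts, as if the data were tagged.
            heads_a += w_a * h
            tosses_a += w_a * n_tosses
            heads_b += (1 - w_a) * h
            tosses_b += (1 - w_a) * n_tosses
        # M-step: maximize the expected complete-data log-likelihood;
        # for coins that is just expected heads / expected tosses.
        theta_a, theta_b = heads_a / tosses_a, heads_b / tosses_b
    return theta_a, theta_b
```

With data mostly from a heads-heavy coin and a tails-heavy coin, e.g. `em_two_coins([9, 8, 7, 2, 1, 3], 10)`, the estimates move from the initial guess (0.6, 0.5) toward roughly (0.8, 0.2): each pass re-tags the data softly, and the soft tags in turn improve the biases — exactly the ‘cheat, then admit it, then cheat again’ loop.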

**Attribution**
*Source: Link, Question Author: bmargulies, Answer Author: Glen_b*