E-M, is there an intuitive explanation?

The E-M procedure appears, to the uninitiated, as more or less black magic: estimate the parameters of an HMM (for example) from supervised data, then decode untagged data, using forward-backward to ‘count’ events as if the data were tagged. Why does this make the model better? I do know something about the math, but I keep wishing for some sort of mental picture of it.

Just to save some typing, call the observed data $X$, the missing data $Z$ (e.g. the hidden states of the HMM), and the parameter vector we’re trying to find $Q$ (e.g. transition/emission probabilities).
The intuitive explanation is that we basically cheat: pretend for a moment that we know $Q$, so we can find a conditional distribution of $Z$ that in turn lets us find the MLE for $Q$ (ignoring for the moment the fact that we’re making a circular argument); then admit that we cheated, put in our new, better value for $Q$, and do it all over again until we don’t have to cheat anymore.
Slightly more technically, by pretending that we know the real value of $Q$, we can pretend we know something about the conditional distribution of $Z|\{X,Q\}$, which lets us improve our estimate for $Q$, which we now pretend is the real value of $Q$ so we can pretend we know something about the conditional distribution of $Z|\{X,Q\}$, which lets us improve our estimate for $Q$, which… and so on.
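To make the loop concrete, here is a minimal sketch in Python (a toy example of my own, not from the post; the question is about HMMs, but a two-component 1-D Gaussian mixture with unknown means shows the same loop in a few lines). Here $Z$ is each point’s component label and $Q$ is the pair of means: the ‘pretend we know $Q$’ step is the soft assignment of points to components, and the ‘find the MLE’ step re-estimates the means from those soft assignments.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two Gaussian clusters with unit variance, true means -2 and 3.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])  # initial (wrong) guess for Q: the two means
for step in range(100):
    # E-step: "cheat" -- treat mu as correct and compute the conditional
    # distribution of Z: resp[k, i] = P(point i came from component k | x, mu),
    # assuming equal mixing weights and unit variances for simplicity.
    log_p = -0.5 * (x[None, :] - mu[:, None]) ** 2
    resp = np.exp(log_p - log_p.max(axis=0))
    resp /= resp.sum(axis=0)

    # M-step: find the MLE for Q as if Z had been observed (fractionally).
    new_mu = (resp * x).sum(axis=1) / resp.sum(axis=1)

    # "Until we don't have to cheat anymore": stop when Q no longer moves.
    if np.allclose(new_mu, mu):
        break
    mu = new_mu

print(mu)  # ends up near (-2, 3)
```

Forward-backward in the HMM case plays exactly the role of the soft-assignment step here: it produces the expected (fractional) counts of transitions and emissions under the current parameters.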
Even more technically, if we knew $Z$, we could maximize the complete-data log-likelihood $\log f(Q|X,Z)$ and have the right answer. The problem is that we don’t know $Z$, and any estimate for $Q$ must depend on it. But if we want to find the best estimate (or distribution) for $Z$, then we need to know $X$ and $Q$. We’re stuck in a chicken-and-egg situation if we want the unique maximizer analytically.
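In the HMM case from the question this is easy to see: with initial-state probabilities $\pi_i$, transition probabilities $a_{ij}$, and emission probabilities $b_j(x)$ (my notation, not the post’s), knowing the state sequence $Z = z_1, \dots, z_T$ turns the complete-data log-likelihood into a sum of one term per event,

$$\log f(Q|X,Z) = \log \pi_{z_1} + \sum_{t=2}^{T} \log a_{z_{t-1} z_t} + \sum_{t=1}^{T} \log b_{z_t}(x_t),$$

whose maximizer is just normalized counts, which is exactly what supervised training computes.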
Our ‘out’ is that, for any estimate of $Q$ (call it $Q_n$), we can find the conditional distribution of $Z|\{Q_n,X\}$, and so we can maximize the expected complete-data log-likelihood of $Q|\{X,Z\}$, where the expectation is taken with respect to that conditional distribution of $Z|\{Q_n,X\}$. This conditional distribution tells us how $Z$ depends on $X$ under the current value of $Q$, and it tells us how to change $Q$ to increase the expected likelihood, starting from the particular value we’ve called $Q_n$. Once we’ve picked out a new $Q_{n+1}$, we have a different conditional distribution for $Z|\{Q_{n+1}, X\}$ and so have to re-calculate the expectation.
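Written out (this display is my addition; I use $\mathcal{Q}$ for the standard EM objective to avoid clashing with the parameter $Q$ above), one full iteration is

$$\mathcal{Q}(Q \mid Q_n) = \mathbb{E}_{Z|\{Q_n,X\}}\big[\log f(Q|X,Z)\big], \qquad Q_{n+1} = \arg\max_{Q} \mathcal{Q}(Q \mid Q_n).$$

The forward-backward ‘fractional counts’ from the question are exactly this expectation, evaluated for the HMM’s transition and emission events.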