# What is the Method of Moments and how is it different from MLE?

In general, it seems like the method of moments just matches the observed sample mean or variance to the theoretical moments to get parameter estimates. I gather this often gives the same result as MLE for exponential families.

However, it’s hard to find a clear definition of the method of moments and a clear discussion of why the MLE seems to be generally favored, even though it can be trickier to find the mode of the likelihood function.

The question "Is MLE more efficient than Moment method?" has a quote from Prof. Donald Rubin (at Harvard) saying that everyone has known since the 40s that MLE beats MoM, but I’d be interested to know the history or reasoning for this.

What is the method of moments?

https://en.m.wikipedia.org/wiki/Method_of_moments_(statistics)

It means that you estimate the population parameters by choosing them such that the population distribution has moments equal to the moments observed in the sample.
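Stated a bit more formally (a sketch only; the symbols $$\theta$$, $$k$$ and $$j$$ are notation introduced here for illustration): for a model with $$k$$ unknown parameters $$\theta = (\theta_1, \dots, \theta_k)$$ and a sample $$x_1, \dots, x_n$$, one solves the system

$$E_\theta [X^j] = \frac{1}{n} \sum_{i=1}^{n} x_i^j, \qquad j = 1, \dots, k$$

for $$\theta$$. For a Poisson distribution with rate $$\lambda$$, for example, there is one parameter and $$E[X] = \lambda$$, so the first equation alone gives $$\lambda = \bar{x}$$.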

How is it different from MLE?

The maximum likelihood estimate maximizes the likelihood function. In some cases this maximum can be expressed as setting the population parameters equal to the corresponding sample quantities.

E.g. when we estimate the mean parameter of a distribution with MLE, we often end up using $$\mu = \bar{x}$$. However, this need not always be the case (related: https://stats.stackexchange.com/a/317631/164061, although in the example there, the Poisson distribution, the MLE and MoM estimates coincide, and the same is true for many other distributions). For example, the MLE solution for the estimate of $$\mu$$ in a log-normal distribution is:

$$\mu = \frac{1}{n} \sum \ln (x_i) = \overline{\ln (x)}$$

Whereas the MoM solution is solving

$$\exp (\mu + \frac{1}{2}\sigma^2) = \bar{x}$$ leading to
$$\mu = \ln (\bar{x}) - \frac{1}{2} \sigma^2$$
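To make the difference concrete, here is a minimal sketch in Python (NumPy) that computes both estimates on simulated log-normal data. Several things here are assumptions for illustration and not part of the formulas above: the function names `lognormal_mle` and `lognormal_mom`, the simulated parameters, and the way $$\sigma^2$$ is obtained in the MoM case, namely by also matching the sample variance to the log-normal variance $$(e^{\sigma^2} - 1) e^{2\mu + \sigma^2}$$, which gives $$\sigma^2 = \ln (1 + s^2 / \bar{x}^2)$$.

```python
import numpy as np


def lognormal_mle(x):
    """MLE of (mu, sigma^2): mean and variance of ln(x)."""
    log_x = np.log(x)
    # np.var uses ddof=0 by default, which is the MLE of the variance.
    return log_x.mean(), log_x.var()


def lognormal_mom(x):
    """MoM of (mu, sigma^2): match the sample mean and sample variance of x."""
    m = x.mean()
    v = x.var()
    # From Var[X] / E[X]^2 = exp(sigma^2) - 1
    sigma2 = np.log(1.0 + v / m**2)
    # From E[X] = exp(mu + sigma^2 / 2)
    mu = np.log(m) - 0.5 * sigma2
    return mu, sigma2


# Simulated data with assumed "true" values mu = 1.0, sigma = 0.5
rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.5, size=10_000)

mle_mu, mle_sigma2 = lognormal_mle(x)
mom_mu, mom_sigma2 = lognormal_mom(x)

print(f"MLE: mu = {mle_mu:.4f}, sigma^2 = {mle_sigma2:.4f}")
print(f"MoM: mu = {mom_mu:.4f}, sigma^2 = {mom_sigma2:.4f}")
```

With a sample this large, both estimators should land close to the simulated values, but they are computed from different statistics (moments of $$\ln (x)$$ versus moments of $$x$$) and so generally do not agree exactly.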

So the MoM is a practical way to estimate the parameters, and it often leads to exactly the same result as the MLE (the sample moments are distributed around the population moments, e.g. the sample mean is distributed around the population mean, so up to some factor/bias it works out very well). The MLE has a stronger theoretical foundation: for instance, it allows estimation of errors using the Fisher information matrix (or estimates of it), and it is a much more natural approach in regression problems. (I haven’t tried it, but I guess that a MoM for estimating the parameters of a simple linear regression does not work easily and may give bad results. In the answer by superpronker it seems that this is done by minimizing some function. For MLE such a minimization corresponds to higher probability, but I wonder whether it represents anything similar for MoM.)
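As an illustration of the error estimation mentioned above, here is a hedged sketch for the log-normal $$\mu$$. It assumes the standard result that the Fisher information for $$\mu$$ from $$n$$ observations is $$n / \sigma^2$$, so the approximate standard error of the MLE is $$\sigma / \sqrt{n}$$; the simulated parameters and variable names are made up for the example, and the Monte Carlo loop is only there as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_sigma, n = 1.0, 0.5, 500  # assumed values for the simulation

# One observed sample and its MLEs
x = rng.lognormal(mean=true_mu, sigma=true_sigma, size=n)
log_x = np.log(x)
mu_hat = log_x.mean()
sigma_hat = log_x.std()  # ddof=0, the MLE of sigma

# Standard error from the inverse Fisher information: sqrt(sigma^2 / n)
se_fisher = sigma_hat / np.sqrt(n)

# Sanity check: spread of the MLE of mu over many repeated samples
reps = 2000
samples = rng.lognormal(mean=true_mu, sigma=true_sigma, size=(reps, n))
mu_hats = np.log(samples).mean(axis=1)
se_empirical = mu_hats.std()

print(f"mu_hat = {mu_hat:.4f}")
print(f"Fisher-based SE = {se_fisher:.4f}, empirical SE = {se_empirical:.4f}")
```

The Fisher-based standard error is computed from a single sample, while the empirical one needs repeated sampling; the two should be of similar size here.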