The impression I got from several papers, books and articles is that the recommended way of fitting a probability distribution to a set of data is maximum likelihood estimation (MLE). However, as a physicist, I find it more intuitive to just fit the pdf of the model to the empirical pdf of the data using least squares. Why, then, is MLE better than least squares for fitting probability distributions? Could someone please point me to a scientific paper or book that answers this question?
My hunch is that this is because MLE does not assume a noise model, and the “noise” in the empirical pdf is heteroscedastic and not normal.
One useful way of thinking about this is to note that there are cases where least squares and the MLE coincide, e.g. when estimating the parameters of a model whose random element has a normal distribution. So rather than the MLE not assuming a noise model (as you speculate), what is going on is that it does assume there is random noise, but it takes a more sophisticated view of how that noise is shaped instead of assuming it is normally distributed.
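To make the equivalence concrete, here is a short derivation sketch. The notation is my own assumption and not from the question: observations $y_i$, a model function $f(x_i;\theta)$, and independent normal noise with a common variance $\sigma^2$ that does not depend on $\theta$.

```latex
\begin{align}
y_i &= f(x_i;\theta) + \varepsilon_i,
  \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2), \\
\ell(\theta) &= \sum_{i=1}^{n} \log\!\left[
  \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{\bigl(y_i - f(x_i;\theta)\bigr)^2}{2\sigma^2}\right)\right] \\
&= -\frac{n}{2}\log\bigl(2\pi\sigma^2\bigr)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - f(x_i;\theta)\bigr)^2, \\
\hat\theta_{\mathrm{MLE}} &= \arg\max_{\theta}\,\ell(\theta)
  = \arg\min_{\theta}\sum_{i=1}^{n}\bigl(y_i - f(x_i;\theta)\bigr)^2
  = \hat\theta_{\mathrm{LS}}.
\end{align}
```

The first term of the log-likelihood does not involve $\theta$, so maximizing the likelihood is the same as minimizing the sum of squared residuals. Once the noise is not normal with constant variance (bin counts in a histogram, for instance, are not), the log-likelihood is no longer a sum of squares and the two estimators generally differ.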
Any textbook on statistical inference will cover the nice properties of MLEs with regard to efficiency and consistency (but not necessarily bias). MLEs also have the nice property of being asymptotically normal themselves under a reasonable set of regularity conditions.
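As a rough illustration of the efficiency point, here is a small simulation sketch comparing the two approaches on exponential data. Everything in it is an assumption chosen for illustration and not something from the question: the exponential model, the sample size, the 10 equal-width bins on [0, 5], evaluating the model pdf at bin centers, and the helper names `mle_rate`, `histogram_ls_rate` and `compare`. The least-squares fit uses a plain grid search rather than a proper optimizer to keep the code dependency-free.

```python
import math
import random


def mle_rate(sample):
    # For an exponential distribution the MLE of the rate has a closed form:
    # one over the sample mean.
    return len(sample) / sum(sample)


def histogram_ls_rate(sample, n_bins=10, upper=5.0):
    # Build an empirical pdf from a histogram on [0, upper); points beyond
    # `upper` are dropped from the bins but still counted in the normalization.
    width = upper / n_bins
    counts = [0] * n_bins
    for x in sample:
        k = int(x / width)
        if k < n_bins:
            counts[k] += 1
    centers = [(k + 0.5) * width for k in range(n_bins)]
    density = [c / (len(sample) * width) for c in counts]

    def sse(rate):
        # Unweighted squared error between the model pdf at the bin centers
        # and the histogram heights.
        return sum((d - rate * math.exp(-rate * m)) ** 2
                   for d, m in zip(density, centers))

    # Crude grid search over candidate rates from 0.2 to 3.0 in steps of 0.005.
    grid = [0.2 + 0.005 * i for i in range(561)]
    return min(grid, key=sse)


def compare(true_rate=1.0, n=200, trials=200, seed=0):
    # Repeat the experiment and return the mean squared error of each estimator.
    rng = random.Random(seed)
    mle_sq_err, ls_sq_err = [], []
    for _ in range(trials):
        sample = [rng.expovariate(true_rate) for _ in range(n)]
        mle_sq_err.append((mle_rate(sample) - true_rate) ** 2)
        ls_sq_err.append((histogram_ls_rate(sample) - true_rate) ** 2)
    return sum(mle_sq_err) / trials, sum(ls_sq_err) / trials


mse_mle, mse_ls = compare()
print(f"MSE of MLE:                  {mse_mle:.5f}")
print(f"MSE of histogram least squares: {mse_ls:.5f}")
```

In this setup the histogram fit should show a noticeably larger mean squared error than the MLE, and its result also depends on the binning, which the MLE avoids entirely because it works on the raw data. This is only one model and one binning choice, so treat it as a demonstration and not as a general proof.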