# Why does MAP converge to MLE?

In Kevin Murphy’s *Machine Learning: A Probabilistic Perspective*, Section 3.2,
the author demonstrates Bayesian concept learning on an example called the “number game”: after observing $N$ samples from $\{1,\dots,100\}$, we want to pick the hypothesis $h$ that best describes the rule that generated the samples, for example “even numbers” or “prime numbers”.

The maximum a posteriori (MAP) and maximum likelihood (MLE) estimates are defined as

$$\hat{h}_{\text{MAP}} = \underset{h}{\operatorname{arg\,max}}\ p(\mathcal{D}\mid h)\,p(h), \qquad \hat{h}_{\text{MLE}} = \underset{h}{\operatorname{arg\,max}}\ p(\mathcal{D}\mid h),$$

where $p(h)$ represents the prior probability of hypothesis $h$, and the likelihood is defined as

$$p(\mathcal{D}\mid h) = \left[\frac{1}{|h|}\right]^{N}$$

iff $\mathcal{D}\subset h$ (and $0$ otherwise), i.e., it is the probability that $N$ independent uniform draws with replacement from hypothesis $h$ would yield the set $\mathcal{D}$. Intuitively, this means the posterior is highest for the “smallest” hypothesis consistent with the data. For example, the hypothesis “powers of two” explains the observations $\{2,4,8,16,64\}$ better than “even numbers”.
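To make the size principle concrete, take “powers of two” to be $\{2,4,8,16,32,64\}$ (six elements) and note that there are fifty even numbers in $\{1,\dots,100\}$; then for the five observations above,

$$p(\mathcal{D}\mid\text{powers of two}) = \left(\tfrac{1}{6}\right)^{5} \approx 1.3\times 10^{-4} \;\gg\; \left(\tfrac{1}{50}\right)^{5} = 3.2\times 10^{-9} = p(\mathcal{D}\mid\text{even numbers}).$$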

All of this is clear. However, I am confused about the following sentence (even though intuitively it makes perfect sense):

> Since the likelihood term depends exponentially on $N$, and the prior stays constant, as we get more and more data, the MAP estimate converges towards the maximum likelihood estimate.

It is true that the likelihood depends exponentially on $N$; however, the base being exponentiated lies in the interval $(0,1)$, and $x^N \to 0$ as $N \to \infty$ for any such $x$, so the likelihood itself should vanish.

Why does MAP converge to MLE in this case?

The prior $P(h)$ does not change as $N$ grows, and although $P(D\mid h) \to 0$ as $N \to \infty$, as you observe, so does the normalizing constant $P(D)$. For a little more concreteness, if we assume two hypotheses $h_1$ and $h_2$, we find the posterior of $h_1$ by

$$P(h_1\mid D) = \frac{P(D\mid h_1)\,P(h_1)}{P(D\mid h_1)\,P(h_1) + P(D\mid h_2)\,P(h_2)}.$$

Both the numerator and the denominator contain likelihood terms raised to the power $N$, so both tend to $0$ as $N \to \infty$, but the normalization cancels this common decay: the posterior is a ratio, and as $N$ grows that ratio is dominated by whichever hypothesis has the larger likelihood, while the fixed prior factors matter less and less.
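Here is a small numerical sketch of this argument (the hypothesis sizes and prior values below are illustrative assumptions, not taken from the book): compute the unnormalized posteriors $(1/|h|)^N\,p(h)$ for two consistent hypotheses in log space and normalize. Even though both unnormalized terms shrink towards zero, the normalized posterior of the smaller (higher-likelihood) hypothesis goes to one, so the MAP choice ends up agreeing with the MLE despite a prior that strongly favours the other hypothesis.

```python
import numpy as np

# Two hypotheses consistent with all observations, e.g. "powers of two"
# (6 elements) and "even numbers" (50 elements). The prior strongly
# favours the larger hypothesis, so MAP and MLE disagree at first.
# Sizes and prior values are illustrative assumptions.
sizes = np.array([6.0, 50.0])    # |h_1|, |h_2|
prior = np.array([0.01, 0.99])   # p(h_1), p(h_2)

for N in [1, 2, 3, 10, 50]:
    # log of the unnormalized posterior: log[(1/|h|)^N * p(h)]
    log_unnorm = N * np.log(1.0 / sizes) + np.log(prior)
    log_unnorm -= log_unnorm.max()   # stabilize before exponentiating
    posterior = np.exp(log_unnorm)
    posterior /= posterior.sum()     # normalization cancels the common decay
    print(f"N={N:3d}  p(h1|D)={posterior[0]:.6f}  p(h2|D)={posterior[1]:.6f}")
```

In this sketch the MAP hypothesis flips to the maximum-likelihood one around $N=3$, and the posterior mass on it approaches one as $N$ grows, which is exactly the convergence the quoted sentence describes.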