# Example of how the log-sum-exp trick works in Naive Bayes

I have read about the log-sum-exp trick in many places (e.g. here, and here), but I have never seen an example of how it is applied specifically to the Naive Bayes classifier (e.g. with discrete features and two classes).

How exactly would one avoid the problem of numerical underflow using this trick?

In

$$p(Y=C|\mathbf{x}) = \frac{p(\mathbf{x}|Y=C)\,p(Y=C)}{\sum_{k=1}^{|C|}p(\mathbf{x}|Y=C_k)\,p(Y=C_k)}$$

both the denominator and the numerator can become very small, typically because the $p(x_i \vert C_k)$ can be close to 0 and we multiply many of them with each other. To prevent underflows, one can simply take the log of the numerator, but one needs to use the log-sum-exp trick for the denominator.
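To see the underflow concretely, here is a minimal Python sketch (the likelihood value 0.01 and the feature count 500 are made-up numbers for illustration): multiplying many small likelihoods rounds to exactly 0.0 in double precision, while summing their logs stays well within range.

```python
import math

# 500 features, each with a small conditional probability p(x_i | C_k)
likelihoods = [0.01] * 500

# Naive product of the p(x_i | C_k): 0.01^500 = 10^-1000 is far below the
# smallest positive float64 (~5e-324), so the product underflows to 0.0
prod = 1.0
for p in likelihoods:
    prod *= p
print(prod)  # 0.0 -- underflow

# Summing the logs instead is perfectly representable
log_prod = sum(math.log(p) for p in likelihoods)
print(log_prod)  # ≈ -2302.585, i.e. log(10^-1000)
```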

More specifically, in order to prevent underflows:

• If we only care about knowing which class $\hat{y}$ the input $\mathbf{x}=(x_1, \dots, x_n)$ most likely belongs to under the maximum a posteriori (MAP) decision rule, we don’t have to apply the log-sum-exp trick, since we don’t have to compute the denominator in that case. For the numerator one can simply take the log to prevent underflows: $\log \left( p(\mathbf{x}|Y=C)p(Y=C) \right)$. More specifically:

$$\hat{y} = \underset{k \in \{1, \dots, |C|\}}{\operatorname{argmax}} \; p(Y=C_k) \prod_{i=1}^{n} p(x_i|Y=C_k)$$

which becomes after taking the log:

$$\hat{y} = \underset{k \in \{1, \dots, |C|\}}{\operatorname{argmax}} \left[ \log p(Y=C_k) + \sum_{i=1}^{n} \log p(x_i|Y=C_k) \right]$$

• If we want to compute the class probability $p(Y=C|\mathbf{x})$, we will need to compute the denominator:

$$\log \left( p(Y=C|\mathbf{x}) \right) = \log \left( p(\mathbf{x}|Y=C)\,p(Y=C) \right) - \log \left( \sum_{k=1}^{|C|} p(\mathbf{x}|Y=C_k)\,p(Y=C_k) \right)$$

The term $\log \left( \sum_{k=1}^{|C|} p(\mathbf{x}|Y=C_k)\,p(Y=C_k) \right)$ may underflow because $p(x_i \vert C_k)$ can be very small: it is the same issue as in the numerator, but this time the summation sits inside the logarithm, which prevents us from transforming each $p(x_i \vert C_k)$ (which can be close to 0) into $\log \left(p(x_i \vert C_k) \right)$ (negative and no longer close to 0, since $0 \leq p(x_i \vert C_k) \leq 1$). To circumvent this issue, we can use the fact that $p(x_i \vert C_k) = \exp \left( \log \left(p(x_i \vert C_k) \right) \right)$ to obtain:

$$\log \left( \sum_{k=1}^{|C|} p(\mathbf{x}|Y=C_k)\,p(Y=C_k) \right) = \log \left( \sum_{k=1}^{|C|} \exp \left( \log \left( p(\mathbf{x}|Y=C_k)\,p(Y=C_k) \right) \right) \right)$$

At that point, a new issue arises: $\log \left( p(\mathbf{x}|Y=C_k)p(Y=C_k) \right)$ may be quite negative, which implies that $\exp \left( \log \left( p(\mathbf{x}|Y=C_k)p(Y=C_k) \right) \right)$ may become very close to 0, i.e. underflow. This is where we use the log-sum-exp trick:

$$\log \sum_{k=1}^{|C|} e^{a_k} = A + \log \sum_{k=1}^{|C|} e^{a_k - A}$$

with:

• $a_k=\log \left( p(\mathbf{x}|Y=C_k)p(Y=C_k) \right)$,
• $A = \max_{k \in \{1, \dots, |C|\}} a_k.$
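The identity can be sketched in a few lines of Python; the helper name `logsumexp` is my choice (SciPy ships an equivalent `scipy.special.logsumexp`):

```python
import math

def logsumexp(a):
    """Compute log(sum_k exp(a_k)) stably by factoring out A = max_k a_k."""
    A = max(a)
    # Every exponent a_k - A is <= 0, and the largest one is exactly 0,
    # so the sum is at least 1 and cannot underflow to 0.
    return A + math.log(sum(math.exp(a_k - A) for a_k in a))

# The two log-joint values from the worked example below
a = [-245.0, -255.0]
print(logsumexp(a))  # ≈ -244.99995
```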

We can see that introducing the variable $A$ avoids underflows. E.g. with two classes ($|C|=2$) and $a_1 = -245$, $a_2 = -255$, we have:

• $\exp \left(a_1\right) = \exp \left(- 245\right) =3.96143\times 10^{- 107}$
• $\exp \left(a_2\right) = \exp \left(- 255\right) =1.798486 \times 10^{-111}$

Using the log-sum-exp trick we avoid the underflow, with $A=\max ( -245, -255 )=-245$:
\begin{align}\log \sum_k e^{a_k} &= \log \sum_k e^{a_k}e^{A-A} \\&= A + \log\sum_k e^{a_k - A}\\ &= -245 + \log\sum_k e^{a_k + 245}\\&= -245 + \log \left(e^{-245+245}+e^{-255+245}\right) \\&= -245 + \log \left(e^{0}+e^{-10}\right) \\&\approx -245 + 0.0000454 \approx -244.99995 \end{align}

We avoided the underflow since $e^{-10}$ is much farther away from 0 than $3.96143\times 10^{- 107}$ or $1.798486 \times 10^{-111}$.
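Putting the pieces together, here is a minimal sketch of computing the class posteriors $p(Y=C_k|\mathbf{x})$ entirely in log space; the priors and per-feature likelihoods below are made-up numbers for a hypothetical two-class model with three binary features.

```python
import math

def logsumexp(a):
    """Stable log(sum_k exp(a_k)), factoring out the maximum."""
    A = max(a)
    return A + math.log(sum(math.exp(x - A) for x in a))

# Made-up class priors p(Y=C_k)
log_prior = [math.log(0.6), math.log(0.4)]
# Made-up log p(x_i = 1 | C_k) for each class and feature
log_lik = [
    [math.log(0.9), math.log(0.2), math.log(0.3)],  # class 0
    [math.log(0.1), math.log(0.7), math.log(0.8)],  # class 1
]

x = [1, 1, 1]  # observed features (all 1 here, to keep the sketch short)

# a_k = log( p(x | Y=C_k) p(Y=C_k) ) = log prior + sum of log likelihoods
a = [log_prior[k] + sum(log_lik[k][i] for i in range(3) if x[i] == 1)
     for k in range(2)]

# Log of the denominator via the log-sum-exp trick, then the posteriors
log_evidence = logsumexp(a)
posteriors = [math.exp(a_k - log_evidence) for a_k in a]
print(posteriors)  # two probabilities summing to 1, ≈ [0.591, 0.409]
```

The subtraction $a_k - \log$ (denominator) happens in log space, so the only exponentiation is of values near 0, exactly as in the derivation above.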