From what I have seen, the (second-order) Kneser-Ney smoothing formula is in some way or another given as

\begin{align}

P^2_{KN}(w_n|w_{n-1}) &= \frac{\max \left\{ C\left(w_{n-1}, w_n\right) – D, 0\right\}}{\sum_{w’} C\left(w_{n-1}, w’\right)} + \lambda(w_{n-1}) \times P_{cont}(w_n)

\end{align}

with the normalizing factor \lambda(w_{n-1}) given as

\begin{align}

\lambda(w_{n-1}) &= \frac{D}{\sum_{w’} C\left(w_{n-1}, w’\right)} \times N_{1+}\left(w_{n-1}\bullet\right)

\end{align}

and the continuation probability P_{cont}(w_n) of a word w_n

\begin{align}

P_{cont}(w_n) &= \frac{N_{1+}\left(\bullet w_{n}\right)}{\sum_{w’} N_{1+}\left(\bullet w’\right)}

\end{align}

where N_{1+}\left(\bullet w\right) is the number of contexts w was seen in or, simplier, the number of distinct words \bullet that precede the given word w. From what I’ve understood, the formula can be applied recursively.

Now this handles known words in unknown contexts nicely for different n-gram lengths, but what it doesn’t explain is what to do when there are out-of-dictionary words. I tried following this example which states that in the recursion step for unigrams, P_{cont}(/) = P^0_{KN}(/) = \frac{1}{V}. The document then uses this – quoting Chen and Goodman – to justify the above formula as P^1_{KN}(w) = P_{cont}(w).

I fail to see how it works out in the presence of an unknown word w = \text{unknown} though. In these cases P_{cont}(\text{unknown}) = \frac{0}{\text{something}} since, obviously, the unknown word doesn’t continue anything regarding the training set. Likewise the count of n-grams is going to be C\left(w_{n-1}, \text{unknown}\right) = 0.

Furthermore, the whole \sum_{w’} C\left(w_{n-1}, w’\right) term might be zero if a sequence of unknown words – say, a trigram of OOD words – is encountered.

What am I missing?

**Answer**

Dan Jurafsky has published a chapter on N-Gram models which talks a bit about this problem:

At the termination of the recursion, unigrams are interpolated with the uniform distribution:

\begin{align}

P_{KN}(w) = \frac{\max(c_{KN}(w)-d,0)}{\sum_{w’}c_{KN}(w’)}+\lambda(\epsilon)\frac{1}{|V|}

\end{align}

If we want to include an unknown word

`<UNK>`

, it’s just included as a regular vocabulary entry with count zero, and hence its probability will be:

\begin{align}

\frac{\lambda(\epsilon)}{|V|}

\end{align}

I’ve tried to find out what this means, but am not sure if \epsilon just means \lim_{x\rightarrow0}x. If this is the case, and you assume that as the count goes to zero, maybe \lambda(\epsilon) goes to d, according to:

\begin{align}

\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\vert\{w:c(w_{i-1},w)>0\}\vert

\end{align}

then the unknown word just gets assigned a fraction of the discount, i.e.:

\begin{align}

\frac{\lambda(\epsilon)}{|V|} = \frac{d}{|V|}

\end{align}

I’m not confident about this answer at all, but wanted to get it out there in case it sparks some more thoughts.

**Update:**

Digging around some more, it seems like \epsilon is typically used to denote the empty string (“”), but it’s still not clear how this affects the calculation of \lambda. \frac{d}{|V|} is still my best guess

**Attribution***Source : Link , Question Author : sunside , Answer Author : abroekhof*