I am using KL divergence as a measure of dissimilarity between two p.m.f.s P and Q.
If P(Xi)=0, then we can easily calculate that P(Xi)ln(Q(Xi))=0 and P(Xi)ln(P(Xi))=0.
But if P(Xi)≠0 and Q(Xi)=0,
how do I calculate P(Xi)ln(Q(Xi))?
You can’t and you don’t.
Imagine that you have a random variable with probability distribution P, but your friend Bob thinks that the outcome comes from the probability distribution Q. He has constructed an optimal encoding that minimizes the expected number of bits he will need to tell you the outcome. But since he constructed the encoding from Q and not from P, his codes will be longer than necessary. The KL divergence D(P||Q) measures how many extra bits he needs on average.
Now let’s say he has a coin and he wants to tell you the sequence of outcomes he gets. Because heads and tails are equally likely, he gives them both 1-bit codes: 0 for heads, 1 for tails. If he gets tails, tails, heads, tails, he can send 1 1 0 1.
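Now, if his coin lands on its edge, he cannot possibly tell you! No code he sends you would work, because his encoding gave that outcome probability zero and therefore no codeword. This is exactly the situation you describe: an outcome that really occurs has been assigned probability zero by the distribution the code was built from. At this point the KL divergence breaks down.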
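To make the coin story concrete, here is a minimal sketch in Python. The function name `kl_divergence_bits` and the probabilities for the "edgy" coin are made up for illustration; the only convention assumed is the usual one that a term with true probability 0 contributes nothing.

```python
import math

def kl_divergence_bits(true_dist, code_dist):
    """Expected extra bits per outcome when outcomes follow true_dist
    but the code was built for code_dist."""
    total = 0.0
    for p, q in zip(true_dist, code_dist):
        if p == 0:
            continue  # convention: an outcome that never occurs costs nothing
        if q == 0:
            return math.inf  # outcome can occur but has no codeword
        total += p * math.log2(p / q)
    return total

# Outcomes in order: heads, tails, edge
bobs_model = [0.5, 0.5, 0.0]    # Bob assumes the coin never lands on its edge
fair_coin = [0.5, 0.5, 0.0]     # a coin that behaves as Bob expects
edgy_coin = [0.49, 0.49, 0.02]  # hypothetical coin that sometimes lands on its edge

print(kl_divergence_bits(fair_coin, bobs_model))  # 0.0: Bob's code is optimal
print(kl_divergence_bits(edgy_coin, bobs_model))  # inf: the edge has no codeword
print(kl_divergence_bits(bobs_model, edgy_coin))  # small positive number: finite the other way round
```

Note the asymmetry in the last two lines: the divergence is only infinite in the direction where the coding distribution rules out something that actually happens.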
Since the KL divergence breaks down, you will have to use either another measure or other probability distributions. What you should do really depends on what you want. Why are you comparing probability distributions? Where do your probability distributions come from? Are they estimated from data?
You say your probability distributions come from natural language documents somehow, and you want to compare pairs of categories.
First, I’d recommend a symmetric relatedness measure. For this application it sounds like you want A to be as similar to B as B is to A.
Have you tried the cosine similarity measure? It is quite common in NLP.
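As a rough sketch of what that could look like, the snippet below computes cosine similarity on raw word counts. The function name, the whitespace tokenization, and the two toy sentences are assumptions for illustration only; a real pipeline would probably use better tokenization and some term weighting.

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two sparse word-count vectors."""
    # Only words present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm_a * norm_b)

doc_a = Counter("the cat sat on the mat".split())
doc_b = Counter("the dog sat on the log".split())

print(cosine_similarity(doc_a, doc_b))  # about 0.75
print(cosine_similarity(doc_b, doc_a))  # same value: the measure is symmetric
```

Words that appear in only one document simply contribute zero to the dot product, so nothing blows up the way a zero probability does in the KL divergence.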
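If you want to stick with KL, one thing you could do is estimate a probability function from both documents together and then see how many extra bits you’d need on average for either document. That is, (D(P||M) + D(Q||M))/2 with M = (P+Q)/2, which is known as the Jensen-Shannon divergence. Because M(Xi) is nonzero wherever P(Xi) or Q(Xi) is nonzero, every term is finite.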
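The averaged measure above can be sketched in a few lines of Python. The helper names `kl_bits` and `js_divergence_bits` and the example distributions are hypothetical; both inputs are assumed to be probability lists over the same outcomes in the same order.

```python
import math

def kl_bits(p, q):
    """KL divergence in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence_bits(p, q):
    """Average of the KL divergences of p and q from their midpoint."""
    # The midpoint is positive wherever p or q is, so kl_bits never divides by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl_bits(p, m) + kl_bits(q, m)) / 2

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

print(js_divergence_bits(p, q))            # 0.5
print(js_divergence_bits(q, p))            # 0.5, symmetric
print(js_divergence_bits([1, 0], [0, 1]))  # 1.0, the maximum for disjoint supports
```

With base-2 logarithms the result always lies between 0 (identical distributions) and 1 (no outcome in common), which makes values comparable across pairs of categories.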