I have used LDA on a corpus of documents and found some topics. The output of my code is two matrices of probabilities: one with the document-topic probabilities and the other with the word-topic probabilities. But I don’t know how to use these results to predict the topic of a new document. I am using Gibbs sampling. Does anyone know how? Thanks.
I’d try ‘folding in’. This refers to taking one new document, adding it to the corpus, and then running Gibbs sampling only on the words in that new document, keeping the topic assignments of the old documents fixed. This usually converges fast (maybe 5 to 20 iterations), and since you don’t resample the old corpus, it also runs fast. At the end you will have a topic assignment for every word in the new document, which gives you the distribution of topics in that document.
In your Gibbs sampler, you probably have something similar to the following code:
// This will initialize the matrices of counts,
// N_tw (topic-word matrix) and N_dt (document-topic matrix)
for doc = 1 to N_Documents
    for token = 1 to N_Tokens_In_Document
        Assign current token to a random topic, updating the count matrices
    end
end

// This will do the Gibbs sampling
for doc = 1 to N_Documents
    for token = 1 to N_Tokens_In_Document
        Compute probability of current token being assigned to each topic
        Sample a topic from this distribution
        Assign the token to the new topic, updating the count matrices
    end
end
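To make the pseudocode concrete, here is a minimal collapsed Gibbs sampler in Python. This is a sketch, not the asker’s actual code: the function name `gibbs_lda`, the hyperparameters `alpha`/`beta`, and the representation of documents as lists of integer word ids are all my own choices.

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_vocab, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids.
    Returns the count matrices N_dt (doc-topic), N_tw (topic-word),
    and the per-token topic assignments z.
    """
    rng = np.random.default_rng(seed)
    N_dt = np.zeros((len(docs), n_topics), dtype=int)  # document-topic counts
    N_tw = np.zeros((n_topics, n_vocab), dtype=int)    # topic-word counts
    N_t = np.zeros(n_topics, dtype=int)                # tokens per topic

    # Initialization: assign every token to a random topic, updating counts
    z = []
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, t in zip(doc, z_d):
            N_dt[d, t] += 1; N_tw[t, w] += 1; N_t[t] += 1

    # Gibbs sampling: resample each token's topic from its full conditional
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove this token's current assignment from the counts
                N_dt[d, t] -= 1; N_tw[t, w] -= 1; N_t[t] -= 1
                # Probability of each topic for this token (up to a constant)
                p = (N_dt[d] + alpha) * (N_tw[:, w] + beta) / (N_t + n_vocab * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment, updating the counts
                z[d][i] = t
                N_dt[d, t] += 1; N_tw[t, w] += 1; N_t[t] += 1
    return N_dt, N_tw, z
```

The count matrices stay consistent throughout because every resampling step first subtracts the token’s old assignment and then adds the new one.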
Folding-in is the same, except you start with the existing matrices, add the new document’s tokens to them, and do the sampling for only the new tokens. I.e.:
Start with the N_tw and N_dt matrices from the previous step

// This will update the count matrices for folding-in
for token = 1 to N_Tokens_In_New_Document
    Assign current token to a random topic, updating the count matrices
end

// This will do the folding-in by Gibbs sampling
for token = 1 to N_Tokens_In_New_Document
    Compute probability of current token being assigned to each topic
    Sample a topic from this distribution
    Assign the token to the new topic, updating the count matrices
end
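The folding-in step above can be sketched in Python as follows. This assumes you already have the trained topic-word counts N_tw (e.g. from a sampler like the one sketched earlier); the function name `fold_in` and the smoothing parameters `alpha`/`beta` are my own, and the returned theta is the smoothed topic distribution for the new document.

```python
import numpy as np

def fold_in(new_doc, N_tw, n_iters=20, alpha=0.1, beta=0.01, seed=0):
    """Fold a new document into a trained LDA model (illustrative sketch).

    new_doc: list of integer word ids.
    N_tw: trained topic-word count matrix, shape (n_topics, n_vocab).
    Samples topics only for the new document's tokens, keeping the old
    corpus's assignments fixed. Returns (theta, z): the new document's
    topic distribution and its per-token topic assignments.
    """
    rng = np.random.default_rng(seed)
    n_topics, n_vocab = N_tw.shape
    N_tw = N_tw.copy()                   # don't clobber the trained counts
    N_t = N_tw.sum(axis=1)
    n_dt = np.zeros(n_topics, dtype=int) # topic counts for the new doc only

    # Add the new document's tokens to the counts with random topics
    z = rng.integers(n_topics, size=len(new_doc))
    for w, t in zip(new_doc, z):
        n_dt[t] += 1; N_tw[t, w] += 1; N_t[t] += 1

    # Gibbs sampling over the new tokens only
    for _ in range(n_iters):
        for i, w in enumerate(new_doc):
            t = z[i]
            n_dt[t] -= 1; N_tw[t, w] -= 1; N_t[t] -= 1
            p = (n_dt + alpha) * (N_tw[:, w] + beta) / (N_t + n_vocab * beta)
            t = rng.choice(n_topics, p=p / p.sum())
            z[i] = t
            n_dt[t] += 1; N_tw[t, w] += 1; N_t[t] += 1

    # Smoothed topic distribution for the new document
    theta = (n_dt + alpha) / (len(new_doc) + n_topics * alpha)
    return theta, z
```

Because only the new document’s handful of tokens are resampled, each iteration is cheap, which is why the whole procedure converges in a few dozen iterations at most.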
If you do standard LDA, it is unlikely that an entire document was generated by one topic, so I don’t know how useful it is to compute the probability of the document under a single topic. But if you still wanted to, it’s easy. From the two matrices you can compute p_iw, the probability of word w under topic i. Take your new document, and suppose its j-th word is w_j. The words are independent given the topic, so the probability is just the product over j of p_{i,w_j} (note that you will probably need to compute it in log space to avoid underflow).
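This per-topic log-probability computation might look like the following in Python. It assumes N_tw is a topic-word count matrix as above, and estimates p_iw from smoothed counts; the function name and the `beta` smoothing constant are my own.

```python
import numpy as np

def log_prob_doc_given_topic(doc, N_tw, beta=0.01):
    """log p(doc | topic i) for every topic i (illustrative sketch).

    doc: list of integer word ids w_j.
    N_tw: topic-word count matrix, shape (n_topics, n_vocab).
    Assumes words are independent given the topic, so the document
    log-probability is the sum over tokens of log p_{i, w_j}.
    """
    n_topics, n_vocab = N_tw.shape
    # Smoothed estimate of p_iw, computed directly in log space
    log_p = np.log((N_tw + beta) /
                   (N_tw.sum(axis=1, keepdims=True) + n_vocab * beta))
    # Sum log p_{i, w_j} over the document's tokens, for each topic i
    return log_p[:, doc].sum(axis=1)
```

Working in log space turns the product over tokens into a sum, which avoids the numerical underflow that the raw product would hit on any document of realistic length.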