Inspired by this question, I’m wondering whether any work has been done on topic models for large collections of extremely short texts. My intuition is that Twitter should be a natural inspiration for such models. However, from some limited experimentation, it looks like standard topic models (LDA, etc.) perform quite poorly on this kind of data.
Does anyone out there know of any work which has been done in this area? This paper talks about applying LDA to Twitter, but I’m really interested in whether there are other algorithms which perform better in the short-document context.
This is a late answer, but it can be useful for other people searching for related research and tools for this problem:
Weiwei Guo from Columbia implemented code for short-text topic modeling. He described the implementation in the paper “Modeling Sentences in the Latent Space” (http://aclweb.org/anthology-new/P/P12/P12-1091v2.pdf) and the code is available here:
Although this is not topic modeling, if you have a classification task involving short pieces of text, you can use LibShortText. From the description on their website:
“LibShortText is an open source tool for short-text classification and analysis. It can handle the classification of, for example, titles, questions, sentences, and short messages…”
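To give a rough feel for what short-text classification involves, here is a minimal sketch in plain Python. It is not LibShortText's API or its algorithm: the `features` function, the `ShortTextNB` class, and the toy training data are all hypothetical names made up for illustration, and the classifier is an ordinary multinomial naive Bayes over unigram and bigram counts.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Lowercased unigrams plus adjacent-word bigrams (hypothetical helper)."""
    tokens = text.lower().split()
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class ShortTextNB:
    """Multinomial naive Bayes with add-one smoothing (illustrative only,
    not how LibShortText is implemented)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in features(text):
                if feat in self.vocab:  # skip features never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for this example.
train_texts = [
    "how do I install numpy",
    "what is the best python editor",
    "cheap flights to paris",
    "hotel deals in rome",
]
train_labels = ["programming", "programming", "travel", "travel"]

model = ShortTextNB().fit(train_texts, train_labels)
print(model.predict("python install error"))
print(model.predict("cheap hotel in paris"))
```

With texts this short, each document contributes only a handful of features, so adding bigrams and smoothing the counts matters more than it would for long documents. For real work, a dedicated tool such as LibShortText is likely a better starting point than this sketch.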