What is a good method for short text clustering?

I am working on a text clustering problem. The data contains several sentences. Is there a good algorithm which reaches high accuracy on short text?

Can you provide good references?

Algorithms such as KMeans and spectral clustering do not work well for this problem.


That mostly depends on how much “state-of-the-art” (SOTA) you want versus how deep you wish to go (pun intended…).

If you can live with just shallow word embeddings as provided by word2vec, GloVe, or fastText, I think the Word Mover's Distance (WMD [yes, really…]) is a nice function for measuring (short) document distances [1]. I’ve even seen several Python notebooks in the past that provide “tutorials” for this distance measure, so it’s really easy to get going.
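
To give a feel for how such a distance plugs into clustering, here is a minimal pure-Python sketch. It is not the exact WMD (which needs an optimal transport solver) but the relaxed WMD lower bound described in the same paper [1], where each word simply moves all of its weight to the nearest word in the other document. The 2-D embeddings, the helper names (`nbow`, `one_sided`, `relaxed_wmd`, `cluster`), and the single-linkage merging are all assumptions made up for this illustration; with real data you would load pretrained vectors and probably use a library implementation.

```python
import math
from collections import Counter

# Toy 2-D "embeddings", invented for illustration only. In practice these
# would come from a pretrained word2vec, GloVe, or fastText model.
EMBEDDINGS = {
    "cat": (1.0, 0.1), "kitten": (0.9, 0.2), "dog": (0.8, 0.3),
    "purrs": (1.0, 0.3), "barks": (0.7, 0.4), "sleeps": (0.85, 0.35),
    "stock": (0.1, 1.0), "market": (0.2, 0.9), "shares": (0.15, 0.95),
    "fell": (0.3, 0.8), "dropped": (0.3, 0.85),
}


def nbow(tokens):
    """Normalized bag-of-words weights over in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no in-vocabulary tokens")
    return {word: count / total for word, count in counts.items()}


def one_sided(src, dst):
    """Cost of moving each source word's full weight to its nearest target word."""
    return sum(
        weight * min(math.dist(EMBEDDINGS[w], EMBEDDINGS[v]) for v in dst)
        for w, weight in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed WMD: the larger of the two one-sided relaxations."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided(a, b), one_sided(b, a))


def cluster(docs, k):
    """Single-linkage agglomerative clustering on pairwise relaxed WMD."""
    n = len(docs)
    dist = {
        (i, j): relaxed_wmd(docs[i], docs[j])
        for i in range(n) for j in range(i + 1, n)
    }

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist[min(i, j), max(i, j)] for i in c1 for j in c2)

    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters


docs = [
    ["cat", "purrs"],
    ["kitten", "sleeps"],
    ["dog", "barks"],
    ["stock", "market", "fell"],
    ["shares", "dropped"],
]
clusters = cluster(docs, k=2)
print(clusters)
```

On these toy sentences the animal documents and the finance documents end up in separate groups even though they share no words, which is the whole point of using an embedding-based distance on short texts. Any clustering method that accepts a precomputed distance matrix could take the place of the single-linkage loop.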

However, if you are more interested in SOTA, you will have to look into deep (sequence representation) learning, using some kind of recurrent network that learns a topic model from your sentences. In addition to integrating (semantic) embeddings of words, these approaches go beyond the [good, old] “bag-of-words” approach by learning topic representations using the dependencies of the words in the sentence[s]. For example, the Sentence Level Recurrent Topic Model (SLRTM) is a pretty interesting deep, recurrent model based on the ideas of the more traditional LDA (by Blei et al.) or LSA (Landauer et al.), but it’s only an arXiv paper (so all default “take-this-with-a-grain-of-salt warnings” about non-peer-reviewed research should apply…) [2]. Nonetheless, the paper has many excellent pointers and references to get your research started should you want to go down this rabbit hole.

Finally, it should be clarified that I don’t claim that these are the agreed-upon best-performing methods for bag-of-words and sequence models, respectively. But they should get you pretty close to whatever the “best” SOTA might be, and at least should serve as an excellent starting point.

[1] Matt J. Kusner et al. From Word Embeddings To Document Distances. Proceedings of the 32nd International Conference on Machine Learning, JMLR, 2015.

[2] Fei Tian et al. SLRTM: Letting Topics Speak for Themselves. arXiv 1604.02038, 2016.

Source : Link , Question Author : user3108764 , Answer Author : fnl
