Text Mining: how to cluster texts (e.g. news articles) with artificial intelligence?

I have built some neural networks (MLP (fully connected), Elman (recurrent)) for different tasks, like playing Pong, classifying handwritten digits, and so on…

Additionally, I tried to build my first convolutional neural networks, e.g. for classifying multi-digit handwritten notes, but I am completely new to analyzing and clustering texts. In image recognition/clustering tasks one can rely on standardized input, like 25×25-pixel images, RGB or greyscale, and so on… there are plenty of features one can presuppose.

For text mining, for instance of news articles, you have input of ever-changing size (different words, different sentences, different text lengths, …).

How can one implement a modern text mining tool utilizing artificial intelligence, preferably neural networks / SOMs?

Unfortunately, I was unable to find simple tutorials to start off with. Complex scientific papers are hard to read and, in my opinion, not the best option for learning a topic. I have already read quite a few papers about MLPs, dropout techniques, convolutional neural networks and so on, but I was unable to find a basic one about text mining – everything I found was far too high-level for my very limited text-mining skills.

Answer

Latent Dirichlet Allocation (LDA) is great, but if you want something better that uses neural networks, I would strongly suggest doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html).

What does it do? It works similarly to Google’s word2vec, but instead of a feature vector for a single word you get a feature vector for a whole paragraph or document. The method builds on word2vec’s architectures (skip-gram/CBOW) extended with a paragraph vector, and it is considered one of the best methods for extracting a feature vector for documents (a small example of what those vectors look like follows the training code below).

Given such a vector for every document, you can run k-means (or any other clustering algorithm you prefer) on the vectors and cluster the results; a minimal sketch of that step also follows the training code below.

Finally, extracting the feature vectors is as easy as this:

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

class LabeledLineSentence(object):
    """Stream the corpus: one TaggedDocument per line of a text file."""
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, encoding='utf-8') as f:
            for uid, line in enumerate(f):
                # Each line becomes one document, tagged with a unique id
                yield TaggedDocument(words=line.split(), tags=['TXT_%s' % uid])


sentences = LabeledLineSentence('your_text.txt')

model = Doc2Vec(vector_size=50, window=5, min_count=5, dm=1, workers=8,
                sample=1e-5, alpha=0.025, min_alpha=0.025)

model.build_vocab(sentences)

# Train repeatedly, manually decaying the learning rate after each pass
for epoch in range(500):
    try:
        print('epoch %d' % epoch)
        model.train(sentences, total_examples=model.corpus_count, epochs=1)
        model.alpha *= 0.99
        model.min_alpha = model.alpha
    except (KeyboardInterrupt, SystemExit):
        break
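To make the “one feature vector per paragraph” idea concrete, here is a minimal sketch assuming the model trained above; the TXT_… tags are the ones assigned by LabeledLineSentence, and the text passed to infer_vector is just a placeholder:

# Learned vector for the first line/document of your_text.txt
first_doc_vector = model.dv['TXT_0']
print(first_doc_vector.shape)   # (50,) -- one vector per paragraph/document

# Infer a vector for a completely new, unseen piece of text
new_vector = model.infer_vector('some new article text'.split())
print(new_vector.shape)         # (50,)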

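For the clustering step mentioned above, a minimal sketch using scikit-learn's KMeans; the choice of 10 clusters is just an assumption you would tune for your own corpus:

from sklearn.cluster import KMeans
import numpy as np

# Collect the learned document vectors in tag order
doc_vectors = np.array([model.dv['TXT_%s' % i] for i in range(len(model.dv))])

# Group the documents into 10 clusters
kmeans = KMeans(n_clusters=10, random_state=0)
cluster_labels = kmeans.fit_predict(doc_vectors)

# Each document id is now assigned to a cluster
for uid, cluster in enumerate(cluster_labels[:20]):
    print('document %d -> cluster %d' % (uid, cluster))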
Attribution
Source: Link, Question Author: daniel451, Answer Author: Yannis Assael
