I have built some neural networks (MLPs (fully connected), Elman (recurrent)) for different tasks, like playing Pong, classifying handwritten digits and so on…
Additionally, I tried to build some first convolutional neural networks, e.g. for classifying multi-digit handwritten notes, but I am completely new to analyzing and clustering texts. In image recognition/clustering tasks one can rely on standardized input, like 25×25-pixel images, RGB or greyscale, and so on… there are plenty of features one can assume in advance.
For text mining, for instance of news articles, the input size is ever-changing (different words, different sentences, different text lengths, …).
How can one implement a modern text mining tool utilizing artificial intelligence, preferably neural networks / SOMs?
Unfortunately, I was unable to find simple tutorials to start off with. Complex scientific papers are hard to read and, in my opinion, not the best option for learning a topic. I have already read quite a few papers about MLPs, dropout techniques, convolutional neural networks and so on, but I was unable to find a basic one about text mining – everything I found was far too high-level for my very limited text mining skills.
Latent Dirichlet Allocation (LDA) is great, but if you want something better that uses neural networks I would strongly suggest doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html).
What does it do? It works similarly to Google’s word2vec, but instead of a feature vector for a single word you get a feature vector for a whole paragraph. The method is based on a skip-gram model and neural networks, and is considered one of the best methods for extracting a feature vector for a document.
Now, given that you have these vectors, you can run k-means clustering (or any other algorithm you prefer) and cluster the results.
Finally, extracting the feature vectors is as easy as this:
```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument  # called LabeledSentence in old gensim versions


class LabeledLineSentence(object):
    """Stream one tagged document per line of a text file."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename) as f:
            for uid, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=['TXT_%s' % uid])


sentences = LabeledLineSentence('your_text.txt')

# `vector_size` was called `size` in gensim < 4.0; dm=1 selects the
# "distributed memory" training mode.
model = Doc2Vec(vector_size=50, window=5, min_count=5, dm=1, workers=8,
                sample=1e-5, alpha=0.025, min_alpha=0.001)
model.build_vocab(sentences)

# Recent gensim handles the learning-rate decay internally, so the old
# manual per-epoch loop (alpha *= 0.99) is no longer needed.
model.train(sentences, total_examples=model.corpus_count, epochs=500)
```
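Once trained, each document's vector is available from the model (e.g. via its `TXT_<uid>` tag), and you can feed the resulting matrix to any clustering algorithm, such as `sklearn.cluster.KMeans`. To make the idea concrete without depending on a trained model, here is a minimal k-means sketch in plain NumPy on synthetic stand-ins for the 50-dimensional doc2vec vectors (the data and cluster count are made up for illustration):

```python
import numpy as np


def kmeans(X, k, n_iter=100):
    """Minimal k-means: cluster the rows of X into k groups."""
    # Deterministic init: pick k evenly spaced rows as starting centroids.
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each vector to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned vectors.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged
        centroids = new
    return labels, centroids


# Toy stand-in for doc2vec output: two well-separated groups of
# 50-dimensional "document vectors".
rng = np.random.default_rng(42)
docs = np.vstack([rng.normal(0.0, 0.1, (5, 50)),   # documents 0-4
                  rng.normal(3.0, 0.1, (5, 50))])  # documents 5-9
labels, _ = kmeans(docs, k=2)
```

With real data you would replace `docs` with the vectors pulled from the trained doc2vec model and likely use scikit-learn's `KMeans`, which adds smarter initialization and multiple restarts.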