I have built some neural networks (MLPs (fully connected), Elman (recurrent)) for different tasks, like playing Pong, classifying handwritten digits and so on…
Additionally, I have tried building my first convolutional neural networks, e.g. for classifying multi-digit handwritten notes, but I am completely new to analyzing and clustering texts. In image recognition/clustering tasks, for example, one can rely on standardized input, like 25×25-pixel images, RGB or greyscale, and so on; there are plenty of features one can assume in advance.
For text mining, for instance of news articles, the input size is ever-changing (different words, different sentences, different text lengths, …).
How can one implement a modern text mining tool utilizing artificial intelligence, preferably neural networks / SOMs?
Unfortunately, I was unable to find simple tutorials to start off with. Complex scientific papers are hard to read and, in my opinion, not the best way to learn a topic. I have already read quite a few papers about MLPs, dropout techniques, convolutional neural networks and so on, but I was unable to find a basic one about text mining; everything I found was far too high-level for my very limited text mining skills.
Answer
Latent Dirichlet Allocation (LDA) is great, but if you want something better that uses neural networks I would strongly suggest doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html).
What does it do? It works similarly to Google’s word2vec, but instead of a feature vector for a single word you get a feature vector for a whole paragraph. The method is based on a skip-gram model and neural networks and is considered one of the best methods for extracting a feature vector for a document.
Now, given this vector, you can run k-means clustering (or any other algorithm you prefer) and cluster the documents.
Finally, extracting the feature vectors is as easy as this:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument  # replaces the old LabeledSentence

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # One document per line of the file, tagged with its line number
        with open(self.filename) as f:
            for uid, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=['TXT_%s' % uid])

sentences = LabeledLineSentence('your_text.txt')

model = Doc2Vec(alpha=0.025, min_alpha=0.025, vector_size=50, window=5,
                min_count=5, dm=1, workers=8, sample=1e-5)
model.build_vocab(sentences)

for epoch in range(500):
    try:
        print('epoch %d' % epoch)
        model.train(sentences, total_examples=model.corpus_count, epochs=1)
        # Manually decay the learning rate between passes
        model.alpha *= 0.99
        model.min_alpha = model.alpha
    except (KeyboardInterrupt, SystemExit):
        break
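The clustering step mentioned above can then be sketched with scikit-learn's KMeans. A minimal sketch, assuming scikit-learn is available; the random matrix here is a stand-in for real doc2vec output (with a trained model you would use `model.dv.vectors` instead):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for doc2vec output: 20 documents, each a 50-dimensional vector.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(20, 50))

# Partition the documents into 3 clusters in vector space
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_vectors)

# One cluster id (0, 1 or 2) per document
print(labels.shape)  # (20,)
```

For a new, unseen document you can obtain a vector with `model.infer_vector(word_list)` and assign it to its nearest cluster with `kmeans.predict`.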
Attribution
Source : Link , Question Author : daniel451 , Answer Author : Yannis Assael