I am looking to do classification on my text data. I have 300 classes and 200 training documents per class (so 60,000 documents in total), and this is likely to result in very high-dimensional data (we may be looking at in excess of 1 million dimensions).
I would like to perform the following steps in the pipeline (just to give you a sense of what my requirements are):
- Converting each document to a feature vector (vector space model)
- Feature selection (Mutual Information based preferably, or any other standard one)
- Training the classifier (
- Predicting unseen data based on the classifier model trained.
So the question is: what tools/frameworks should I use for handling such high-dimensional data? I am aware of the usual suspects (R, WEKA…), but as far as I know (I may be wrong) possibly none of them can handle data this large. Is there any other off-the-shelf tool that I could look at?
If I have to parallelize it, should I be looking at Apache Mahout? It looks like it may not yet provide the functionality I require.
Thanks to all in advance.
Update: I looked around this website, R mailing list and the internet in general. It appears to me that the following problems could emerge in my situation:
(2) Since I will need to use an ensemble of R packages (pre-processing, sparse matrices, classifiers etc.), interoperability between the packages could become a problem, and I may incur additional overhead in converting data from one format to another. For example, if I do my pre-processing using the tm package (or an external tool like WEKA), I will need to figure out a way to convert this data into a form that the HPC libraries in R can read. And again, it is not clear to me whether the classifier packages would directly take in the data as provided by the HPC libraries.
Am I on the right track? And more importantly, am I making sense?
It should be possible to make this work as long as the data is represented as a sparse data structure, such as a scipy.sparse.csr_matrix instance in Python. I wrote a tutorial for working on text data. It is possible to reduce the memory usage further by leveraging the hashing trick: adapt the tutorial to use the HashingVectorizer instead of the CountVectorizer or the TfidfVectorizer. This is explained in the documentation section on text feature extraction.
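As a minimal sketch of the hashing trick (the toy documents and the n_features value below are assumptions for illustration, not part of the tutorial), HashingVectorizer maps tokens straight to column indices, so no vocabulary has to be kept in memory and the output stays sparse:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy corpus, assumed for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock markets fell sharply today",
]

# n_features fixes the dimensionality up front; 2**20 is roughly the
# "1 million dimensions" mentioned in the question.
vectorizer = HashingVectorizer(n_features=2**20)

# The vectorizer is stateless, so transform() can be called without fit().
X = vectorizer.transform(docs)

# Only the non-zero entries are stored, not 3 x 2**20 values.
print(X.format, X.shape, X.nnz)
```

The trade-off is that hashed features cannot be mapped back to the original tokens, which makes the model harder to inspect.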
Random Forests are in general much more expensive than linear models (such as linear support vector machines and logistic regression) and multinomial or Bernoulli naive Bayes, and for most text classification problems they do not bring significantly better predictive accuracy than simpler models.
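A rough sketch of how the simpler models can be compared in one pipeline is shown here. The toy corpus, the labels, the build helper and k=10 are all assumptions for illustration. Chi-squared scoring is used for feature selection because it is cheap on sparse data; scikit-learn also provides mutual_info_classif, which could be swapped in for a Mutual Information based selection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, assumed for illustration only.
train_docs = [
    "the striker scored a late goal",
    "the goalkeeper saved a penalty",
    "the team won the football match",
    "the central bank raised interest rates",
    "stock markets fell as inflation rose",
    "investors sold shares and bonds",
]
train_labels = ["sport", "sport", "sport", "finance", "finance", "finance"]


def build(classifier, k=10):
    """Vectorize -> keep the k best features (chi2) -> classify."""
    return make_pipeline(TfidfVectorizer(), SelectKBest(chi2, k=k), classifier)


models = {
    "linear_svm": build(LinearSVC()),
    "logistic_regression": build(LogisticRegression(max_iter=1000)),
    "multinomial_nb": build(MultinomialNB()),
}

unseen = ["the bank sold bonds", "a penalty goal won the match"]
for name, model in models.items():
    model.fit(train_docs, train_labels)  # all steps operate on sparse matrices
    print(name, list(model.predict(unseen)))
```

On a real corpus the same pipelines can be passed to cross-validation to check whether a more expensive model is actually worth its cost.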
If scikit-learn ends up not being able to scale to your problem, Vowpal Wabbit will do (and probably faster than sklearn), although it does not implement all the models you are talking about.
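If you do try Vowpal Wabbit, the data first has to be written in its plain-text input format, one example per line in the form "label | features". The helper below is a hypothetical sketch (to_vw_line and the sample documents are my own names, not part of any library). It assumes integer class labels starting at 1, which is what VW's one-against-all multiclass mode (--oaa) expects, and it strips the characters that are reserved in the format.

```python
import re


def to_vw_line(label: int, text: str) -> str:
    """Format one document as a VW example: '<label> | tok tok ...'."""
    # ':' and '|' are reserved in VW's input format, so keep only
    # word characters and drop everything else.
    tokens = re.findall(r"\w+", text.lower())
    return f"{label} | {' '.join(tokens)}"


# Sample documents, assumed for illustration only.
docs = [
    (1, "The striker scored: a late goal!"),
    (2, "Stock markets fell | inflation rose"),
]

lines = [to_vw_line(label, text) for label, text in docs]
print("\n".join(lines))
```

The resulting lines can be written to a file and passed to the vw command-line tool for training.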
Edited in April 2015 to reflect the current state of the scikit-learn library and to fix broken links.