I have to deal with a text classification problem. A web crawler crawls webpages of a certain domain and for each webpage I want to find out whether it belongs to only one specific class or not. That is, if I call this class Positive, each crawled webpage belongs either to class Positive or to class Non-Positive.
I already have a large training set of webpages for class Positive. But how do I create a training set for class Non-Positive that is as representative as possible? In principle, almost anything could go into that class. Can I just collect some arbitrary pages that definitely do not belong to class Positive? I'm sure the performance of a text classification algorithm (I'd prefer to use a Naive Bayes algorithm) depends heavily on which webpages I choose for class Non-Positive.
So what should I do? Can somebody please give me some advice? Thank you very much!
The Spy EM algorithm solves exactly this problem.
S-EM is a text learning or classification system that learns from a set of positive and unlabeled examples (no negative examples). It is based on a "spy" technique, the naive Bayes classifier and the EM algorithm.
The basic idea is to combine your positive set with a whole bunch of randomly crawled documents. You initially treat all the crawled documents as the negative class and learn a naive Bayes classifier on that set. Of course some of those crawled documents will actually be positive, so the "spy" trick is to hide a few known positives among them: any crawled document that scores higher than the lowest-scoring spy can be conservatively relabeled as positive, while documents scoring below every spy are kept as reliable negatives. You then retrain and iterate this process until the labeling stabilizes.
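To make the two steps concrete, here is a minimal sketch of that Spy EM idea using scikit-learn's `MultinomialNB` over term-count features. The function name, the 15% spy fraction, the 0.5 relabeling cutoff, and the iteration cap are my own illustrative choices, not prescribed by S-EM itself:

```python
# Sketch of the Spy EM idea: learn from positive + unlabeled documents only.
# Assumes X_pos and X_unlabeled are non-negative count matrices (bag of words).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def spy_em(X_pos, X_unlabeled, spy_frac=0.15, n_iter=10, seed=0):
    """Return (reliable-negative mask over X_unlabeled, final classifier)."""
    rng = np.random.default_rng(seed)
    n_pos = X_pos.shape[0]

    # Hide a few "spy" positives inside the unlabeled pool.
    n_spy = max(1, int(spy_frac * n_pos))
    spy_mask = np.zeros(n_pos, dtype=bool)
    spy_mask[rng.choice(n_pos, size=n_spy, replace=False)] = True

    # Step 1: treat (unlabeled + spies) as the negative class and train NB.
    X = np.vstack([X_pos[~spy_mask], X_unlabeled, X_pos[spy_mask]])
    y = np.concatenate([np.ones(n_pos - n_spy),
                        np.zeros(X_unlabeled.shape[0] + n_spy)])
    clf = MultinomialNB().fit(X, y)

    # Any unlabeled doc scoring below every spy is a "reliable negative".
    threshold = clf.predict_proba(X_pos[spy_mask])[:, 1].min()
    reliable_neg = clf.predict_proba(X_unlabeled)[:, 1] < threshold

    # Step 2: retrain on positives vs. reliable negatives and relabel the
    # unlabeled pool, iterating until the split stops changing.
    for _ in range(n_iter):
        if not reliable_neg.any():
            break  # nothing scored below the spies; give up gracefully
        X_train = np.vstack([X_pos, X_unlabeled[reliable_neg]])
        y_train = np.concatenate([np.ones(n_pos),
                                  np.zeros(int(reliable_neg.sum()))])
        clf = MultinomialNB().fit(X_train, y_train)
        new_neg = clf.predict_proba(X_unlabeled)[:, 1] < 0.5
        if np.array_equal(new_neg, reliable_neg):
            break  # labeling has stabilized
        reliable_neg = new_neg
    return reliable_neg, clf
```

In your setting, X_pos would be the count vectors of your Positive webpages and X_unlabeled the vectors of arbitrary crawled pages; the returned mask picks out the crawled pages you can safely use as the Non-Positive training set.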