I have to deal with a text classification problem. A web crawler crawls webpages of a certain domain, and for each webpage I want to find out whether it belongs to one specific class or not. That is, if I call this class *Positive*, each crawled webpage belongs either to class *Positive* or to class *Non-Positive*.

I already have a large training set of webpages for class *Positive*. But how do I create a training set for class *Non-Positive* that is as representative as possible? I could basically use anything at all for that class. Can I just collect some arbitrary pages that definitely do not belong to class *Positive*? I'm sure the performance of a text classification algorithm (I would prefer to use Naive Bayes) depends heavily on which webpages I choose for class *Non-Positive*.

So what should I do? Can somebody please give me some advice? Thank you very much!

**Answer**

The Spy EM algorithm solves exactly this problem.

S-EM is a text learning or classification system that learns from a set of positive and unlabeled examples (no negative examples). It is based on a “spy” technique, naive Bayes and EM algorithm.

The basic idea is to combine your positive set with a whole bunch of randomly crawled documents. You initially treat all the crawled documents as the negative class and train a naive Bayes classifier on that set. Some of those crawled documents will actually be positive, so you can conservatively relabel as positive any crawled document that scores higher than the lowest-scoring true positive document. Then you retrain and repeat this process until the labels stabilize.
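The relabeling loop described above can be sketched in a few lines of Python. This is only a minimal illustration of that loop, not the full S-EM system (its spy and EM steps are left out). The function names `train_nb`, `score` and `relabel_until_stable`, the add-one smoothing, the `max_iter` cap and the toy documents are all assumptions made for the example; documents are assumed to be pre-tokenized lists of words.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a tiny multinomial naive Bayes model with add-one smoothing."""
    vocab = {w for d in docs for w in d}
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for d, y in zip(docs, labels):
        counts[y].update(d)
    model = {}
    for y in (0, 1):
        total = sum(counts[y].values()) + len(vocab)
        # Smoothed class prior, so a class with no documents never hits log(0).
        prior = math.log((n_docs[y] + 1) / (len(docs) + 2))
        word_lp = {w: math.log((counts[y][w] + 1) / total) for w in vocab}
        model[y] = (prior, word_lp)
    return model

def score(model, doc):
    """Log-odds that doc is positive; words unseen in training are ignored."""
    def loglik(y):
        prior, word_lp = model[y]
        return prior + sum(word_lp[w] for w in doc if w in word_lp)
    return loglik(1) - loglik(0)

def relabel_until_stable(positives, unlabeled, max_iter=20):
    """Return one flag per unlabeled doc: True if it was relabeled positive."""
    is_pos = [False] * len(unlabeled)  # start: every crawled doc is negative
    for _ in range(max_iter):
        docs = positives + unlabeled
        labels = [1] * len(positives) + [int(p) for p in is_pos]
        model = train_nb(docs, labels)
        # Lowest score among the known (true) positive documents.
        threshold = min(score(model, d) for d in positives)
        new_is_pos = [p or score(model, d) > threshold
                      for p, d in zip(is_pos, unlabeled)]
        if new_is_pos == is_pos:  # nothing changed: labels have stabilized
            break
        is_pos = new_is_pos
    return is_pos

# Toy data (hypothetical): the third crawled page is really about cats.
positives = [["cat", "kitten"], ["cat", "purr"], ["kitten", "purr", "cat"]]
unlabeled = [["stock", "market"], ["market", "trade"],
             ["cat", "kitten", "purr"], ["stock", "trade", "bank"]]
flags = relabel_until_stable(positives, unlabeled)
print(flags)  # → [False, False, True, False]
```

Documents that are still flagged `False` when the loop stops can then serve as the *Non-Positive* training set. With real crawled pages you would probably want a proper tokenizer and a library classifier instead of this hand-rolled one.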

**Attribution**

*Source: Link, Question Author: pemistahl, Answer Author: rrenaud*