I am applying a random forest classifier to a microarray dataset that is split into two known groups and has thousands of features. After the initial run I look at the feature importances and rerun the algorithm with the 5, 10 and 20 most important features. I find that for all features, and for the top 10 and top 20, the OOB estimate of the error rate is 1.19%, whereas for the top 5 features it is 0%. This seems counter-intuitive to me, so I was wondering whether you could explain whether I am missing something or using the wrong metric.
I am using the randomForest package in R with ntree=1000, nodesize=1 and mtry=sqrt(n).
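For readers outside R, a roughly equivalent setup can be sketched in scikit-learn; the parameter mapping (ntree → n_estimators, nodesize → min_samples_leaf, mtry=sqrt(p) → max_features="sqrt") and the toy data standing in for the microarray matrix are my assumptions, not the asker's actual code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the microarray data: 50 samples, 1000 features,
# two known groups (the real dataset and labels are not shown in the question).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))
y = np.repeat([0, 1], 25)

# ntree=1000 -> n_estimators, nodesize=1 -> min_samples_leaf,
# mtry=sqrt(p) -> max_features="sqrt"; oob_score=True gives the OOB estimate.
rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=1,
                            max_features="sqrt", oob_score=True,
                            random_state=0).fit(X, y)

print("OOB accuracy:", rf.oob_score_)
top5 = np.argsort(rf.feature_importances_)[-5:]  # 5 most important features
```

Refitting on `X[:, top5]` and reading `oob_score_` again reproduces the workflow described in the question.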
This is feature-selection overfitting, and it is a well-known problem; see Ambroise & McLachlan (2002).
The problem stems from two facts: RF is very flexible, and the number of objects is small. With few objects, it is easy for a randomly generated attribute to correlate well with the decision purely by chance. And when the number of attributes is large, you can be almost certain that some completely irrelevant ones will be very good predictors, even good enough to form a cluster that can reproduce the decision with 100% accuracy, especially given the flexibility of RF. So when the FS procedure is instructed to find the best possible subset of attributes, it naturally finds exactly this cluster.
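The effect is easy to reproduce on pure noise. The sketch below (my own illustration, not code from the question; it uses a crude class-mean-difference score and a nearest-centroid classifier in place of RF importance and RF itself, to keep it dependency-free) compares a "biased" evaluation, where the top features are selected on the full dataset before cross-validation, against an "honest" one, where selection is redone inside each fold:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 2000, 10          # few samples, many pure-noise features
X = rng.standard_normal((n, p))
y = np.repeat([0, 1], n // 2)   # labels carry no real signal at all

def top_k(X, y, k):
    # crude importance proxy: absolute difference of class means per feature
    score = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(score)[-k:]

def loocv_accuracy(X, y, select_inside_fold):
    correct = 0
    for i in range(n):
        train = np.arange(n) != i
        # biased variant selects features using ALL data, including the test point
        cols = top_k(X[train], y[train], k) if select_inside_fold else top_k(X, y, k)
        Xtr = X[train][:, cols]
        c0 = Xtr[y[train] == 0].mean(axis=0)   # class centroids
        c1 = Xtr[y[train] == 1].mean(axis=0)
        x = X[i, cols]
        pred = int(np.sum((x - c1) ** 2) < np.sum((x - c0) ** 2))
        correct += pred == y[i]
    return correct / n

biased = loocv_accuracy(X, y, select_inside_fold=False)  # selection leak
honest = loocv_accuracy(X, y, select_inside_fold=True)   # selection per fold
print(f"biased CV accuracy: {biased:.2f}, honest CV accuracy: {honest:.2f}")
```

On data with no signal, the biased estimate comes out far above chance while the honest one hovers around 50%, which is precisely why the post-selection 0% OOB error is not to be trusted.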
One solution (cross-validating the whole selection procedure) is given in Ambroise & McLachlan; you can also test our approach to the topic, the Boruta algorithm. It basically extends the attribute set with "shadow attributes" that are made random by design and compares their RF importance to that obtained for the real attributes, to judge which of them are indeed random and can be removed; this is replicated many times to be significant. Boruta is intended for a somewhat different task, but as far as my tests showed, the resulting set is free of the FS-overfit problem.
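The shadow-attribute idea can be sketched in a few lines. This is my simplified single-round illustration, not the actual Boruta implementation: it substitutes a class-mean-difference score for RF importance, and real Boruta repeats the comparison many times with statistical testing rather than a single max-threshold cut:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 200
X = rng.standard_normal((n, p))
y = np.repeat([0, 1], n // 2)
X[:, :3] += 2.0 * y[:, None]   # plant 3 genuinely informative features

def importance(Z, y):
    # stand-in for RF importance: absolute difference of class means
    return np.abs(Z[y == 0].mean(axis=0) - Z[y == 1].mean(axis=0))

# One Boruta-style round: shadow attributes are column-wise permutations of the
# real ones, so they are random by construction but keep each feature's
# marginal distribution.
shadows = rng.permuted(X, axis=0)          # destroys any feature-label link
threshold = importance(shadows, y).max()   # best importance chance achieved
keep = importance(X, y) > threshold
print("kept features:", np.flatnonzero(keep))
```

Attributes that cannot beat the best purely random shadow are judged irrelevant; the planted features clear the bar easily, while the bulk of the noise features fall below it.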