Does it make sense to use feature selection before Random Forest?

Everything is in the title: does it make sense to use feature selection before using random forest?


Yes, it does, and it is quite common, especially if you expect that more than ~50% of your features are not merely redundant but utterly useless. E.g. the randomForest package has the wrapper function rfcv(), which pretrains a randomForest and omits the least important variables. For the rfcv function, refer to this chapter.
Remember to embed feature selection + modeling in an outer cross-validation loop to avoid overoptimistic results.
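To illustrate the outer loop, here is a minimal sketch in Python with scikit-learn rather than R. It is not a port of rfcv(): the synthetic data, the median threshold and the use of scikit-learn's impurity-based importances are all assumptions made for the example. The point is only that the selection step sits inside the pipeline, so it is refit on the training part of every outer fold and never sees the held-out part.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (assumption for illustration): 5 informative features, 35 noise features.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)

pipeline = Pipeline([
    # Pre-train a forest and keep the features whose importance is at least the median.
    # Note: SelectFromModel uses impurity-based importances here, not permutation importance.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold="median",
    )),
    # Final forest, trained only on the selected features.
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Outer loop: selection + modeling are refit from scratch within each training fold.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=outer_cv)
print(scores.mean())
```

Selecting features on the full data set first and cross-validating only the final model would let the held-out folds influence the selection, which is exactly the overoptimism the outer loop is meant to prevent.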

[edit below]

I should moderate “utterly useless”. Unlike e.g. regression with lasso regularization, a single random forest will most often not completely ignore features, even if these (in simulated hindsight) were random features. Decision tree splits are chosen by local criteria in each of the thousands or millions of nodes and cannot later be undone.
I do not advocate cutting features down to one superior selection, but for some data sets it is possible to achieve a substantial increase in prediction performance (estimated by a repeated outer cross-validation) using this variable selection. A typical finding is that keeping 100% of the features, or only a few percent, works less well, while there is a broad middle range with similar estimated prediction performance.

Perhaps a reasonable rule of thumb: when one expects that lasso-like regularization would serve better than ridge-like regularization for a given problem, one could try pre-training a random forest, ranking the features by the inner out-of-bag cross-validated variable importance, and dropping some of the least important features. Variable importance quantifies how much the cross-validated prediction performance of the model decreases when a given feature is permuted (values shuffled) after training, before prediction. One will never be certain whether one specific feature should be included or not, but it is likely much easier to predict with the top 5% of features than with the bottom 5%.
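The permutation idea can be sketched in a few lines of plain Python. This is only an illustration of the mechanism: `toy_predict` is a hypothetical stand-in for an already trained model, the data are made up, and a fixed evaluation set is used where a random forest would use each tree's out-of-bag samples.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        # Shuffle column j only; the model itself is not retrained.
        column = [row[j] for row in rows]
        rng.shuffle(column)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        importances.append(baseline - accuracy(predict, permuted, labels))
    return importances

# Toy data (assumption): the label depends only on feature 0; feature 1 is noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]

def toy_predict(row):
    # Hypothetical "trained model" that has learned the true rule.
    return int(row[0] > 0.5)

importances = permutation_importance(toy_predict, rows, labels, n_features=2)
print(importances)
```

Shuffling the noise feature leaves the accuracy unchanged, so its importance is zero, while shuffling feature 0 costs a large share of the accuracy. Ranking features by this drop and discarding the bottom of the list is the selection step described above.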

From a practical point of view, computational run time could be lowered, and maybe some resources could be saved, if there is a fixed acquisition cost per feature.

Source : Link , Question Author : Marc Lamberti , Answer Author : Soren Havelund Welling
