Random forest: advantages/disadvantages of selecting randomly subset features for every tree vs for every node?

There are two ways to select the subset of features used during tree construction in a random forest:

According to Leo Breiman in “Random Forests”:

“… random forest with random features is formed by selecting at
random, at each node, a small group of input variables to split on.”

Tin Kam Ho used the “random subspace method”, where each tree is built on a single random subset of features.

I can imagine that selecting a subset of features at each node is superior, since correlated variables can still be involved in the construction of the whole tree, whereas if we select a subset of features for each tree, one of the correlated variables may lose its importance. A rough sketch of the two schemes is below.
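To make the difference concrete, here is a minimal sketch (not taken from either paper) of where the random draw happens in each scheme; the subset size m, the feature count, and the function names are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, m = 20, 4  # m = size of the random feature subset (illustrative)

# Random subspace method (Ho): one subset is drawn per tree, and every node
# in that tree can only split on these m features.
def grow_tree_random_subspace(X, y):
    feature_subset = rng.choice(n_features, size=m, replace=False)
    # ... build the whole tree using only X[:, feature_subset] ...
    return feature_subset

# Random forest (Breiman): a fresh subset is drawn at every node, so over
# many nodes a single tree can still end up using most of the features.
def split_node_random_forest(X_node, y_node):
    candidate_features = rng.choice(n_features, size=m, replace=False)
    # ... choose the best split among only these candidate features ...
    return candidate_features
```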

Are there any other reasons why one method can perform better than the other one?

Answer

The general idea is that both Bagging and Random Forests are methods for variance reduction. This means that they work well with estimators that have LOW BIAS and HIGH VARIANCE (estimators that overfit, to put it simply). Moreover, averaging the estimators works best if they are UNCORRELATED with each other.
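A quick simulation (my own, not part of the original answer) illustrates why the correlation matters: for B identically distributed estimators with variance sigma^2 and pairwise correlation rho, the variance of their average is rho*sigma^2 + (1 - rho)*sigma^2 / B, so only the second term shrinks with more trees. The numbers below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
B, sigma2, n_draws = 100, 1.0, 50_000

def var_of_average(rho):
    # Draw B estimators with variance sigma2 and pairwise correlation rho,
    # then measure the variance of their average empirically.
    cov = np.full((B, B), rho * sigma2)
    np.fill_diagonal(cov, sigma2)
    draws = rng.multivariate_normal(np.zeros(B), cov, size=n_draws)
    return draws.mean(axis=1).var()

for rho in (0.0, 0.3, 0.7):
    theory = rho * sigma2 + (1 - rho) * sigma2 / B
    print(f"rho={rho}: simulated {var_of_average(rho):.3f}, theory {theory:.3f}")
```

With rho = 0 the variance of the average drops to sigma^2 / B, but with rho = 0.7 it stays near 0.7*sigma^2 no matter how many trees are averaged.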

Decision trees are perfect for this job because, in particular when fully grown, they can learn very complex interactions (and therefore have low bias), but are very sensitive to the input data (high variance).

Both sampling strategies have the goal of reducing the correlation between the trees, which reduces the variance of the averaged ensemble (I suggest Elements of Statistical Learning, Chap. 15 for clarifications).
However, while sampling features at every node still allows the trees to see most variables (in different orders) and learn complex interactions, using a subsample for every tree greatly limits the amount of information a single tree can learn. Trees grown in this fashion tend to be shallower and to have much higher bias, in particular on complex datasets. It is true that trees built this way will be less correlated with each other, as they are often built on completely different subsets of features, but in most scenarios this does not outweigh the increase in bias, so per-tree sampling gives worse performance on most use cases.
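If you want to check this yourself, one rough way to compare the two schemes with scikit-learn (assuming it is installed) is to pit a RandomForestClassifier, whose max_features is applied at every split, against a BaggingClassifier, whose max_features draws one fixed feature subset per base tree, which approximates the random subspace method. The dataset and hyperparameters here are arbitrary, and the exact scores will depend on the problem:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=0)

# Per-node sampling (Breiman): a new candidate subset at every split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0)

# Per-tree sampling (Ho's random subspace method): each base tree (the
# default base estimator is a decision tree) sees only 7 of the 50 features.
subspace = BaggingClassifier(n_estimators=200, max_features=7,
                             random_state=0)

for name, model in [("per-node (random forest)", rf),
                    ("per-tree (random subspace)", subspace)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```

On datasets with many interacting features you would typically expect the per-node variant to come out ahead, for the bias reasons given above.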

Attribution
Source: Link, Question Author: kkk, Answer Author: Davide ND
