# In Random Forest, why is a random subset of features chosen at the node level rather than at the tree level?

My Question: Why does random forest consider random subsets of features for splitting at the node level within each tree rather than at the tree level?

Background: This is something of a history question. Tin Kam Ho published this paper on constructing “decision forests” by randomly selecting a subset of features to use for growing each tree in 1998. Several years later, in 2001, Leo Breiman published his seminal Random Forest paper, wherein the feature subset is randomly selected at each node within each tree, not at each tree. While Breiman cited Ho, he did not specifically explain the move from tree-level to node-level random feature selection.

I’m wondering what specifically motivated this development. It seems that selecting the feature subset at the tree level would still accomplish the desired decorrelation of the trees.

My theory: I haven’t seen this articulated elsewhere, but it seems like the random subspace method would be less efficient in terms of getting estimates of feature importance. To obtain estimates of variable importance, for each tree, the features are randomly permuted one by one, and the increase in misclassification or increase in error for the out-of-bag observations is recorded. The variables for which the misclassification or error increase resulting from this random permutation is high are those with the greatest importance.

If we use the random subspace method, for each tree, we are only considering $m$ of the $p$ features. It may take several trees to consider all $p$ predictors even once. On the other hand, if we consider a different subset $m_i$ of the $p$ features at each node, we will consider each feature more times after fewer trees, giving us a more robust estimate of the feature importance.

What I’ve looked at thus far: So far, I have read Breiman’s paper and Ho’s paper, and done a broad online search for comparisons of the methods without finding a definitive answer. Note that a similar question was asked before. This question goes a bit further by including my speculation/work toward a possible solution. I would be interested in any answers, relevant citations, or simulation studies comparing the two approaches. If none are forthcoming, I plan to run my own simulation comparing the two methods.