I’m currently fitting random forests for a classification problem using the `randomForest` package in R, and am unsure about how to report training error for these models.
My training error is close to 0% when I compute it from predictions obtained by calling `predict` on the fitted model with the training data, `X_train`.
In an answer to a related question, I read that one should use the out-of-bag (OOB) training error as the training error metric for random forests. This quantity is computed from the predictions that `predict` returns when it is called on the fitted model without any new data.
In this case, the OOB training error is much closer to the mean 10-CV test error, which is 11%.
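A minimal sketch of the two computations, assuming the `randomForest` package is installed. The built-in `iris` data is only a stand-in for the actual data, and `X_train`, `y_train`, and `model` are placeholder names.

```r
library(randomForest)

set.seed(42)
X_train <- iris[, 1:4]   # stand-in predictors (assumption: not the original data)
y_train <- iris$Species  # factor response, so a classification forest is fit

model <- randomForest(x = X_train, y = y_train)

# Predictions on the training data: every tree votes,
# including the trees whose bootstrap sample contained the point
pred_train <- predict(model, newdata = X_train)
train_err  <- mean(pred_train != y_train)

# Without newdata, predict() returns the out-of-bag predictions:
# each point is voted on only by trees that did not see it during fitting
pred_oob <- predict(model)
oob_err  <- mean(pred_oob != y_train, na.rm = TRUE)

c(train = train_err, oob = oob_err)
```

On most data sets the first number should come out at or near zero, while the second should sit much closer to a cross-validated estimate.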
I am wondering:
1. Is it generally accepted to report OOB training error as the training error measure for random forests?
2. Is it true that the traditional measure of training error is artificially low?
3. If the traditional measure of training error is artificially low, then what two measures can I compare to check if the RF is overfitting?
To add to @Soren H. Welling’s answer.
1. Is it generally accepted to report OOB training error as the training error measure for random forests?
No. OOB error on the trained model is not the same as training error. It can, however, serve as a measure of predictive accuracy.
2. Is it true that the traditional measure of training error is artificially low?
This is true if we are running a classification problem using default settings. The exact process is described in a forum post by Andy Liaw, who maintains the `randomForest` package in R, as follows:
For the most part, performance on training set is meaningless. (That’s
the case for most algorithms, but especially so for RF.) In the default
(and recommended) setting, the trees are grown to the maximum size,
which means that quite likely there’s only one data point in most
terminal nodes, and the prediction at the terminal nodes are determined
by the majority class in the node, or the lone data point. Suppose that
is the case all the time; i.e., in all trees all terminal nodes have
only one data point. A particular data point would be “in-bag” in about
64% of the trees in the forest, and every one of those trees has the
correct prediction for that data point. Even if all the trees where
that data points are out-of-bag gave the wrong prediction, by majority
vote of all trees, you still get the right answer in the end. Thus
basically the perfect prediction on train set for RF is “by design”.
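The in-bag fraction that the argument in the quoted post relies on can be checked in base R. This is only a sketch of the worst case the post describes (in-bag trees always right, out-of-bag trees always wrong); the sizes `n` and `ntree` are arbitrary choices. For a bootstrap sample of size N drawn from N points, the chance that a given point is in-bag is 1 - (1 - 1/N)^N, which tends to 1 - 1/e, roughly 63%.

```r
# Probability that a given point appears in a bootstrap sample of size n
in_bag_prob <- function(n) 1 - (1 - 1 / n)^n
in_bag_prob(1000)  # close to 1 - exp(-1)

set.seed(1)
n     <- 200  # number of training points (arbitrary)
ntree <- 500  # number of trees (arbitrary)

# in_bag[i, t] is TRUE when point i is drawn into the bootstrap sample of tree t
in_bag <- sapply(seq_len(ntree), function(t) {
  seq_len(n) %in% sample.int(n, size = n, replace = TRUE)
})

# Share of trees for which each point is in-bag
frac_in_bag <- rowMeans(in_bag)

# Worst case: in-bag trees vote correctly, out-of-bag trees vote incorrectly.
# The majority vote is still correct whenever the in-bag share exceeds one half.
majority_correct <- frac_in_bag > 0.5
mean(majority_correct)
```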
To avoid this behavior, one can set `nodesize > 1` (so that the trees are not grown to maximum size) and/or set `sampsize < 0.5N` (so that fewer than 50% of the trees are likely to contain a given point (x_i, y_i)).
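As a hedged illustration, the sketch below fits one forest with the default settings and one with a larger `nodesize` and a smaller `sampsize`, then computes the training error of each. The `iris` data and the specific values (`nodesize = 10`, 40% of the rows) are assumptions chosen only for the example.

```r
library(randomForest)

set.seed(42)
X <- iris[, 1:4]
y <- iris$Species
n <- nrow(X)

# Default classification settings: nodesize = 1, bootstrap sample of size n
rf_default <- randomForest(x = X, y = y)

# Shallower trees grown on smaller samples (illustrative values only)
rf_small <- randomForest(x = X, y = y,
                         nodesize = 10,
                         sampsize = floor(0.4 * n))

# Training error: predictions from all trees on the data used for fitting
train_error <- function(fit) mean(predict(fit, newdata = X) != y)

c(default = train_error(rf_default), restricted = train_error(rf_small))
```

With the restricted settings the training error is no longer forced toward zero, so it becomes a somewhat more informative number, though still an optimistic one.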
3. If the traditional measure of training error is artificially low, then what two measures can I compare to check if the RF is overfitting?
If we run RF with `nodesize = 1` and `sampsize > 0.5N`, then the training error of the RF will always be near 0. In this case, the only way to tell if the model is overfitting is to keep some data as an independent validation set. We can then compare the 10-CV test error (or the OOB test error) to the error on the independent validation set. If the 10-CV test error is much lower than the error on the independent validation set, then the model may be overfitting.
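A minimal sketch of that comparison, again using `iris` as a stand-in. The 30% holdout fraction and the names `train`, `valid`, and `fit` are assumptions made for the example.

```r
library(randomForest)

set.seed(42)
n <- nrow(iris)
holdout_idx <- sample.int(n, size = floor(0.3 * n))  # 30% held out (arbitrary split)

train <- iris[-holdout_idx, ]
valid <- iris[holdout_idx, ]

fit <- randomForest(x = train[, 1:4], y = train$Species)

# OOB error: estimated from the training rows only
oob_err <- mean(predict(fit) != train$Species, na.rm = TRUE)

# Error on rows the forest never saw during fitting
valid_err <- mean(predict(fit, newdata = valid[, 1:4]) != valid$Species)

c(oob = oob_err, validation = valid_err)
```

A validation error far above the OOB (or cross-validated) error would be the warning sign. On a small data set a single split is noisy, so it may be worth repeating this over several seeds before drawing a conclusion.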