When we assess the quality of a Random Forest, for example using AUC, is it more appropriate to compute these quantities over the Out of Bag Samples or over the hold out set of cross validation?

I hear that computing it over the OOB Samples gives a more pessimistic assessment, but I don’t see why.

**Answer**

**Note:** *While I feel that my answer is probably correct, I also feel doubtful due to the fact that I made all this up by thinking about this problem only after reading this question for about 30-60 minutes. So you better be sceptical and scrutinize this and not get fooled by my possibly overly confident writing style (me using big words and fancy Greek symbols doesn’t mean I am right).*

# Summary

This is just a summary. All details are mentioned in sections \S1 and \S2 below.

Let’s assume the case of classification (can be extended to regression too, but omit for brevity). Essentially, our goal is to estimate the error of a forest of trees. Both out-of-bag error and k-fold cross-validation try to tell us the probability that:

- The forest gives the correct classification (k-fold cross-validation looks at it this way).

Which is identical to the probability that:

- The majority vote of forest’s trees is the correct vote (OOBE looks at it this way).

And both are identical. The only difference is that k-fold cross-validation and OOBE assume different size of learning samples. For example:

- In 10-fold cross-validation, the learning set is 90%, while the testing set is 10%.
- However, in OOBE if each bag has n samples, such that n = total number of samples in the whole samples set, then this implies that the learning set is practically about 66% (two third) and the testing set is about 33% (one third).

Therefore in my view the only reason why OOBE is a pessimistic estimation of forest’s error is only because it *usually* trains by a smaller number of samples than usually done with k-fold cross-validation (where 10 folds is common).

Due to that, I also think that 2-fold cross-validation is going to be more pessimistic estimation of forest’s error than OOBE, and 3-fold cross-validation to be approximately equally pessimistic to OOBE.

# 1. Understanding out-of-bag error

### 1.1 Common view on bagging

Each tree in RF is grown by a list of n samples that are *randomly* drawn from the learning set \mathcal{X} with replacement. This way, the n many samples can have duplicates, and if n = |\mathcal{X}| then it can be found that approximately one 3rd of the samples in \mathcal{X} are likely to end up not being in the list of n samples that are used to grow a given tree (these are the out-of-bag samples of this specific tree. This process is independently repeated for each tree, so each tree has a different set of out-of-bag samples.

### 1.2. Another view on bagging

Now, let’s re-describe bagging a bit differently with the hope of finding an equal description that is hopefully simpler to deal with.

I do this by stating that the tree t is trained by the bagged samples in the set \mathcal{X}_t \subseteq \mathcal{X}. However, this is not exactly true as the set \mathcal{X}_t does not have duplicated samples (this is how sets work), while -on the other hand- the n list of samples can have duplicates.

Therefore, we can say that a tree t is grown by analysing samples \mathcal{X}_t *plus* a number of randomly chosen duplicates drawn from \mathcal{X}_t, namely \mathcal{X}_{t,1}, \mathcal{X}_{t,2}, \ldots, \mathcal{X}_{t,r} \subseteq \mathcal{X}_t, such that:

\begin{equation}

|\mathcal{X}_t| + \sum_{i=1}^r|\mathcal{X}_{t,i}| = n

\end{equation}

It is trivial to see that from this collection of sets \mathcal{C} = \{\mathcal{X}_t, \mathcal{X}_{t,1}, \ldots, \mathcal{X}_{t,r}\}, we can define a list of n-many samples that contain duplicates by simply appending the elements in each set \mathcal{C}_i \in \mathcal{C} to an array a. This way, for any 1 \le p \le n, there exists at least one value of i such that a[p] \in \mathcal{C}_i.

We can also see that the list of n samples in the array a is a generalization of bagging as I defined in Section 1. It is trivial to see that for some specific definition of \mathcal{X}_t that I have defined in this section (\S2), list of samples in array a can be exactly *identical* to the list of samples as defined in Section 1.

### 1.3. Simplifying bagging

Instead of growing tree t by samples in array a, we will grow them by the duplication-free list of instances that are found in \mathcal{X}_t only.

I believe that, if n is large enough, a tree t that is grown by analysing samples in \mathcal{X}_t is identical to another tree t’ that is grown from the samples in array a.

My reason is that, the probability of duplicating samples in \mathcal{X}_t is equally likely across other samples in the same set. This means that, when we measure the information gain (IG) of some split, the IG will remain identical as the entropies will remain identical too.

And the reason I believe entropies won’t change systematically for a given split is because the empirically measured probability of a sample having a specific label in some sub-set (after applying a decision split) won’t change either.

And the reason the probabilities shouldn’t change in my view is that that all samples in \mathcal{X}_t are equally likely to be duplicated into d copies.

### 1.4 Measuring out-of-bag errors

Let \mathcal{O}_t be the out-of-bag samples of tree t. I.e. \mathcal{O}_t = \mathcal{X} \setminus \mathcal{X}_t. Then the error of a single tree t is:

\begin{equation}

\frac{\text{total $\mathbf{x}$ in $\mathcal{O}_t$ correctly classified by $t$}}{|\mathcal{O}_t|}

\end{equation}

And the total error of the forest with n_t many trees is:

\begin{equation}

\frac{\sum_{t=1}^{n_t} \text{total $\mathbf{x}$ in $\mathcal{O}_t$ correctly classified by $t$}}{\sum_{t=1}^{n_t}|\mathcal{O}_t|}

\end{equation}

which can be thought as the empirically measured *probability that the majority vote of all trees in a forest is a correct vote*.

# 2. Understanding k-fold cross-validation

First we partition the learning set \mathcal{X} into n_k many equally-sized partitions namely \mathcal{K} = \{\mathcal{K}_1, \mathcal{K}_2, \ldots, \mathcal{K}_{n_k}\}. I.e. \mathcal{K}_1 \cup \mathcal{K}_2 \cup \ldots \cup \mathcal{K}_{n_k} = \mathcal{X}, and for any \mathcal{K}_i, \mathcal{K}_j \in \mathcal{K}, \mathcal{K}_i \cap \mathcal{K}_j = \emptyset (this is what portioning implies).

Let \mathcal{K}_t be the testing fold, and \mathcal{K} \setminus \{\mathcal{K}_t\} be the set of learning folds.

Let f be a forest of some trees that is built by using \mathcal{K} \setminus \{\mathcal{K}_t\} as the learning set.

Then, k-fold cross-validation of forest f is:

\begin{equation}

\frac{\sum_{t=1}^{n_k} \text{total $\mathbf{x}$ in $\mathcal{K}_t$ correctly classified by $f$}}{\sum_{t=1}^{n_k} |\mathcal{K}_t|}

\end{equation}

Which is also probability that forest f classifies any input sample correctly.

**Attribution***Source : Link , Question Author : user695652 , Answer Author : caveman*