# Random forest and prediction

I am trying to understand how Random Forest works. I have a grasp of how the trees are built, but I cannot understand how Random Forest makes predictions on the out-of-bag sample. Could anyone give me a simple explanation, please? :)

Each tree in the forest is built from a bootstrap sample of the observations in your training data. Those observations in the bootstrap sample build the tree, whilst those not in the bootstrap sample form the out-of-bag (or OOB) samples.
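As a minimal sketch of that split for a single tree (the size `n` and the names `in_bag` and `oob` are made up for illustration):

```r
set.seed(1)                                     # only so the sketch is repeatable
n <- 10                                         # hypothetical number of training observations
in_bag <- sample(n, size = n, replace = TRUE)   # bootstrap sample: n draws with replacement
oob <- setdiff(seq_len(n), in_bag)              # observations never drawn are out-of-bag

sort(unique(in_bag))   # rows that help build this tree
oob                    # rows this tree can give OOB predictions for
```

Because the draws are made with replacement, some rows appear several times in `in_bag` and others not at all; the latter are the OOB rows for this tree.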

It should be clear that the same variables are available for cases in the data used to build a tree as for the cases in the OOB sample. To get predictions for the OOB sample, each one is passed down the current tree and the rules for the tree followed until it arrives in a terminal node. That yields the OOB predictions for that particular tree.

This process is repeated a large number of times, each tree trained on a new bootstrap sample from the training data and predictions for the new OOB samples derived.
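The loop described so far might be sketched as follows. This is an illustration only, under loud assumptions: it uses `rpart` trees on the built-in `mtcars` data as a stand-in for the forest's own tree builder, `n_trees` is an arbitrary small number, and no random selection of variables is done, so it is really plain bagging rather than a random forest.

```r
library(rpart)   # assumption: rpart stands in for the forest's tree-growing code

set.seed(42)
dat <- mtcars    # built-in data; mpg is a continuous response
n <- nrow(dat)
n_trees <- 25    # hypothetical, kept small for illustration

# One row per training sample, one column per tree; NA = sample was in-bag
oob_pred <- matrix(NA_real_, nrow = n, ncol = n_trees,
                   dimnames = list(rownames(dat), paste0("tree", seq_len(n_trees))))

for (b in seq_len(n_trees)) {
  in_bag <- sample(n, replace = TRUE)          # new bootstrap sample for this tree
  oob <- setdiff(seq_len(n), in_bag)           # rows left out of it
  fit <- rpart(mpg ~ ., data = dat[in_bag, ])  # grow the tree on the in-bag rows
  oob_pred[oob, b] <- predict(fit, newdata = dat[oob, ])  # drop OOB rows down the tree
}

# Forest-level OOB prediction: mean over the trees where each row was OOB
oob_forest <- rowMeans(oob_pred, na.rm = TRUE)
```

The matrix `oob_pred` has the same layout as the toy `oob.p` matrix used in the worked example further down.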

As the number of trees grows, any one sample will be in the OOB sample for more than one tree, so the “average” of its predictions over those of the N trees for which it was OOB is used as the OOB prediction for that training sample after trees 1, …, N. By “average” we mean the mean of the predictions for a continuous response, or the majority vote for a categorical response (the majority vote is the class with the most votes over the set of trees 1, …, N).
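The continuous case is worked through with `rowMeans()` further down. For a categorical response, a majority vote could be sketched like this; the matrix `oob_class` and the helper `majority_vote()` are hypothetical, and breaking ties by taking the first class in sorted order is a simplification of my own.

```r
# Hypothetical OOB class predictions: 3 samples, 5 trees; NA = sample was in-bag
oob_class <- matrix(c("a", NA,  "b", "a", NA,
                      NA,  "b", "b", NA,  "a",
                      "c", "c", NA,  "a", NA),
                    nrow = 3, byrow = TRUE)

majority_vote <- function(votes) {
  counts <- table(votes)             # table() drops NA by default
  names(counts)[which.max(counts)]   # ties go to the first class in sorted order
}

apply(oob_class, 1, majority_vote)   # one OOB class per sample
```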

For example, assume we had the following OOB predictions for 10 samples in the training set on 10 trees:

``````
set.seed(123)
# Fake per-tree predictions: 10 samples (rows) by 10 trees (columns)
oob.p <- matrix(rpois(100, lambda = 4), ncol = 10)
colnames(oob.p) <- paste0("tree", seq_len(ncol(oob.p)))
rownames(oob.p) <- paste0("samp", seq_len(nrow(oob.p)))
# Blank out half the cells to mimic samples that were in-bag for a tree
oob.p[sample(length(oob.p), 50)] <- NA
oob.p

> oob.p
       tree1 tree2 tree3 tree4 tree5 tree6 tree7 tree8 tree9 tree10
samp1     NA    NA     7     8     2     1    NA     5     3      2
samp2      6    NA     5     7     3    NA    NA    NA    NA     NA
samp3      3    NA     5    NA    NA    NA     3     5    NA     NA
samp4      6    NA    10     6    NA    NA     3    NA     6     NA
samp5     NA     2    NA    NA     2    NA     6     4    NA     NA
samp6     NA     7    NA     4    NA     2     4     2    NA     NA
samp7     NA    NA    NA     5    NA    NA    NA     3     9      5
samp8      7     1     4    NA    NA     5     6    NA     7     NA
samp9      4    NA    NA     3    NA     7     6     3    NA     NA
samp10     4     8     2     2    NA    NA     4    NA    NA      4
``````

Here, `NA` means the sample was in the training data for that tree (in other words, it was not in the OOB sample).

The mean of the non-`NA` values in each row gives the OOB prediction for each sample, for the entire forest:

``````
> rowMeans(oob.p, na.rm = TRUE)
 samp1  samp2  samp3  samp4  samp5  samp6  samp7  samp8  samp9 samp10
  4.00   5.25   4.00   6.20   3.50   3.80   5.50   5.00   4.60   4.00
``````

As each tree is added to the forest, we can compute the OOB predictions, and hence the OOB error, up to and including that tree. For example, below are the cumulative means for each sample:

``````
FUN <- function(x) {
  na <- is.na(x)                            # trees where the sample was in-bag
  cs <- cumsum(x[!na]) / seq_len(sum(!na))  # running mean of the OOB predictions
  x[!na] <- cs                              # write the running means back in place
  x
}
t(apply(oob.p, 1, FUN))

> print(t(apply(oob.p, 1, FUN)), digits = 3)
       tree1 tree2 tree3 tree4 tree5 tree6 tree7 tree8 tree9 tree10
samp1     NA    NA  7.00  7.50  5.67  4.50    NA   4.6  4.33    4.0
samp2      6    NA  5.50  6.00  5.25    NA    NA    NA    NA     NA
samp3      3    NA  4.00    NA    NA    NA  3.67   4.0    NA     NA
samp4      6    NA  8.00  7.33    NA    NA  6.25    NA  6.20     NA
samp5     NA     2    NA    NA  2.00    NA  3.33   3.5    NA     NA
samp6     NA     7    NA  5.50    NA  4.33  4.25   3.8    NA     NA
samp7     NA    NA    NA  5.00    NA    NA    NA   4.0  5.67    5.5
samp8      7     4  4.00    NA    NA  4.25  4.60    NA  5.00     NA
samp9      4    NA    NA  3.50    NA  4.67  5.00   4.6    NA     NA
samp10     4     6  4.67  4.00    NA    NA  4.00    NA    NA    4.0
``````

In this way we see how the prediction is accumulated over the N trees in the forest up to a given iteration. If you read across the rows, the right-most non-`NA` value is the one shown above as the OOB prediction. That is how traces of OOB performance can be made: an RMSEP can be computed for the OOB samples from the OOB predictions accumulated over the N trees.
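To make the idea of a trace concrete, here is a rough sketch on a smaller made-up example. The prediction matrix `oob_pred` and the observed responses `y` are hypothetical. Unlike the cumulative-mean code above, this version carries each sample's running mean forward across trees where it was in-bag, and a sample only contributes to the error once it has been OOB at least once.

```r
# Hypothetical OOB predictions: 4 samples, 5 trees; NA = sample was in-bag
oob_pred <- matrix(c(NA,  3,  5, NA,  4,
                      2, NA, NA,  4,  3,
                      6,  7, NA, NA,  5,
                     NA, NA,  1,  3, NA),
                   nrow = 4, byrow = TRUE)
y <- c(5, 3, 6, 2)   # hypothetical observed responses

is_oob <- !is.na(oob_pred)
sums   <- t(apply(replace(oob_pred, !is_oob, 0), 1, cumsum))  # running sum of OOB predictions
counts <- t(apply(is_oob, 1, cumsum))                         # running count of OOB trees
cum_pred <- sums / counts   # NaN (0/0) until a sample has been OOB at least once

# RMSEP after each tree, over the samples that have an OOB prediction so far
rmsep <- apply(cum_pred, 2, function(p) sqrt(mean((y - p)^2, na.rm = TRUE)))
rmsep
```

Plotting `rmsep` against the tree index gives the kind of OOB error trace described above; the last column of `cum_pred` equals `rowMeans(oob_pred, na.rm = TRUE)`.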

Note that the R code shown is not taken from the internals of the randomForest package for R; I just knocked up some simple code so that you can follow what is going on once the predictions from each tree are determined.

OOB predictions can be provided for all samples in the training data because each tree is built from a bootstrap sample and a random forest contains a large number of trees, so each training-set observation is, in practice, in the OOB sample for one or more trees.
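A quick back-of-the-envelope check of that claim, assuming the bootstrap samples are drawn independently (the values of `n` and `n_trees` are arbitrary):

```r
n <- 100                          # hypothetical training set size
p_oob <- (1 - 1 / n)^n            # P(a given observation is OOB for one tree), about exp(-1)
n_trees <- c(1, 5, 10, 50)
p_never <- (1 - p_oob)^n_trees    # P(in-bag for every one of the trees)

round(p_oob, 3)
signif(p_never, 2)
```

Roughly a third of the observations are OOB for any one tree, so the chance that an observation is never OOB shrinks geometrically as trees are added.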

I have glossed over issues such as missing data for some OOB cases, but these issues also pertain to a single regression or classification tree. Also note that each tree in a forest considers only `mtry` randomly selected variables at each split, rather than all of them.
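As a small illustration of the `mtry` idea, the sketch below draws the candidate variables for a single split. The number of predictors `p` is hypothetical, and the two default values are, to the best of my knowledge, the ones the randomForest package uses.

```r
p <- 12                              # hypothetical number of predictors
mtry_class <- floor(sqrt(p))         # randomForest default for classification
mtry_reg   <- max(floor(p / 3), 1)   # randomForest default for regression

# Candidate variables for one split; a fresh draw is made at every split
candidates <- sample(p, mtry_reg)
candidates
```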