Is there a formula or rule for determining the correct sampsize for a randomForest?

I’m playing with a randomForest and have found that increasing sampsize generally leads to better performance. Is there a rule or formula that suggests what the optimal sampsize should be, or is it a trial-and-error thing? Put another way: what are the risks of a sampsize that is too small, or too large (overfitting?)?


This question refers to the R implementation of random forests in the randomForest package. The function randomForest has a parameter sampsize, which is described in the documentation as:

Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
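To make the two forms of the argument concrete, here is a small sketch in R using the built-in iris dataset (the dataset and the specific numbers are illustrative choices, not from the original post):

```r
library(randomForest)
set.seed(1)

# Scalar sampsize: each tree is grown on 50 rows drawn from the data
rf_plain <- randomForest(Species ~ ., data = iris, sampsize = 50)

# Vector sampsize with strata: draw 20 rows from each of the three
# Species classes when growing each tree (stratified sampling)
rf_strat <- randomForest(Species ~ ., data = iris,
                         strata = iris$Species,
                         sampsize = c(20, 20, 20))
```

In the stratified form, the length of the sampsize vector must match the number of strata, and each element gives the per-stratum draw.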

Answer

In general, the sample size for a random forest acts as a control on the “degree of randomness” involved, and thus as a way of adjusting the bias-variance tradeoff. Increasing the sample size results in a “less random” forest, which has a tendency to overfit. Decreasing the sample size increases the variation among the individual trees in the forest, which helps prevent overfitting, but usually at the expense of model performance. A useful side effect is that lower sample sizes reduce the time needed to train the model.

The usual rule of thumb for the best sample size is a “bootstrap sample”, a sample equal in size to the original dataset, but selected with replacement, so some rows are not selected, and others are selected more than once. This typically provides near-optimal performance, and is the default in the standard R implementation. However, you may find in real-world applications that adjusting the sample size can lead to improved performance. When in doubt, select the appropriate sample size (and other model parameters) using cross-validation.
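As a rough sketch of how one might compare candidate sample sizes in practice, the out-of-bag (OOB) error that randomForest reports can serve as a cheap stand-in for cross-validation (the dataset and candidate values below are illustrative assumptions):

```r
library(randomForest)
set.seed(42)

# Compare OOB error across a grid of sampsize values.
# rf$err.rate is an ntree-by-(1 + nclass) matrix; the "OOB" column
# at the final row gives the error after all trees are grown.
for (n in c(30, 60, 90, 120, 150)) {
  rf <- randomForest(Species ~ ., data = iris, sampsize = n, ntree = 500)
  cat("sampsize =", n, " OOB error =", rf$err.rate[500, "OOB"], "\n")
}
```

A proper cross-validation loop (or a tuning framework such as caret) would give a less optimistic estimate, but the OOB comparison is often enough to see whether moving away from the bootstrap default helps on a given dataset.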

Attribution
Source : Link , Question Author : screechOwl , Answer Author : Martin O’Leary
