I’m experimenting with randomForest and have found that increasing `sampsize` generally leads to better performance. Is there a rule or formula that suggests what the optimal `sampsize` should be, or is it a trial-and-error thing? Put another way: what are the risks of a `sampsize` that is too small, or too large (overfitting?)?

This question refers to the R implementation of random forests in the `randomForest` package. The `randomForest` function has a parameter `sampsize`, which is described in the documentation as:

> Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
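For illustration, here is a minimal sketch of the stratified usage described above. It uses the built-in `iris` data as a stand-in (not part of the original question), where `Species` defines three strata of 50 rows each:

```r
library(randomForest)

## Draw 25 rows from each of the three Species strata for every tree.
## sampsize is a vector with one entry per stratum; strata names the
## factor that defines the strata.
fit <- randomForest(Species ~ ., data = iris,
                    strata = iris$Species,
                    sampsize = c(25, 25, 25))
```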

**Answer**

In general, the sample size for a random forest acts as a control on the “degree of randomness” involved, and thus as a way of adjusting the bias-variance tradeoff. Increasing the sample size results in a “less random” forest, and so has a tendency to overfit. Decreasing the sample size increases the variation in the individual trees within the forest, preventing overfitting, but usually at the expense of model performance. A useful side-effect is that lower sample sizes reduce the time needed to train the model.
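As a rough illustration of this effect, the following sketch (using `iris` as a stand-in dataset, which is an assumption, not from the original question) fits forests with several values of `sampsize` and compares their out-of-bag error. The exact numbers will vary with the data and the random seed:

```r
library(randomForest)

set.seed(42)
for (n in c(30, 75, 150)) {   # 150 = a full bootstrap-sized sample for iris
  fit <- randomForest(Species ~ ., data = iris,
                      sampsize = n, ntree = 500)
  ## OOB error rate after the final tree has been added
  oob <- fit$err.rate[fit$ntree, "OOB"]
  cat(sprintf("sampsize = %3d  OOB error = %.3f\n", n, oob))
}
```

Smaller values of `sampsize` make each tree see less of the data, increasing tree-to-tree variation; larger values make the trees more similar to one another, which is the "less random" end of the tradeoff described above.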

The usual rule of thumb for the best sample size is a “bootstrap sample”, a sample equal in size to the original dataset, but selected with replacement, so some rows are not selected, and others are selected more than once. This typically provides near-optimal performance, and is the default in the standard R implementation. However, you may find in real-world applications that adjusting the sample size can lead to improved performance. When in doubt, select the appropriate sample size (and other model parameters) using cross-validation.
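A minimal sketch of that cross-validation approach, again assuming `iris` as a placeholder dataset and using a hand-rolled k-fold loop rather than any particular tuning package:

```r
library(randomForest)

set.seed(42)
k <- 5
## Randomly assign each row to one of k folds.
folds <- sample(rep(1:k, length.out = nrow(iris)))
candidates <- c(30, 60, 90, 120, 150)

cv_error <- sapply(candidates, function(n) {
  mean(sapply(1:k, function(i) {
    train <- iris[folds != i, ]
    test  <- iris[folds == i, ]
    ## Cap sampsize at the training-set size, since a candidate value
    ## may exceed the rows available after holding out a fold.
    fit <- randomForest(Species ~ ., data = train,
                        sampsize = min(n, nrow(train)), ntree = 500)
    mean(predict(fit, test) != test$Species)  # misclassification rate
  }))
})

names(cv_error) <- candidates
print(cv_error)  # pick the sampsize with the lowest CV error
```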

**Attribution**

*Source: Link, Question Author: screechOwl, Answer Author: Martin O’Leary*