I am exploring different classification methods for a project I am working on, and am interested in trying Random Forests. I am trying to educate myself as I go along, and would appreciate any help provided by the CV community.
I have split my data into training/test sets. From experimentation with random forests in R (using the randomForest package), I have been having trouble with a high misclassification rate for my smaller class. I have read this paper concerning the performance of random forests on imbalanced data, and the authors presented two methods for dealing with class imbalance when using random forests:
1. Weighted Random Forests
2. Balanced Random Forests
The R package does not allow weighting of the classes (according to the R help forums, the classwt parameter does not work properly, and a fix is planned for a future release), so I am left with option 2. I am able to specify the number of objects sampled from each class for each iteration of the random forest.
I feel uneasy about setting equal sample sizes for random forests, as I feel I would be losing too much information about the larger class, leading to poor performance on future data. The misclassification rate for the smaller class has improved when downsampling the larger class, but I was wondering whether there are other ways to deal with imbalanced class sizes in random forests?
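To make the balanced sampling idea concrete, here is a minimal sketch in plain Python. It is not the randomForest implementation: the function name `balanced_bootstrap`, the 950/50 class split, and the 500 trees are all hypothetical choices for illustration. It draws, for each tree, the same number of rows (with replacement) from every class, then checks how much of the larger class is used across the whole forest.

```python
import random
from collections import Counter, defaultdict

def balanced_bootstrap(labels, rng):
    """Hypothetical helper: row indices for one tree, with n_min draws
    (with replacement) from each class, where n_min is the smallest class size."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n_min))
    return sample

# Assumed toy data: 950 majority rows and 50 minority rows.
labels = ["major"] * 950 + ["minor"] * 50
rng = random.Random(0)

# Track which majority rows are seen by at least one tree.
n_trees = 500  # assumed forest size
used_major = set()
for _ in range(n_trees):
    sample = balanced_bootstrap(labels, rng)
    used_major.update(i for i in sample if labels[i] == "major")

# Class counts inside a single tree's sample, and forest-wide coverage.
first = balanced_bootstrap(labels, random.Random(1))
counts = Counter(labels[i] for i in first)
coverage = len(used_major) / 950

print(counts)
print(f"share of majority rows used by at least one tree: {coverage:.3f}")
```

Each individual tree sees only a small, balanced sample, but different trees draw different majority rows. In this toy setup almost every majority row ends up in at least one tree, so the forest as a whole discards far less information than a single downsampled training set would. How well that carries over to real data depends on the number of trees and the class sizes.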
If you don’t like those options, have you considered using a boosting method instead? Given an appropriate loss function, boosting automatically recalibrates the weights as it goes along. If the stochastic nature of random forests appeals to you, stochastic gradient boosting builds that in as well.
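As a rough illustration of the reweighting, here is one round of an AdaBoost-style update in plain Python. This assumes the exponential loss and the classic AdaBoost weight formula; gradient boosting with other losses adjusts through gradients rather than explicit observation weights. The function name `adaboost_reweight`, the 95/5 data, and the "always predict the majority class" weak learner are hypothetical.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost-style update: upweight misclassified rows, then normalize.
    Assumes the weighted error is strictly between 0 and 1."""
    pairs = list(zip(weights, y_true, y_pred))
    err = sum(w for w, t, p in pairs if t != p) / sum(weights)
    alpha = math.log((1 - err) / err)
    new = [w * math.exp(alpha) if t != p else w for w, t, p in pairs]
    total = sum(new)
    return [w / total for w in new], alpha

# Assumed toy data: 95 majority rows (label 0) and 5 minority rows (label 1).
y_true = [0] * 95 + [1] * 5
# Hypothetical weak learner that ignores the minority class entirely.
y_pred = [0] * 100
weights = [1 / 100] * 100

new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
minority_share = sum(w for w, t in zip(new_weights, y_true) if t == 1)

print(f"alpha = {alpha:.3f}")
print(f"total weight on the minority class after one round: {minority_share:.3f}")
```

After a single round, the misclassified minority rows carry about half of the total weight, so the next weak learner can no longer score well by ignoring them. No class weights had to be set by hand.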