I have been trying to classify a set of data into one of four classes. The data has already been generated, and I have set aside 10,000 samples for training and 2,000 for testing. I have also generated labels for each sample. Let's call the classes 0, 1, 2, and 3.
Now when I observe the classification, I notice that there are a lot of 0s in the training data, and hence, in most cases, the classifier simply learns to predict 0 no matter what the features are. (I am using random forests for classification.)
Regenerating the data to ensure uniform class balance takes a lot of time, and I would prefer to avoid that. Is there any way I can still use the data that I have?
Answer
Another approach is to oversample: “Oversampling: you duplicate the observations of the minority class to obtain a balanced dataset.” [1]
But note that oversampling the minority classes may lead to overfitting, since the model sees exact duplicates of the same rows, so be sure to check performance on held-out data that was not oversampled.
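A minimal sketch of this duplication-based oversampling, using scikit-learn's `resample` on synthetic stand-in data (the four-class imbalance ratios here are illustrative assumptions, not the asker's actual dataset). Note that the split happens before oversampling, so duplicated rows never leak into the test set used to check for overfitting:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic stand-in for the imbalanced four-class data (class 0 dominates).
n = 1200
y = rng.choice(4, size=n, p=[0.7, 0.15, 0.1, 0.05])
X = rng.normal(size=(n, 5)) + y[:, None]  # shift means so classes are separable

# Split FIRST, so duplicated rows never appear in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Oversample: duplicate minority-class rows (sampling with replacement)
# until every class matches the majority-class count.
target = np.bincount(y_tr).max()
parts = []
for cls in np.unique(y_tr):
    Xc, yc = X_tr[y_tr == cls], y_tr[y_tr == cls]
    Xu, yu = resample(Xc, yc, replace=True, n_samples=target, random_state=0)
    parts.append((Xu, yu))
X_bal = np.vstack([p[0] for p in parts])
y_bal = np.concatenate([p[1] for p in parts])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
print(np.bincount(y_bal))     # every class now has the same count
print(clf.score(X_te, y_te))  # accuracy on the untouched test set
```

Evaluating on the untouched `X_te` is what reveals whether the duplication caused overfitting; training accuracy on the balanced set will look deceptively high.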
You may also want to check this paper: Yap, Bee Wah, et al. “An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets.” [2]
Attribution
Source: Link, Question Author: Anirudh Vemula, Answer Author: Alexey Grigorev