How to get rid of bias in data?

I have been trying to classify a set of data into one of four classes. The data has already been generated, and I have set aside 10,000 samples for training and 2,000 for testing. I have also generated a label for each sample. Let’s call the classes 0, 1, 2 and 3.

Now, when I look at the classification results, I notice that the training data contains a lot of 0s, so in most cases the classifier just learns to predict 0 no matter what the features are. (I am using random forests for classification.)

Generating the data again to ensure uniformity takes a lot of time, and I would prefer to avoid that. Is there any way I can still use the data that I have?

Answer

One way to keep using the data you have is to oversample: “Oversampling: you duplicate the observations of the minority class to obtain a balanced dataset.” [1]

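But note that oversampling the minority class may lead to overfitting, since the added observations are exact copies of existing ones. Be sure to check for that by evaluating on test data that has not been oversampled.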
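A minimal sketch of random oversampling as described above, using only the Python standard library. The helper name `oversample`, the `seed` parameter, and the toy class counts are assumptions made for illustration; they are not taken from the question's actual data.

```python
import random
from collections import Counter


def oversample(features, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    target = max(Counter(labels).values())

    # Group row indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    chosen = []
    for idx in by_class.values():
        chosen.extend(idx)  # keep every original row
        # Draw the shortfall with replacement from the same class.
        chosen.extend(rng.choices(idx, k=target - len(idx)))

    rng.shuffle(chosen)
    return [features[i] for i in chosen], [labels[i] for i in chosen]


# Hypothetical toy data in which class 0 dominates, as in the question.
labels = [0] * 70 + [1] * 15 + [2] * 10 + [3] * 5
features = [[float(i)] for i in range(len(labels))]

X_bal, y_bal = oversample(features, labels)
balanced_counts = Counter(y_bal)
```

If you try this, it is probably safest to apply it to the 10,000 training samples only and leave the 2,000 test samples untouched, so that duplicated rows cannot make the test score look better than it is.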

You also may want to check this paper: Yap, Bee Wah, et al. “An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets.” [2]

Attribution
Source : Link , Question Author : Anirudh Vemula , Answer Author : Alexey Grigorev
