I have a dataset with a few million rows and ~100 columns.

I would like to detect about 1% of the examples in the dataset, which belong to a common class. I have a minimum precision constraint, but because the costs are very asymmetric I am not too keen on any particular recall (as long as I am not left with 10 positive matches!). What are some approaches that you would recommend in this setting? (Links to papers welcome, links to implementations appreciated.)

**Answer**

I’ve found He and Garcia (2009) to be a helpful review of learning in imbalanced class problems. Here are a few definitely-not-comprehensive things to consider:

**Data-based approaches**

One can undersample the majority class or oversample the minority class. (Breiman pointed out that this is formally equivalent to assigning non-uniform misclassification costs.) Both can cause problems: undersampling can cause the learner to miss aspects of the majority class; oversampling increases the risk of overfitting.
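As a rough illustration of the two resampling options, here is a minimal standard-library sketch. The function names, the `ratio` parameter and the toy data are my own assumptions for illustration, not from any particular package, and the code assumes at least one minority row.

```python
import random

def random_undersample(rows, labels, minority_label=1, ratio=1.0, seed=0):
    """Keep all minority rows and a random subset of majority rows.

    ratio is the desired number of majority rows per minority row.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept = minority + rng.sample(majority, n_keep)
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]

def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate minority rows (with replacement) until classes are balanced."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    idx = minority + majority + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Toy data with 1% positives, as in the question.
labels = [1] * 10 + [0] * 990
rows = [[float(i)] for i in range(1000)]
X_under, y_under = random_undersample(rows, labels)
X_over, y_over = random_oversample(rows, labels)
```

Whichever you use, resample only the training split; the validation and test data should keep the original class proportions so that precision estimates stay meaningful.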

There are “informed undersampling” methods that reduce these issues. One of them is EasyEnsemble, which independently samples several subsets from the majority class and makes multiple classifiers by combining each subset with all the minority class data.
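The subset construction behind EasyEnsemble can be sketched as follows. This is only an illustration of the sampling-and-combining idea: `fit_centroid_scorer` is a hypothetical stand-in base learner that I made up to keep the example self-contained, and averaging scores is one simple way to combine the members.

```python
import random

def fit_centroid_scorer(rows, labels):
    """Hypothetical stand-in base learner: scores a row by how much closer
    it is to the positive centroid than to the negative centroid."""
    dim = len(rows[0])

    def centroid(cls):
        members = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(r[j] for r in members) / len(members) for j in range(dim)]

    pos, neg = centroid(1), centroid(0)

    def score(row):
        d_pos = sum((a - b) ** 2 for a, b in zip(row, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(row, neg))
        return d_neg - d_pos  # larger means "more positive"

    return score

def easy_ensemble(rows, labels, fit, n_subsets=5, seed=0):
    """Train one model per balanced subset: every minority row plus an
    equally sized, independently drawn random sample of majority rows."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    models = []
    for _ in range(n_subsets):
        idx = minority + rng.sample(majority, len(minority))
        models.append(fit([rows[i] for i in idx], [labels[i] for i in idx]))
    # Combine the members by averaging their scores.
    return lambda row: sum(m(row) for m in models) / len(models)

# Toy data: 20 positives centred at 3, 980 negatives centred at 0.
rng = random.Random(1)
rows = [[rng.gauss(3.0, 1.0)] for _ in range(20)]
rows += [[rng.gauss(0.0, 1.0)] for _ in range(980)]
labels = [1] * 20 + [0] * 980
ensemble_score = easy_ensemble(rows, labels, fit_centroid_scorer)
```

Because each member sees a different slice of the majority class, the ensemble as a whole uses far more of the majority data than a single undersampled model would.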

SMOTE (Synthetic Minority Oversampling Technique) and SMOTEBoost (which combines SMOTE with boosting) create synthetic instances of the minority class by interpolating between a minority example and its nearest neighbors in the feature space. SMOTE is implemented in R in the DMwR package (which accompanies Luis Torgo’s book “Data Mining with R, learning with case studies”, CRC Press 2016).
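To make the interpolation step concrete, here is a bare-bones sketch of the SMOTE idea in plain Python. It is not the DMwR implementation; the function name, parameters and the brute-force neighbour search are my own simplifications, and it assumes numeric features and at least two minority rows.

```python
import random

def smote(minority_rows, n_synthetic, k=5, seed=0):
    """Create synthetic minority rows by interpolating between a randomly
    chosen minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority_rows))
        base = minority_rows[i]
        # Brute-force neighbour search; fine for a sketch, slow for large data.
        others = [r for j, r in enumerate(minority_rows) if j != i]
        neighbours = sorted(others, key=lambda r: sq_dist(base, r))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote(minority, n_synthetic=6, k=2)
```

Every synthetic row lies on a line segment between two real minority rows, which is why SMOTE tends to fill in the minority region rather than just duplicating points.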

**Model fitting approaches**

Apply class-specific weights in your loss function (larger weights for minority cases).
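For instance, a class-weighted log loss might look like the sketch below. The function name and the inverse-frequency weight of 99 are assumptions for illustration. Many libraries expose the same idea directly, for example through the `class_weight` parameter of scikit-learn estimators.

```python
import math

def weighted_log_loss(y_true, p_pred, w_pos=1.0, w_neg=1.0, eps=1e-12):
    """Weighted average of the per-row log loss; w_pos scales the
    contribution of positive (minority) rows, w_neg that of negatives."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        w = w_pos if y == 1 else w_neg
        total += -w * (math.log(p) if y == 1 else math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum

# A model that ignores the minority class and predicts 1% for every row.
y_true = [1] + [0] * 99
p_pred = [0.01] * 100
unweighted = weighted_log_loss(y_true, p_pred)
weighted = weighted_log_loss(y_true, p_pred, w_pos=99.0, w_neg=1.0)
```

With uniform weights the "always negative" model looks cheap; with the minority row upweighted, its one missed positive dominates the loss.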

For tree-based approaches, you can use Hellinger distance as a node impurity function, as advocated in Cieslak et al., “Hellinger distance decision trees are robust and skew-insensitive”. (Weka code here.)
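As I understand the criterion, a candidate binary split is scored by comparing how the positives and the negatives distribute across the two branches, without using the class priors. A small sketch, with a function name and toy data of my own choosing (this is not the Weka code):

```python
import math

def hellinger_split_score(feature_values, labels, threshold):
    """Hellinger distance between the branch distributions of the positive
    and negative classes for the split `x <= threshold`. Higher is better."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.0  # nothing to separate at this node
    pos_left = sum(1 for x, y in zip(feature_values, labels)
                   if y == 1 and x <= threshold)
    neg_left = sum(1 for x, y in zip(feature_values, labels)
                   if y != 1 and x <= threshold)
    branches = ((pos_left, neg_left), (n_pos - pos_left, n_neg - neg_left))
    total = 0.0
    for pos_b, neg_b in branches:
        # Fraction of each class that falls into this branch.
        total += (math.sqrt(pos_b / n_pos) - math.sqrt(neg_b / n_neg)) ** 2
    return math.sqrt(total)

values = [0.1, 0.2, 0.3, 0.4, 0.9, 0.95]
labels = [0, 0, 0, 0, 1, 1]
perfect = hellinger_split_score(values, labels, threshold=0.5)
partial = hellinger_split_score(values, labels, threshold=0.25)
```

Because the score depends only on within-class fractions, changing the class ratio by adding more negatives in the same proportions per branch leaves it unchanged, which is the skew-insensitivity the paper title refers to.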

Use a one class classifier, learning either (depending on the model) a probability density or boundary for one class and treating the other class as outliers.

Of course, don’t use accuracy as a metric for model building. Cohen’s kappa is a reasonable alternative.
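To see why accuracy misleads here, compare it with Cohen’s kappa, $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is observed agreement and $p_e$ is agreement expected by chance. The function below is a minimal sketch of that formula, not a library implementation:

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions, corrected
    for the agreement expected by chance from the marginal frequencies."""
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    if p_e == 1.0:
        return 0.0  # degenerate case: only one class appears anywhere
    return (p_o - p_e) / (1.0 - p_e)

# 1% positives, and a useless model that always predicts the majority class.
y_true = [1] * 10 + [0] * 990
all_negative = [0] * 1000
accuracy = sum(1 for t, p in zip(y_true, all_negative) if t == p) / len(y_true)
kappa = cohens_kappa(y_true, all_negative)
```

The always-negative model scores 99% accuracy but a kappa of zero, since all of its agreement is what chance alone would produce.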

**Model evaluation approaches**

If your model returns predicted probabilities or other scores, choose a decision cutoff that makes an appropriate tradeoff in errors (using a dataset independent from training and testing). In R, the package OptimalCutpoints implements a number of algorithms, including cost-sensitive ones, for deciding a cutoff.
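For the specific constraint in the question (a minimum precision, with recall as a secondary concern), one simple approach is to scan the cutoffs on held-out validation scores and take the lowest one that still meets the precision floor. The sketch below assumes binary 0/1 labels and that a row is predicted positive when its score is at or above the threshold; the function name and toy data are hypothetical.

```python
def threshold_for_min_precision(scores, labels, min_precision):
    """Return (threshold, precision, recall) for the lowest score cutoff whose
    precision on the given data is at least min_precision, or None."""
    n_pos = sum(labels)
    if n_pos == 0:
        return None
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = None
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Only evaluate a cutoff after the last row sharing this score (ties).
        last_of_tie = (rank + 1 == len(order)
                       or scores[order[rank + 1]] != scores[i])
        if last_of_tie:
            precision = tp / (tp + fp)
            if precision >= min_precision:
                # Later (lower) cutoffs overwrite earlier ones, so we end up
                # with the qualifying cutoff that has the highest recall.
                best = (scores[i], precision, tp / n_pos)
    return best

val_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
val_labels = [1, 1, 0, 1, 0, 0, 1, 0]
result = threshold_for_min_precision(val_scores, val_labels, min_precision=0.75)
# result == (0.7, 0.75, 0.75)
```

With only about 1% positives, the precision estimate at a given cutoff can be noisy, so it is worth checking how many positives sit above the chosen threshold before trusting it.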

**Attribution**

*Source: Link, Question Author: em70, Answer Author: ayorgo*