Is gradient boosting appropriate for data with low event rates like 1%?

I am trying gradient boosting on a dataset with an event rate of about 1% using Enterprise Miner, but it is failing to produce any output. My question is: since it is a decision-tree-based approach, is it even right to use gradient boosting with such a low event rate?

Answer

(To give a short answer to this:)

It is fine to use a gradient boosting machine (GBM) when dealing with an imbalanced dataset. With a strongly imbalanced dataset, it is much more relevant to question the suitability of the metric used. We should potentially avoid metrics like Accuracy or Recall, which depend on an arbitrary classification threshold, and opt for metrics like AUCPR or the Brier score, which give a more accurate picture (see the excellent CV.SE thread Why is accuracy not the best measure for assessing classification models? for more). Similarly, we could employ a cost-sensitive approach by assigning different misclassification costs: see Masnadi-Shirazi & Vasconcelos (2011) Cost-Sensitive Boosting for a general view and proposed changes to known boosting algorithms, or, for a particularly interesting application with a simpler approach, check the Higgs Boson challenge report for the XGBoost algorithm; Chen & He (2015) Higgs Boson Discovery with Boosted Trees provide more details.
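To make the point about metrics concrete, here is a minimal sketch in plain Python. The data are hand-made and purely illustrative (1,000 observations, 10 events); the two score vectors are hypothetical stand-ins for model output, not results from any real GBM. The average precision helper is a simple step-wise summary of the precision-recall curve, used here as one way to compute an AUCPR-style number.

```python
def accuracy(y_true, probs, threshold=0.5):
    """Share of correct hard labels after thresholding the probabilities."""
    y_hat = [1 if p >= threshold else 0 for p in probs]
    return sum(yh == yt for yh, yt in zip(y_hat, y_true)) / len(y_true)


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, y_true)) / len(y_true)


def average_precision(y_true, probs):
    """Step-wise area under the precision-recall curve.

    Every distinct score is used as a threshold, so tied scores are
    handled as a single group.
    """
    n_pos = sum(y_true)
    ap, prev_recall = 0.0, 0.0
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, y_true) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, y_true) if p >= t and y == 0)
        precision = tp / (tp + fp)
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Illustrative data: 10 events followed by 990 non-events (1% event rate).
y = [1] * 10 + [0] * 990

# Hypothetical model 1: predicts the base rate for everyone (no ranking ability).
p_constant = [0.01] * 1000

# Hypothetical model 2: flags 8 of the 10 events (and 12 non-events) as higher risk.
p_informative = [0.30] * 8 + [0.005] * 2 + [0.30] * 12 + [0.005] * 978

for name, p in [("constant", p_constant), ("informative", p_informative)]:
    print(name, accuracy(y, p), average_precision(y, p), brier_score(y, p))
```

Both sets of predictions score 99% accuracy at the 0.5 threshold, because neither ever predicts an event. Average precision (about 0.32 versus 0.01) and the Brier score do tell them apart.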

It is also worth noting that if we employ a probabilistic classifier (such as a GBM), we can and should actively look into calibrating the returned probabilities (e.g. see Zadrozny & Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates or Kull et al. (2017) Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers) to potentially augment our learner’s performance. Especially when working with imbalanced data, adequately capturing tendency changes might be more informative than simply labelling the data. On that note, some might argue that cost-sensitive approaches are not that beneficial in the end (e.g. see Nikolaou et al. (2016) Cost-sensitive boosting algorithms: Do we really need them?). To reiterate the original point though, boosting algorithms are not inherently bad for imbalanced data, and in certain cases they can offer a very competitive option.
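As a rough illustration of what calibration does, the sketch below fits an isotonic (monotone, step-wise) mapping from raw scores to probabilities with the pool-adjacent-violators algorithm, which is one common calibration method. Everything here is an assumption made for the example: the data are simulated with a roughly 1% event rate, and the raw scores are deliberately inflated tenfold to mimic an overconfident model (as might happen after oversampling the events). The function names are hypothetical helpers, not part of any library.

```python
import random
from bisect import bisect_right


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing step function."""
    # Pool observations with identical scores so ties share one value.
    pooled = {}
    for s, y in zip(scores, labels):
        total, count = pooled.get(s, (0.0, 0))
        pooled[s] = (total + y, count + 1)

    blocks = []  # each block: [lowest score, sum of labels, count]
    for s in sorted(pooled):
        total, count = pooled[s]
        blocks.append([s, total, count])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][1] / blocks[-2][2] > blocks[-1][1] / blocks[-1][2]):
            _, t2, c2 = blocks.pop()
            blocks[-1][1] += t2
            blocks[-1][2] += c2

    starts = [b[0] for b in blocks]
    values = [b[1] / b[2] for b in blocks]
    return starts, values


def predict_isotonic(model, score):
    """Look up the calibrated probability for a raw score."""
    starts, values = model
    i = bisect_right(starts, score) - 1
    return values[max(i, 0)]  # scores below the fitted range use the first block


def brier_score(labels, probs):
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


# Simulated data (assumption): true event probability 0.03 * x^2, mean about 1%.
rng = random.Random(0)
scores, labels = [], []
for _ in range(20000):
    x = rng.random()
    labels.append(1 if rng.random() < 0.03 * x ** 2 else 0)
    scores.append(0.3 * x ** 2)  # raw score: correct ranking, 10x too large

# Fit the calibration map on one half, evaluate on the other half.
fit_s, fit_y = scores[:10000], labels[:10000]
test_s, test_y = scores[10000:], labels[10000:]

model = fit_isotonic(fit_s, fit_y)
calibrated = [predict_isotonic(model, s) for s in test_s]

raw_brier = brier_score(test_y, test_s)
cal_brier = brier_score(test_y, calibrated)
print("Brier score, raw scores:       ", raw_brier)
print("Brier score, calibrated scores:", cal_brier)
```

Because the mapping is monotone, it leaves the ranking of observations essentially unchanged (up to ties within a step) and only rescales the scores. Fitting the map on data separate from the evaluation data, as done here, avoids an overly optimistic picture.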

Attribution
Source: Link, Question Author: user2542275, Answer Author: usεr11852
