I have a data set with a lot of 0 values for the continuous response variable (about 50%). I want to understand how well gradient boosting/random forest deals with this problem. My colleague suggested doing a two part model with classification as the first step to predict the 0’s and second step doing regression. Is this necessary?
p.s. I’m using xgboost in R.
Updated problem statement
- composed of 50% zeros and 50% something else
- large and complex enough that xgboost (or an equivalent) is needed
- the nonzero data follow a multivariate linear relationship, so multivariate regression is appropriate
- HOW to split the task into “is zero vs. is not” and “if not, then fit linear”
- WHEN to …
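For the original xgboost-in-R question, the two-part (“hurdle”) split can be sketched directly. This is a minimal illustration, not a tuned model; it assumes a hypothetical data.frame `df` with a response `y` (about half zeros) and numeric features in the remaining columns:

```r
# Two-part ("hurdle") model sketch with xgboost.
# Assumes df is a data.frame with response y and numeric feature columns.
library(xgboost)

X <- as.matrix(df[, setdiff(names(df), "y")])

# Part 1: classify zero vs. nonzero
is_nonzero <- as.numeric(df$y != 0)
clf <- xgboost(data = X, label = is_nonzero,
               objective = "binary:logistic",
               nrounds = 300, eta = 0.1, verbose = 0)

# Part 2: regress on the nonzero subset only
nz <- df$y != 0
reg <- xgboost(data = X[nz, ], label = df$y[nz],
               objective = "reg:squarederror",
               nrounds = 300, eta = 0.1, verbose = 0)

# Combined prediction: P(nonzero) * E[y | nonzero]
pred <- predict(clf, X) * predict(reg, X)
```

Whether this beats a single model depends on whether the zeros really come from a separate process; that is what the exercise below probes.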
Here is our data.
- We can use the Kaggle “Human Resource Analytics” challenge dataset, but with a modified goal.
- In the challenge, the goal is to predict “whether” the employee will leave, so the output is the class, not a regression. We have to modify that for our purposes. Let’s suppose that “satisfaction” is some HR self-congratulatory hack and poorly represents actual satisfaction. Let’s also presume that a truly, strongly satisfied employee doesn’t tend to leave, and an unsatisfied one does.
- first we use a GBM (xgboost or other), with all columns except satisfaction level and “left” as inputs, to predict whether they left.
- second we use the predicted “left” class (and its probabilities) as inputs when regressing on satisfaction.
- finally we compare to see if there are two fundamentally different sets of “physics” driving satisfaction.
- I am going to use R + ‘h2o’, but the process and results should generalize to any gradient boosted machine, including xgboost. I like the H2O Flow interface through the browser. I also like to use a random forest as a robust estimator of central tendency: it is really hard to over-fit a random forest.
Fit of nominal (did employee leave)
```r
#library
library(h2o)  #gbm

#spin up h2o
h2o.init(nthreads = -1)  #use this computer to the max

#import data
mydata <- h2o.importFile("HR_comma_sep.csv")
mydata[, 7] <- as.factor(mydata[, 7])

#split data
splits <- h2o.splitFrame(mydata, c(0.8))
train.hex <- h2o.assign(splits[[1]], "train.hex")
valid.hex <- h2o.assign(splits[[2]], "valid.hex")

#stage for gbm
idxx <- 1:10
idxx <- idxx[-c(1, 7)]
idxy <- 7
Nn <- 300
Lambda <- 0.1

#fit data
my_fit.gbm <- h2o.gbm(y = idxy, x = idxx,
                      training_frame = train.hex,
                      validation_frame = valid.hex,
                      model_id = "my_fit.gbm",
                      ntrees = Nn, learn_rate = Lambda,
                      score_each_iteration = TRUE)

h2o.confusionMatrix(my_fit.gbm)
```
The purpose of training/validation is to “dial in the parameters” to a decent level, and to estimate operational uncertainty. When the dials are set and we have estimates of what the step-1 errors are, we would then train on the whole data before moving to the second step. In this case I am moving fast, so that is not done here; I predict with the model used for tuning parameters.
Here is the baseline RF
```r
my_fit.rf <- h2o.randomForest(y = idxy, x = idxx,
                              training_frame = train.hex,
                              validation_frame = valid.hex,
                              model_id = "my_fit.rf",
                              ntrees = 150,
                              score_each_iteration = TRUE)

h2o.confusionMatrix(my_fit.rf)
```
Its confusion matrix is:
```
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.469005852414851:
            0     1     Error        Rate
0        9008   100  0.010979   =100/9108
1          98  2762  0.034266    =98/2860
Totals   9106  2862  0.016544  =198/11968
```
Comparing this with the confusion matrix from the GBM fit, we have around 96.5% positive predictive value (2762 of 2862 predicted leavers actually left), and we are in the right area to not be over-fitting.
Here is the confusion matrix for the GBM:
```
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.410323012296247:
            0     1     Error        Rate
0        9030    87  0.009543    =87/9117
1         103  2723  0.036447   =103/2826
Totals   9133  2810  0.015909  =190/11943
```
So let’s predict the “did they leave” on the whole data, and use it to model the “satisfaction_level”.
Here we predict and augment the data
```r
pred_left.hex <- h2o.predict(my_fit.gbm, newdata = mydata,
                             destination_frame = "pred_left.hex")
mydata2 <- h2o.cbind(mydata, pred_left.hex)
```
Here we make prediction of “satisfaction”
```r
#stage for second gbm
idxx2 <- 1:13
idxx2 <- idxx2[-c(1, 7)]
idxy2 <- 1
Nn <- 300
Lambda <- 0.05

#split data
splits2 <- h2o.splitFrame(mydata2, c(0.8))
train2.hex <- h2o.assign(splits2[[1]], "train2.hex")
valid2.hex <- h2o.assign(splits2[[2]], "valid2.hex")

#fit data
my_fit2.gbm <- h2o.gbm(y = idxy2, x = idxx2,
                       training_frame = train2.hex,
                       validation_frame = valid2.hex,
                       model_id = "my_fit2.gbm",
                       ntrees = Nn, learn_rate = Lambda,
                       score_each_iteration = TRUE)
```
As long as it is a “fair” model, the variable importance is going to show whether this has utility.
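In h2o the importance table can be pulled straight from the fitted model:

```r
# Variable importance from the second-stage GBM
h2o.varimp(my_fit2.gbm)        # table of relative importances
h2o.varimp_plot(my_fit2.gbm)   # same information as a bar chart
```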
Here is the “RF as a gross reality check”
```r
my_fit2.rf <- h2o.randomForest(y = idxy2, x = idxx2,
                               training_frame = train2.hex,
                               validation_frame = valid2.hex,
                               model_id = "my_fit2.rf",
                               ntrees = 150,
                               score_each_iteration = TRUE)
```
The RF converged
Fit metrics give an MAE of around 0.13.
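For reference, the validation metrics can be read off the fitted model like so:

```r
# Validation-set performance for the second-stage RF
h2o.performance(my_fit2.rf, valid = TRUE)  # full metrics object
h2o.mae(my_fit2.rf, valid = TRUE)          # mean absolute error alone
```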
Here are the GBM results
Now, I did nearly nothing in the way of real tuning. A decent GBM can usually outperform an RF on accuracy by quite a bit. It can also over-fit, which is a bad thing that requires a little time and effort to resolve.
Our typical error scale of 13% (MAE, mean absolute error) isn’t bad. It is consistent with the RF, but there is something much more interesting.
Notice that “P0”, the probability of staying, is the second most informative variable in the set. It is stunningly more important than salary, work hours, accidents, or previous review. It is, in fact, more informative than the bottom 8 variables combined, even though it is a function of them.
From this we might say that any HR claim that all “satisfaction scores” are created equal is, given this data, junk; we shouldn’t be as surprised as they are that the “best and most experienced employees are leaving prematurely”. With only a little work, the predictive values should move into the high 90s, even on real-world data.
This also shows how having the class probabilities as an input can be substantially informative.
- If P0 were expressed as a log-probability, or as log-odds, it might be even more informative for the fundamental learner, the CART.
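As a sketch of that idea (assuming, as h2o does for binomial models, that the prediction frame carries a `p0` column):

```r
# Convert the staying probability P0 to log-odds before the second-stage fit
pred_df <- as.data.frame(pred_left.hex)     # pull predictions into plain R
eps <- 1e-6                                 # guard against log(0) at 0 or 1
p0 <- pmin(pmax(pred_df$p0, eps), 1 - eps)
pred_df$p0_logodds <- log(p0 / (1 - p0))    # log-odds of staying
```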
- Again, the GBM could be substantially improved by adjusting control parameters. This is practically “shoot from the hip”.
There is also a package called “lime” that is about unpacking variable importance from black-box models like random forests. (ref)
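Roughly, lime usage looks like this; this is a sketch only, and the exact interface for h2o models should be checked against the package documentation:

```r
library(lime)

# Build an explainer on the training data, then explain a few validation rows
explainer <- lime(as.data.frame(train2.hex), my_fit2.gbm)
explanation <- explain(as.data.frame(valid2.hex[1:5, ]), explainer,
                       n_features = 5)
plot_features(explanation)  # per-row feature contributions
```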