Regression with zero inflated continuous response variable using gradient boosting trees and random forest

I have a data set in which about 50% of the values of the continuous response variable are 0. I want to understand how well gradient boosting/random forests deal with this. My colleague suggested a two-part model: a classification step to predict the 0’s, then a regression step. Is this necessary?

p.s. I’m using xgboost in R.


Updated problem statement

Given data:

  • composed of 50% zeros and 50% something else
  • large and complex enough that xgboost (or an equivalent) is warranted
  • the nonzero data follow a multivariate linear relationship, so multivariate regression is appropriate


  • HOW to split task into “is zero vs. is not” and “if not, then fit linear”
  • WHEN to …
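For orientation, here is a minimal two-part (“hurdle”) sketch in base R on simulated data. Plain glm/lm stand in for the boosted learners (the structure is identical if each stage is swapped for xgboost or h2o), and all names and values below are invented for illustration.

```r
## Two-part (hurdle) model on simulated zero-inflated data.
## glm/lm stand in for the boosted stages; swap in xgboost/h2o as needed.
set.seed(42)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
nz <- rbinom(n, 1, plogis(1.5 * x1))          # stage-1 truth: zero vs. not
y  <- nz * (2 + 3 * x2 + rnorm(n, sd = 0.5))  # stage-2 truth: linear if nonzero

dat    <- data.frame(y, x1, x2, is_nz = as.integer(y != 0))
stage1 <- glm(is_nz ~ x1 + x2, family = binomial, data = dat)  # "is zero vs. is not"
stage2 <- lm(y ~ x1 + x2, data = subset(dat, is_nz == 1))      # "if not, then fit linear"

## Combine the parts: E[y] = P(y != 0) * E[y | y != 0]
y_hat <- predict(stage1, dat, type = "response") * predict(stage2, dat)
```

The combined prediction multiplies the stage-1 probability by the stage-2 conditional mean, which is the standard way the two parts are recomposed.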


Here is our data.

  • We can use the Kaggle “Human Resource Analytics” challenge
    dataset, but with modified goal.
  • In the challenge, the goal is to predict “whether” the employee will
    leave, so the output is the class, not the regression. We have to
    modify that for our purposes. Let’s suppose that the “satisfaction”
    is some HR self-congratulatory hack, and poorly represents actual
    satisfaction. Let’s presume also that a strongly truly satisfied
    employee doesn’t tend to leave, and one that is unsatisfied does tend
    to leave.


  • first we fit a GBM (xgboost or other) on all columns except satisfaction level, with “left” as the target, to determine whether they left.
  • second we use the predicted “left” class as an input to regress on satisfaction.
  • finally we compare, to see whether there are two fundamentally different sets of “physics” driving satisfaction.


  • I am going to use R + ‘h2o’, but the process and results should generalize to any gradient boosted machine, including xgboost. I like the H2O Flow interface in the browser. I also like to use a random forest as a robust estimator of central tendency; it is really hard to over-fit a random forest.

Fit of nominal (did employee leave)

library(h2o) #gbm

#spin up h2o
h2o.init(nthreads = -1) #use this computer to the max

#import data
mydata <- h2o.importFile("HR_comma_sep.csv")

mydata[,7] <- as.factor(mydata[,7]) #column 7 = "left", the class target

#split data
splits <- h2o.splitFrame(mydata, ratios = 0.8, seed = 1234) #80/20 split; ratio and seed assumed, not shown in original

train.hex <- h2o.assign(splits[[1]], "train.hex")
valid.hex <- h2o.assign(splits[[2]], "valid.hex")

#stage for gbm
idxx <- 1:10
idxx <- idxx[-c(1,7)]  #predictors: all columns except satisfaction_level (1) and left (7)

idxy <- 7              #target: "left"

Nn     <- 300          #number of trees
Lambda <- 0.1          #learning rate

#fit data
my_fit.gbm <- h2o.gbm(x = idxx,
                      y = idxy,
                      training_frame = train.hex,
                      validation_frame = valid.hex,
                      model_id = "my_fit.gbm",
                      ntrees = Nn,
                      learn_rate = Lambda,
                      score_each_iteration = TRUE)


The purpose of the training/validation split is to “dial in the parameters” to a decent level, and to estimate operational uncertainty. Once the dials are set and we have estimates of what the step-1 errors are, we would train on the whole data before moving to the second step. In this case I am moving fast, so that is not done here; I predict with the model used for tuning the parameters.

Convergence is fair, although the example here is nothing close to rigorous.
[Plot: GBM scoring history showing convergence]

Here is the baseline RF

my_fit.rf <- h2o.randomForest(x = idxx,
                              y = idxy,
                              training_frame = train.hex,
                              validation_frame = valid.hex,
                              model_id = "my_fit.rf",
                              score_each_iteration = TRUE)


Its confusion matrix is:

Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.469005852414851:
          0    1    Error        Rate
0      9008  100 0.010979   =100/9108
1        98 2762 0.034266    =98/2860
Totals 9106 2862 0.016544  =198/11968

Comparison of this with the confusion matrix from the GBM fit suggests a positive predictive value of around 96.5% (2762 of the 2862 predicted leavers actually left), and we are in the right region to not be over-fitting.
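As a sanity check, the positive predictive value falls straight out of the RF confusion matrix above (predicted-“1” column):

```r
## PPV from the RF confusion matrix: of everything predicted "1",
## how much actually was "1"?
tp  <- 2762            # actual 1, predicted 1
fp  <- 100             # actual 0, predicted 1
ppv <- tp / (tp + fp)
round(ppv, 3)          # 0.965
```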

Here is the confusion matrix for the GBM:

Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.410323012296247:
          0    1    Error        Rate
0      9030   87 0.009543    =87/9117
1       103 2723 0.036447   =103/2826
Totals 9133 2810 0.015909  =190/11943

So let’s predict the “did they leave” on the whole data, and use it to model the “satisfaction_level”.

Here we predict and augment the data

pred_left.hex <- h2o.predict(my_fit.gbm,
                             newdata = mydata)

mydata2 <- h2o.cbind(mydata, pred_left.hex)

Here we make prediction of “satisfaction”

#stage for second gbm
idxx2 <- 1:13           #original 10 columns plus predicted class and its probabilities
idxx2 <- idxx2[-c(1,7)] #drop satisfaction_level (1, now the target) and left (7)

idxy2 <- 1              #target: satisfaction_level

Nn     <- 300
Lambda <- 0.05

#split data
splits2 <- h2o.splitFrame(mydata2, ratios = 0.8, seed = 1234) #ratio and seed assumed, not shown in original

train2.hex <- h2o.assign(splits2[[1]], "train2.hex")
valid2.hex <- h2o.assign(splits2[[2]], "valid2.hex")

#fit data
my_fit2.gbm <- h2o.gbm(x = idxx2,
                       y = idxy2,
                       training_frame = train2.hex,
                       validation_frame = valid2.hex,
                       model_id = "my_fit2.gbm",
                       ntrees = Nn,
                       learn_rate = Lambda,
                       score_each_iteration = TRUE)

As long as it is a “fair” model, the variable importance is going to show whether this has utility.

Here is the “RF as a gross reality check”

my_fit2.rf <- h2o.randomForest(x = idxx2,
                               y = idxy2,
                               training_frame = train2.hex,
                               validation_frame = valid2.hex,
                               model_id = "my_fit2.rf",
                               score_each_iteration = TRUE)

The RF converged:

[Plot: RF scoring history]

Fit metrics give an MAE of around 0.13.
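For reference, MAE is just the mean absolute deviation of predictions from actuals; on a 0–1 satisfaction scale, 0.13 means predictions are off by about 13 points on average. The values below are made up to show the arithmetic, not taken from the run.

```r
## MAE on toy values (illustration only, not the real predictions)
actual <- c(0.40, 0.80, 0.15, 0.95)
pred   <- c(0.50, 0.70, 0.30, 0.90)
mae    <- mean(abs(pred - actual))
mae    # 0.1
```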

[RF validation metrics]

Here are the GBM results

[GBM validation metrics]

Now, I did nearly nothing in the way of real tuning. A decent GBM can usually outperform an RF on accuracy by quite a bit. It can also over-fit, which is a bad thing that requires a little time and effort to resolve.

Our typical error scale of 13% (MAE = mean absolute error) isn’t bad. It is consistent with the RF, but there is something much more interesting.

Here is what the GBM gives for variable importance (and the key you are looking for).
[GBM variable-importance plot]

Notice that “P0”, the probability of staying, is the second most informative variable in the set. It is stunningly more important than salary, work hours, accidents, or previous review. It is, in fact, more informative than the bottom 8 variables combined, even though it is a function of them.

From this we might say that any HR claim that all “satisfaction scores” are created equal is, given this data, junk; we shouldn’t be as surprised as they are that the “best and most experienced employees are leaving prematurely”. With only a little work, the predictive values should move into the high 90’s, even on real-world data.

This also shows how having the class probabilities as an input can be substantially informative.


  • If P0 were supplied as the log of the probability, or as log-odds, it might be even more informative for the fundamental learner, the CART.
  • Again, the GBM could be substantially improved by adjusting control parameters. This is practically “shoot from the hip”.
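The log-odds transform in the first bullet is one line in base R (`qlogis` is the logit); a quick sketch:

```r
## Logit (log-odds) of a class probability, and its inverse
p        <- c(0.10, 0.50, 0.90)
log_odds <- qlogis(p)          # log(p / (1 - p)); qlogis(0.5) is 0
back     <- plogis(log_odds)   # round-trips back to p
```

In h2o, you would apply this to the P0 column of the prediction frame before the cbind, so the second-stage trees split on log-odds instead of raw probability.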

There is also a package called “lime” that is about unpacking variable importance from black-box models like random forests. (ref)

Source: Link, Question Author: user1569341, Answer Author: EngrStudent
