R: What do I see in partial dependence plots of gbm and RandomForest?

Actually, I thought I had understood what one can show with a partial dependence plot, but using a very simple hypothetical example, I got rather puzzled. In the following chunk of code I generate three independent variables (a, b, c) and one dependent variable (y), with c showing a close linear relationship with y while a and b are uncorrelated with y. I run a regression analysis with a boosted regression tree using the R package gbm:

a <- runif(100, 1, 100)
b <- runif(100, 1, 100)
c <- 1:100 + rnorm(100, mean = 0, sd = 5)
y <- 1:100 + rnorm(100, mean = 0, sd = 5)
par(mfrow = c(2,2))
plot(y ~ a); plot(y ~ b); plot(y ~ c)
Data <- data.frame(matrix(c(y, a, b, c), ncol = 4))
names(Data) <- c("y", "a", "b", "c")
library(gbm)
gbm.gaus <- gbm(y ~ a + b + c, data = Data, distribution = "gaussian")
par(mfrow = c(2,2))
plot(gbm.gaus, i.var = 1)
plot(gbm.gaus, i.var = 2)
plot(gbm.gaus, i.var = 3)

Not surprisingly, for variables a and b the partial dependence plots yield horizontal lines at about the mean of y. What puzzles me is the plot for variable c. I get horizontal lines for the ranges c < 40 and c > 60, and the y-axis is restricted to values close to the mean of y. Since a and b are completely unrelated to y (and thus their variable importance in the model is 0), I expected that c would show partial dependence along its entire range instead of that sigmoid shape over a very restricted range of its values. I tried to find information in Friedman (2001), "Greedy function approximation: a gradient boosting machine", and in Hastie et al. (2011), "The Elements of Statistical Learning", but my mathematical skills are too low to understand all the equations and formulae therein. Thus my question: what determines the shape of the partial dependence plot for variable c? (Please explain in words comprehensible to a non-mathematician!)
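For what it is worth, the partial dependence for a single variable can also be approximated by hand: fix that variable at a grid of values, predict on the whole data set for each grid value, and average the predictions. The sketch below is my own check of that recipe (not part of gbm's plotting code); it reuses the Data and gbm.gaus objects from above, and names like c.grid and pd.c are just for illustration.

#manual partial dependence for c (a sketch; assumes Data and gbm.gaus from above)
c.grid <- seq(min(Data$c), max(Data$c), length.out = 50)
pd.c <- sapply(c.grid, function(cv) {
  tmp <- Data
  tmp$c <- cv                                   # fix c for every observation
  mean(predict(gbm.gaus, newdata = tmp, n.trees = gbm.gaus$n.trees))
})
plot(c.grid, pd.c, type = "l", xlab = "c", ylab = "partial dependence")
#this curve should roughly mirror plot(gbm.gaus, i.var = 3)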

ADDED on 17th April 2014:

While waiting for a response, I used the same example data for an analysis with the R package randomForest. The partial dependence plots from randomForest resemble much more closely what I expected from the gbm plots: the partial dependence of the explanatory variables a and b varies randomly and closely around 50, while explanatory variable c shows partial dependence over its entire range (and over almost the entire range of y). What could be the reasons for these different shapes of the partial dependence plots in gbm and randomForest?

[Figure: partial dependence plots of gbm and randomForest for a, b and c]

Here is the modified code that compares the plots:

a <- runif(100, 1, 100)
b <- runif(100, 1, 100)
c <- 1:100 + rnorm(100, mean = 0, sd = 5)
y <- 1:100 + rnorm(100, mean = 0, sd = 5)
par(mfrow = c(2,2))
plot(y ~ a); plot(y ~ b); plot(y ~ c)
Data <- data.frame(matrix(c(y, a, b, c), ncol = 4))
names(Data) <- c("y", "a", "b", "c")

library(gbm)
gbm.gaus <- gbm(y ~ a + b + c, data = Data, distribution = "gaussian")

library(randomForest)
rf.model <- randomForest(y ~ a + b + c, data = Data)

x11(height = 8, width = 5)
par(mfrow = c(3,2))
par(oma = c(1,1,4,1))
plot(gbm.gaus, i.var = 1)
partialPlot(rf.model, Data[,2:4], x.var = "a")
plot(gbm.gaus, i.var = 2)
partialPlot(rf.model, Data[,2:4], x.var = "b")
plot(gbm.gaus, i.var = 3)
partialPlot(rf.model, Data[,2:4], x.var = "c")
title(main = "Boosted regression tree", outer = TRUE, adj = 0.15)
title(main = "Random forest", outer = TRUE, adj = 0.85)

Answer

I spent some time writing my own “partial.function-plotter” before I realized it was already bundled in the R randomForest library.

[EDIT: …but then I spent a year making the CRAN package forestFloor, which is in my opinion significantly better than classical partial dependence plots.]

Partial dependence plots are great in instances like the simulation example you show here, where the explanatory variables do not interact with each other. If each explanatory variable contributes additively to the target y through some unknown function, this method is great for showing that estimated hidden function. I often see such flattening near the borders of partial functions.
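A hedged illustration of that point (my own addition, with simulated data and variable names of my own choosing): when the target is purely additive with no interactions, each partial dependence curve approximately recovers the corresponding additive component.

library(randomForest)
set.seed(2)
x1 <- runif(1000, -3, 3)
x2 <- runif(1000, -3, 3)
y  <- sin(x1) + 0.5 * x2 + rnorm(1000, sd = 0.2)   #additive target, no interactions
D  <- data.frame(y, x1, x2)
rf <- randomForest(y ~ x1 + x2, data = D)
par(mfrow = c(1, 2))
partialPlot(rf, D, x.var = "x1", main = "should resemble sin(x1)")
partialPlot(rf, D, x.var = "x2", main = "should resemble 0.5 * x2")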

Some reasons:
randomForest has an argument nodesize = 5 (the default for regression), which means no tree will subdivide a group of 5 members or fewer. Therefore each tree cannot distinguish with any finer precision.
The bagging/bootstrapping layer of the ensemble smooths, by averaging, the many step functions of the individual trees – but only in the middle of the data region. Approaching the borders of the space represented by the data, the 'amplitude' of the partial function falls.
Setting a smaller nodesize (e.g. nodesize = 3) and/or getting more observations relative to the noise can reduce this border-flattening effect.
In general, when the signal-to-noise ratio falls, the prediction scale of the random forest condenses. The predictions are then not accurate in absolute terms, only linearly correlated with the target. You can see the variables a and b as examples of an extremely low signal-to-noise ratio, which is why their partial functions are so flat. It is a nice feature of random forest that you can already guess from the range of the training-set predictions how well the model is performing; OOB predictions are also great for this. The sketch after this list illustrates how the prediction range condenses.
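Here is the sketch mentioned above (my own addition; the data and names are made up for illustration). It shows how the range of the OOB predictions condenses toward the mean when the target carries no signal, which is exactly why the partial functions for a and b are so flat.

library(randomForest)
set.seed(1)
x        <- runif(500, 1, 100)
y.signal <- x + rnorm(500, sd = 5)   #strong signal
y.noise  <- runif(500, 1, 100)       #same spread as y.signal, but unrelated to x
rf.signal <- randomForest(y.signal ~ x)
rf.noise  <- randomForest(y.noise ~ x)
range(predict(rf.signal))   #OOB predictions span nearly the full range of y.signal
range(predict(rf.noise))    #OOB predictions condense around the mean of y.noise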

Flattening of the partial plot in regions with no data is reasonable:
As random forest and CART are data-driven models, I personally like the concept that these models do not extrapolate. Thus the prediction for c = 500 or c = 1100 is exactly the same as for c = 100, and in most instances also the same as for c = 98.
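A small check of that no-extrapolation point (my own addition; it assumes the 100-observation Data from the question, and rf.check and new.pts are just illustrative names): predictions far outside the training range of c equal the prediction at the edge of that range.

library(randomForest)
rf.check <- randomForest(y ~ a + b + c, data = Data)
new.pts  <- data.frame(a = 50, b = 50, c = c(98, 100, 500, 1100))
predict(rf.check, newdata = new.pts)
#the predictions for c = 100, 500 and 1100 are identical; the one for c = 98 is
#usually very close, because trees only split at thresholds seen in the training data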

Here is a code example where the border flattening is reduced. I have not tried the gbm package; the code below is illustrative and based on your example.

#more observations are created...
a <- runif(5000, 1, 100)
b <- runif(5000, 1, 100)
c <- (1:5000)/50 + rnorm(5000, mean = 0, sd = 0.1)
y <- (1:5000)/50 + rnorm(5000, mean = 0, sd = 0.1)
par(mfrow = c(1,3))
plot(y ~ a); plot(y ~ b); plot(y ~ c)
Data <- data.frame(matrix(c(y, a, b, c), ncol = 4))
names(Data) <- c("y", "a", "b", "c")
library(randomForest)
#a smaller nodesize is "not as important" when the number of observations is increased
#more trees can smooth the flattening so that boundary regions get the best possible
#signal to noise; how many trees are needed is data specific

plot.partial <- function() {
  partialPlot(rf.model, Data[, 2:4], x.var = "a", xlim = c(1, 100), ylim = c(1, 100))
  partialPlot(rf.model, Data[, 2:4], x.var = "b", xlim = c(1, 100), ylim = c(1, 100))
  partialPlot(rf.model, Data[, 2:4], x.var = "c", xlim = c(1, 100), ylim = c(1, 100))
}

#worst case!: only 100 samples from Data and nodesize = 30
rf.model <- randomForest(y ~ a + b + c, data = Data[sample(5000, 100), ], nodesize = 30)
plot.partial()

#reasonable settings for the least partial flattening with few observations: 100 samples, nodesize = 5 and ntree = 2000
#more trees can smooth the flattening so that boundary regions get the best possible fidelity
rf.model <- randomForest(y ~ a + b + c, data = Data[sample(5000, 100), ], nodesize = 5, ntree = 2000)
plot.partial()

#more observations are great!
rf.model <- randomForest(y ~ a + b + c,
 data = Data[sample(5000, 5000), ],
 nodesize = 5, ntree = 2000)
plot.partial()

Attribution
Source: Link, Question Author: user7417, Answer Author: Community
