I am using the gbm package for classification. As expected, the results are good, but I am trying to understand the output of the classifier.
There are five columns in the output:
`Iter TrainDeviance ValidDeviance StepSize Improve`
Could anyone explain the meaning of each term, especially `Improve`?
You should find these are related to determining the best value for the number of basis functions (i.e. iterations, i.e. the number of trees in the additive model). I can't find documentation describing exactly what these are, but here is my best guess, and maybe someone else can comment.
Take the following from the manual:
    library(gbm)
    # A least squares regression example
    # create some data
    N <- 1000
    X1 <- runif(N)
    X2 <- 2*runif(N)
    X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
    X4 <- factor(sample(letters[1:6],N,replace=TRUE))
    X5 <- factor(sample(letters[1:3],N,replace=TRUE))
    X6 <- 3*runif(N)
    mu <- c(-1,0,1,2)[as.numeric(X3)]
    SNR <- 10 # signal-to-noise ratio
    Y <- X1**1.5 + 2 * (X2**.5) + mu
    sigma <- sqrt(var(Y)/SNR)
    Y <- Y + rnorm(N,0,sigma)
    # introduce some missing values
    X1[sample(1:N,size=500)] <- NA
    X4[sample(1:N,size=300)] <- NA
    data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
    # fit initial model
    gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
        data=data,                   # dataset
        var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                     # +1: monotone increase,
                                     #  0: no monotone restrictions
        distribution="gaussian",     # bernoulli, adaboost, gaussian,
                                     # poisson, coxph, and quantile available
        n.trees=3000,                # number of trees
        shrinkage=0.005,             # shrinkage or learning rate,
                                     # 0.001 to 0.1 usually work
        interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
        bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
        train.fraction = 0.5,        # fraction of data for training,
                                     # first train.fraction*N used for training
        n.minobsinnode = 10,         # minimum total weight needed in each node
        cv.folds = 5,                # do 5-fold cross-validation
        keep.data=TRUE,              # keep a copy of the dataset with the object
        verbose=TRUE)                # print out progress
The number of iterations (`Iter`) is 3000, which is the number of trees selected to be built (1 to 3000, although not every one is printed). By the way, the full process is repeated 5 times because we selected `cv.folds=5`.
`StepSize` is the shrinkage or learning rate selected (0.005 here).
I believe that `Improve` is the reduction in the deviance (loss function) from adding another tree, calculated using the out-of-bag (OOB) records (note that it will not be calculated unless `bag.fraction` is < 1).
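To check this against a fitted model, here is a minimal sketch. It assumes that the `Improve` column corresponds to the `oobag.improve` component of the fitted object, which is my reading and not something I have seen stated next to the printed output. The toy data (`d`, `x1`, `x2`, `y`) and the settings are made up for illustration.

```r
library(gbm)

# Hypothetical toy data, only for illustration
set.seed(1)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 * d$x1 + sin(4 * d$x2) + rnorm(n, sd = 0.3)

fit <- gbm(y ~ x1 + x2, data = d,
           distribution = "gaussian",
           n.trees = 200,
           shrinkage = 0.05,
           bag.fraction = 0.5,   # < 1, so each tree leaves some records out of bag
           verbose = FALSE)

# One out-of-bag improvement estimate per tree
# (assumed to be what the Improve column prints)
length(fit$oobag.improve)
head(fit$oobag.improve)

# OOB-based estimate of the best number of trees; gbm itself notes that
# this method tends to underestimate the optimum
best_oob <- gbm.perf(fit, method = "OOB", plot.it = FALSE)
best_oob
```

The values usually start positive and drift towards zero (or below) as extra trees stop helping on the out-of-bag records.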
Then, for each iteration, `TrainDeviance` and `ValidDeviance` are the values of the loss function on the training data and on the hold-out data (a single hold-out set). `ValidDeviance` will not be calculated unless `train.fraction` is < 1.
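The same numbers can be pulled from the fitted object. This is a sketch under the assumption that `TrainDeviance`, `ValidDeviance` and `StepSize` correspond to the `train.error`, `valid.error` and `shrinkage` components; the toy data and settings are hypothetical.

```r
library(gbm)

# Hypothetical toy data, only for illustration
set.seed(2)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 * d$x1 + sin(4 * d$x2) + rnorm(n, sd = 0.3)

fit <- gbm(y ~ x1 + x2, data = d,
           distribution = "gaussian",
           n.trees = 200,
           shrinkage = 0.05,
           bag.fraction = 0.5,
           train.fraction = 0.5,  # < 1, so the remaining rows form a hold-out set
           verbose = FALSE)

# Loss after each iteration on the training rows and on the hold-out rows
head(fit$train.error)
head(fit$valid.error)

# Iteration with the lowest hold-out loss
best_valid <- which.min(fit$valid.error)
best_valid

# Learning rate that was passed in (assumed to be the StepSize column)
fit$shrinkage
```

The training loss normally keeps falling as trees are added, while the hold-out loss flattens out or turns upwards, which is why the hold-out minimum is used to pick the number of trees.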
Have you seen this which describes the 3 types of methods for determining the optimal number of trees?
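Assuming the three methods meant here are the ones offered by `gbm.perf` (`"OOB"`, `"test"` and `"cv"`), a rough sketch of comparing them on one model could look like the following. The toy data and settings are hypothetical, and `n.cores = 1` is only there to keep the cross-validation on a single core.

```r
library(gbm)

# Hypothetical toy data, only for illustration
set.seed(3)
n <- 600
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 * d$x1 + sin(4 * d$x2) + rnorm(n, sd = 0.3)

fit <- gbm(y ~ x1 + x2, data = d,
           distribution = "gaussian",
           n.trees = 200,
           shrinkage = 0.05,
           bag.fraction = 0.5,    # needed for method = "OOB"
           train.fraction = 0.8,  # needed for method = "test"
           cv.folds = 3,          # needed for method = "cv"
           n.cores = 1,
           verbose = FALSE)

# Estimated best number of trees under each method
best_oob  <- gbm.perf(fit, method = "OOB",  plot.it = FALSE)
best_test <- gbm.perf(fit, method = "test", plot.it = FALSE)
best_cv   <- gbm.perf(fit, method = "cv",   plot.it = FALSE)

c(OOB = best_oob, test = best_test, cv = best_cv)
```

The three estimates generally differ; the OOB one is often the smallest, and gbm says as much when that method is used.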