I’ve noticed that when building random forest regression models, at least in R, the predicted value never exceeds the maximum value of the target variable seen in the training data. As an example, see the code below. I’m building a regression model to predict mpg based on the mtcars data. I build OLS and random forest models and use them to predict mpg for a hypothetical car that should have very good fuel economy. The OLS model predicts a high mpg, as expected, but the random forest does not. I’ve noticed this in more complex models too. Why is this?

> library(datasets)
> library(randomForest)
>
> data(mtcars)
> max(mtcars$mpg)
[1] 33.9
>
> set.seed(2)
> fit1 <- lm(mpg~., data=mtcars)            # OLS fit
> fit2 <- randomForest(mpg~., data=mtcars)  # random forest fit
>
> # Hypothetical car that should have very high mpg
> hypCar <- data.frame(cyl=4, disp=50, hp=40, drat=5.5, wt=1, qsec=24, vs=1, am=1, gear=4, carb=1)
>
> predict(fit1, hypCar)  # OLS predicts higher mpg than max(mtcars$mpg)
      1
37.2441
> predict(fit2, hypCar)  # RF does not predict higher mpg than max(mtcars$mpg)
       1
30.78899
Answer
As has been mentioned in previous answers, random forests for regression (and the regression trees they are built from) cannot extrapolate (well), so they do not produce sensible predictions for data points beyond the range of the training data. A regression tree consists of a hierarchy of nodes, where each internal node specifies a test on an attribute value and each leaf (terminal) node specifies a rule to compute a predicted output. In your case, the test observation flows through each tree to a leaf node stating, e.g., “if x > 335, then y = 15”, and the random forest averages those leaf predictions. Since every leaf predicts an average of training targets, no forest prediction can exceed the maximum (or fall below the minimum) target value seen in training.
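The bounded-prediction property is language-agnostic, so here is a minimal sketch of it in Python rather than R: a hand-rolled one-split regression “stump” whose leaves predict the mean of the training targets that landed in them. All names, the threshold, and the toy data are illustrative, not from the question’s mtcars example.

```python
# Illustrative only: a one-split regression "stump" whose leaves predict
# the mean of the training targets that fell into them. Because every
# prediction is a mean of training targets, it can never leave the
# interval [min(y_train), max(y_train)] -- the same reason a random
# forest of such trees cannot extrapolate.

def fit_stump(xs, ys, threshold):
    """Split the training data at `threshold`; each leaf stores its mean target."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return (sum(left) / len(left), sum(right) / len(right))

def predict_stump(stump, threshold, x):
    """Route x to a leaf and return that leaf's stored mean."""
    left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Toy training data (made up for illustration)
x_train = [50, 100, 150, 200, 300, 335]
y_train = [20.0, 19.0, 17.5, 16.0, 15.5, 15.0]

stump = fit_stump(x_train, y_train, threshold=175)

# Even for x-values far outside the training range, the prediction is
# just a leaf mean, bounded by the training targets:
print(predict_stump(stump, 175, 1000))   # right-leaf mean
print(predict_stump(stump, 175, -1000))  # left-leaf mean
```

A linear model fit to the same data would keep sloping past x = 335; the stump (and any average of stumps or deeper trees) goes flat instead, which is exactly the plateau the R script below visualizes.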
Here is an R script visualizing the situation with both random forest and linear regression. In the random forest’s case, the predictions are constant for test points that lie below the lowest or above the highest training x-value.
library(datasets)
library(randomForest)
library(ggplot2)
library(ggthemes)
# Import mtcars (Motor Trend Car Road Tests) dataset
data(mtcars)
# Define training data
train_data = data.frame(
x = mtcars$hp, # Gross horsepower
y = mtcars$qsec) # 1/4 mile time
# Train random forest model for regression
random_forest <- randomForest(x = matrix(train_data$x),
                              y = train_data$y, ntree = 20)
# Train linear regression model using ordinary least squares (OLS) estimator
linear_regr <- lm(y ~ x, train_data)
# Create testing data
test_data = data.frame(x = seq(0, 400))
# Predict targets for testing data points
test_data$y_predicted_rf <- predict(random_forest, matrix(test_data$x))
test_data$y_predicted_linreg <- predict(linear_regr, test_data)
# Visualize
ggplot2::ggplot() +
# Training data points
ggplot2::geom_point(data = train_data, size = 2,
ggplot2::aes(x = x, y = y, color = "Training data")) +
# Random forest predictions
ggplot2::geom_line(data = test_data, size = 2, alpha = 0.7,
ggplot2::aes(x = x, y = y_predicted_rf,
color = "Predicted with random forest")) +
# Linear regression predictions
ggplot2::geom_line(data = test_data, size = 2, alpha = 0.7,
ggplot2::aes(x = x, y = y_predicted_linreg,
color = "Predicted with linear regression")) +
  # Hide legend title, move legend to the bottom and add axis labels
  ggplot2::theme(legend.title = ggplot2::element_blank(),
                 legend.position = "bottom") +
  ggplot2::labs(x = "Gross horsepower", y = "1/4 mile time") +
  ggthemes::scale_colour_colorblind()
Attribution
Source: Link, Question Author: Gaurav Bansal, Answer Author: tuomastik