Including Interaction Terms in Random Forest

Suppose we have a response Y and predictors X1,…,Xn. If we try to fit Y via a linear model in X1,…,Xn, and the true relationship between Y and the predictors happens not to be linear, we might be able to fix the model by transforming the X's and refitting. Moreover, if the effect of each predictor on Y depends on the values of the other predictors, we might improve the model by including interaction terms such as X1*X3 or X1*X4*X7. So in the linear case, interaction terms can add value by capturing non-linearity or dependence between the effects of the features.
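As a quick sketch of that point (with hypothetical predictors x1 and x3, simulated only for illustration), compare a main-effects-only linear model with one that includes the interaction term when the predictors only matter jointly:

```r
set.seed(1)  # hypothetical data, not from the question
x1 <- rnorm(500)
x3 <- rnorm(500)
y  <- 2 * x1 * x3 + rnorm(500, sd = 0.1)  # y depends on x1 and x3 only jointly

fit_main <- lm(y ~ x1 + x3)  # main effects only: misses the structure
fit_int  <- lm(y ~ x1 * x3)  # expands to x1 + x3 + x1:x3

summary(fit_main)$r.squared  # close to 0
summary(fit_int)$r.squared   # close to 1
```

The `x1 * x3` formula shorthand adds both main effects and the `x1:x3` interaction in one step.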

However, Random Forests don’t really make these assumptions. Is including interaction terms important when fitting a Random Forest? Or will just including the individual terms and choosing appropriate parameters allow Random Forests to capture these relationships?


Although feature engineering is very important in real life, trees (and random forests) are very good at finding interactions of the form x*y. Here is a toy example of a regression with a two-way interaction, comparing a naive linear model with a single tree and a bag of trees (a simpler alternative to a random forest).

As you can see, the tree by itself is pretty good at finding the interaction but the linear model is no good in this example.

# fake data
library(rpart)
set.seed(42)  # for reproducibility

x <- rnorm(1000, sd=3)
y <- rnorm(1000, sd=3)
z <- x + y + 10*x*y + rnorm(1000, 0, 0.2)
dat <- data.frame(x, y, z)

# test and train split
test <- sample(1:nrow(dat), 200)
train <- (1:1000)[-test]

# bag of trees model function: fit N trees, each on a bootstrap resample
boot_tree <- function(formula, dat, N=100){
  models <- list()
  for (i in 1:N){
    models[[i]] <- rpart(formula, dat[sample(nrow(dat), nrow(dat), replace=TRUE), ])
  }
  class(models) <- "boot_tree"
  models
}

# prediction function for bag of trees: average the trees' predictions
predict.boot_tree <- function(models, newdat){
  preds <- matrix(0, ncol=length(models), nrow=nrow(newdat))
  for (i in 1:length(models)){
    preds[,i] <- predict(models[[i]], newdat)
  }
  apply(preds, 1, function(x) mean(x, trim=0.1))  # trimmed mean across trees
}

## Fit models and predict:

# linear model
model1 <- lm(z ~ x + y, data=dat[train,])
pred1 <- predict(model1, dat[test,])

# tree
model2 <- rpart(z ~ x + y, data=dat[train,])
pred2 <- predict(model2, dat[test,])

# bag of trees
model3 <- boot_tree(z ~ x + y, dat[train,])
pred3 <- predict(model3, dat[test,])
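To make "pretty good" concrete, one can also compare out-of-sample root-mean-squared error. This continues the script above; `rmse` is a small helper added here, not part of the original answer:

```r
# compare held-out error across the three models
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

rmse(dat$z[test], pred1)  # linear model
rmse(dat$z[test], pred2)  # single tree
rmse(dat$z[test], pred3)  # bag of trees: typically the smallest of the three here
```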

ylim <- range(c(pred1, pred2, pred3))

# plot predictions against true z
plot(dat$z[test], pred1, pch=19, xlab="Actual z",
     ylab="Predicted z", ylim=ylim)
points(dat$z[test], pred2, col="green", pch=19)
points(dat$z[test], pred3, col="blue", pch=19)

abline(0, 1, lwd=3, col="orange")

legend("topleft", pch=rep(19, 3), col=c("black", "green", "blue"),
       legend=c("Linear", "Tree", "Bagged trees"))

[Plot: predicted vs. actual z for the three models, with the identity line in orange]

Source: Link, Question Author: mt88, Answer Author: Flounderer
