# Is the sum of two decision trees equivalent to a single decision tree?

Suppose we have two regression trees (tree A and tree B) that map input $x \in \mathbb{R}^d$ to output $\hat{y} \in \mathbb{R}$. Let $\hat{y} = f_A(x)$ for tree A and $f_B(x)$ for tree B. Each tree uses binary splits, with hyperplanes as the separating functions.

Now, suppose we take a weighted sum of the tree outputs:

Is the function $f_C$ equivalent to a single (deeper) regression tree? If the answer is “sometimes”, then under what conditions?

Ideally, I’d like to allow oblique hyperplanes (i.e. splits performed on linear combinations of features). But, assuming single-feature splits could be ok if that’s the only answer available.

Example

Here are two regression trees defined on a 2d input space:

The figure shows how each tree partitions input space, and the output for each region (coded in grayscale). Colored numbers indicate regions of input space: 3,4,5,6 correspond to leaf nodes. 1 is the union of 3 & 4, etc.

Now suppose we average the output of trees A and B:

The average output is plotted on the left, with the decision boundaries of trees A and B superimposed. In this case, it’s possible to construct a single, deeper tree whose output is equivalent to the average (plotted on the right). Each node corresponds to a region of input space that can be constructed out of the regions defined by trees A and B (indicated by colored numbers on each node; multiple numbers indicate the intersection of two regions). Note that this tree is not unique–we could have started building from tree B instead of tree A.

This example shows that there exist cases where the answer is “yes”. I’d like to know whether this is always true.

Yes, the weighted sum of a regression trees is equivalent to a single (deeper) regression tree.

Universal function approximator

A regression tree is a universal function approximator (see e.g. cstheory). Most research on universal function approximations is done on artifical neural networks with one hidden layer (read this great blog). However, most machine learning algorithms are universal function approximations.

Being a universal function approximator means that any arbitrary function can be approximately represented. Thus, no matter how complex the function gets, a universal function approximation can represent it with any desired precision. In case of a regression tree you can imagine an infinitely deep one. This infinitely deep tree can assign any value to any point in space.

Since a weighted sum of a regression tree is another arbitrary function, there exists another regression tree that represents that function.

An algorithm to create such a tree

Knowing that such a tree exist is great. But we also want a recipe to create them. Such and algorithm would combine two given trees $T_1$ and $T_2$ and create a new one. A solution is copy-pasting $T_2$ at every leaf node of $T1$. The output value of the leaf nodes of the new tree is then the weighted sum of the leaf node of $T1$ (somewhere in the middle of the tree) and the leaf node of $T2$.

The example below shows two simple trees that are added with weight 0.5. Note that one node will never be reached, because there does not exist a number that is smaller than 3 and larger than 5. This indicates that these trees can be improved, but it does not make them invalid.

Why use more complex algorithms

An interesting additional question was raised by @usεr11852 in the comments: why would we use boosting algorithms (or in fact any complex machine learning algorithm) if every function can be modelled with a simple regression tree?

Regression trees can indeed represent any function but that is only one criteria for a machine learning algorithm. One important other property is how well they generalise. Deep regression trees are prone to overfitting, i.e. they do not generalise well. A random forest averages lots of deep trees to prevent this.