Mathematics behind classification and regression trees

Can anyone help explain some of the mathematics behind classification in CART? I’m looking to understand how two main stages happen. For instance I trained a CART classifier on a dataset and used a testing dataset to mark its predictive performance but:

  1. How is the initial root of the tree chosen?

  2. Why and how is each branch formed?

My dataset being 400 thousand records with 15 columns and 23 classes achieves a 100% accuracy from a confusion matrix, I use 10-fold crossvalidation on the dataset. I would be really greatful if anyone could help explain the stages of CART classification?


CART and decision trees like algorithms work through recursive partitioning of the training set in order to obtain subsets that are as pure as possible to a given target class. Each node of the tree is associated to a particular set of records $T$ that is splitted by a specific test on a feature. For example, a split on a continuous attribute $A$ can be induced by the test $ A \le x$. The set of records $T$ is then partitioned in two subsets that leads to the left branch of the tree and the right one.

$T_l = \{ t \in T: t(A) \le x \}$


$T_r = \{ t \in T: t(A) > x \}$

Similarly, a categorical feature $B$ can be used to induce splits according to its values. For example, if $B = \{b_1, \dots, b_k\}$ each branch $i$ can be induced by the test $B = b_i$.

The divide step of the recursive algorithm to induce decision tree takes into account all possible splits for each feature and tries to find the best one according to a chosen quality measure: the splitting criterion. If your dataset is induced on the following scheme

$$A_1, \dots, A_m, C$$

where $A_j$ are attributes and $C$ is the target class, all candidates splits are generated and evaluated by the splitting criterion. Splits on continuous attributes and categorical ones are generated as described above. The selection of the best split is usually carried out by impurity measures. The impurity of the parent node has to be decreased by the split. Let $(E_1, E_2, \dots, E_k)$ be a split induced on the set of records $E$, a splitting criterion that makes used of the impurity measure $I(\cdot)$ is:

$$\Delta = I(E) – \sum_{i=1}^{k}\frac{|E_i|}{|E|}I(E_i)$$

Standard impurity measures are the Shannon entropy or the Gini index. More specifically, CART uses the Gini index that is defined for the set $E$ as following. Let $p_j$ be the fraction of records in $E$ of class $c_j$
$$p_j = \frac{|\{t \in E:t[C] = c_j\}|}{|E|} $$ then
$$ \mathit{Gini}(E) = 1 – \sum_{j=1}^{Q}p_j^2$$
where $Q$ is the number of classes.

It leads to a 0 impurity when all records belong to the same class.

As an example, let’s say that we have a binary class set of records $T$ where the class distribution is $(1/2, 1/2)$ – the following is a good split for $T$

Good split

the probability distribution of records in $T_l$ is $(1,0)$ and the $T_r$’s one is $(0,1)$. Let’s say that $T_l$ and $T_r$ are the same size, thus $|T_l|/|T| = |T_r|/|T| = 1/2$. We can see that $\Delta$ is high:

$$\Delta = 1 – 1/2^2 – 1/2^2 – 0 – 0 = 1/2$$

The following split is worse than the first one and the splitting criterion $\Delta$ reflects this characteristic.
Bad split
$$\Delta = 1 – 1/2^2 – 1/2^2 – 1/2 \bigg( 1 – (3/4)^2 – (1/4)^2 \bigg) – 1/2 \bigg( 1 – (1/4)^2 – (3/4)^2 \bigg) = 1/2 – 1/2(3/8) – 1/2(3/8) = 1/8$$

The first split will be selected as best split and then the algorithm proceeds in a recursive fashion.

It is easy to classify a new instance with a decision tree, in fact it is enough to follow the path from the root node to a leaf. A record is classified with the majority class of the leaf that it reaches.

Say that we want to classify the square on this figure

Two feature dataset

that is the graphical representation of a training set induced on the scheme $A,B,C$, where $C$ is the target class and $A$ and $B$ are two continuous features.

A possible induced decision tree might be the following:
enter image description here

It is clear that the record square will be classified by the decision tree as a circle given that the record falls on a leaf labeled with circles.

In this toy example the accuracy on the training set is 100% because no record is mis-classified by the tree. On the graphical representation of the training set above we can see the boundaries (gray dashed lines) that the tree uses to classify new instances.

There is plenty of literature on decision trees, I wanted just to write down a sketchy introduction. Another famous implementation is C4.5.

Source : Link , Question Author : G Gr , Answer Author : Simone

Leave a Comment