# A simple & clear explanation of the Gini impurity?

In the context of decision tree splitting, it is not obvious why the Gini impurity

$$1-\sum\limits_{j=1}^k p^2(j|t)$$

is a measure of the impurity of node $$t$$. Is there an easy explanation of this?

Imagine an experiment with $$k$$ possible output categories. Category $$j$$ has a probability of occurrence $$p(j|t)$$ (where $$j=1,\dots,k$$).

Run the experiment twice, independently, and make these observations:

• the probability of obtaining two identical outputs of category $$j$$ is $$p^2(j|t)$$
• the probability of obtaining two identical outputs, independently of their category, is: $$\sum\limits_{j=1}^k p^2(j|t)$$
• the probability of obtaining two different outputs is thus: $$1-\sum\limits_{j=1}^k p^2(j|t)$$

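That’s it: the Gini impurity is simply the probability of obtaining two different outputs, which is an “impurity measure”. It is 0 when a single category has probability 1 (a pure node) and grows as the probabilities spread across several categories.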
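The interpretation above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names `gini_impurity` and `simulate_two_draws`, the example probabilities, the number of trials, and the seed are all assumptions chosen for illustration. It computes $$1-\sum\limits_{j=1}^k p^2(j|t)$$ directly and compares it with the observed fraction of trials in which two independent draws fall in different categories.

```python
import random


def gini_impurity(probs):
    """Return 1 - sum(p_j^2) for a list of category probabilities."""
    return 1.0 - sum(p * p for p in probs)


def simulate_two_draws(probs, n_trials=100_000, seed=0):
    """Estimate the probability that two independent draws differ."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    categories = range(len(probs))
    different = 0
    for _ in range(n_trials):
        # Two independent draws (with replacement) from the same distribution.
        a, b = rng.choices(categories, weights=probs, k=2)
        if a != b:
            different += 1
    return different / n_trials


# Hypothetical node with k = 3 categories.
probs = [0.5, 0.3, 0.2]
exact = gini_impurity(probs)        # 1 - (0.25 + 0.09 + 0.04) = 0.62
estimate = simulate_two_draws(probs)

print(f"exact Gini impurity:  {exact:.4f}")
print(f"simulated estimate:   {estimate:.4f}")
```

With enough trials the simulated estimate should land close to the exact value, which is what the "two different outputs" reading predicts.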

Remark: writing $$p_j$$ for $$p(j|t)$$, another expression of the Gini impurity (also called the Gini index) is:
$$\sum\limits_{j=1}^k p_j(1-p_j)$$
This is the same quantity, because the probabilities sum to 1:
$$\sum\limits_{j=1}^k p_j(1-p_j) = \left(\sum\limits_{j=1}^k p_j \right) -\left( \sum\limits_{j=1}^k p^2_j \right) = 1 - \sum\limits_{j=1}^k p^2_j$$
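The equality of the two expressions can also be verified on a few concrete distributions. This is only an illustrative sketch; the function names `gini_sum_form` and `gini_complement_form` and the test distributions are assumptions, not part of the original post.

```python
import math


def gini_sum_form(probs):
    """Gini impurity written as sum of p_j * (1 - p_j)."""
    return sum(p * (1 - p) for p in probs)


def gini_complement_form(probs):
    """Gini impurity written as 1 - sum of p_j^2."""
    return 1 - sum(p * p for p in probs)


# Hypothetical distributions: a pure node, a skewed node, a uniform node.
distributions = [
    [1.0, 0.0, 0.0],
    [0.7, 0.2, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]

for probs in distributions:
    a = gini_sum_form(probs)
    b = gini_complement_form(probs)
    # Both forms should agree up to floating-point rounding.
    print(probs, round(a, 6), round(b, 6), math.isclose(a, b, abs_tol=1e-12))
```

The two forms only coincide when the probabilities sum to 1, which is the step used in the derivation above.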