In the context of decision tree splitting, it is not obvious why the Gini impurity

$$i(t) = 1 - \sum_{j=1}^{k} p^2(j \mid t)$$

is a measure of the impurity of node $t$. Is there an easy explanation of this?

**Answer**

Imagine an experiment with $k$ possible output categories. Category $j$ has a probability of occurrence $p(j \mid t)$, where $j = 1, \dots, k$.

Run the experiment twice, independently, and make these observations:

- the probability of obtaining **two identical outputs of category $j$** is $p^2(j \mid t)$
- the probability of obtaining **two identical outputs, independently of their category**, is $\sum_{j=1}^{k} p^2(j \mid t)$
- the probability of obtaining **two different outputs** is thus $1 - \sum_{j=1}^{k} p^2(j \mid t)$

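That’s it: the Gini impurity is simply the probability that two independent draws from node $t$ fall in different categories. It is 0 when the node contains a single category and grows as the categories become more evenly mixed, which is what makes it an “impurity measure”.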
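The two-draw argument can be checked numerically. The following is a minimal Python sketch, not part of the original answer: the function names and the class proportions `[0.5, 0.3, 0.2]` are made up for illustration. It compares the closed-form value with a Monte Carlo estimate of the probability that two independent draws differ.

```python
import random


def gini_impurity(probs):
    """Closed form: 1 - sum_j p_j^2 for a list of class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def estimate_mismatch_probability(probs, n_trials=100_000, seed=0):
    """Monte Carlo estimate of P(two independent draws differ)."""
    rng = random.Random(seed)
    categories = range(len(probs))
    mismatches = 0
    for _ in range(n_trials):
        # Two independent draws (with replacement) from the same distribution.
        first, second = rng.choices(categories, weights=probs, k=2)
        if first != second:
            mismatches += 1
    return mismatches / n_trials


probs = [0.5, 0.3, 0.2]  # hypothetical class proportions in node t
exact = gini_impurity(probs)
estimate = estimate_mismatch_probability(probs)
print(f"exact: {exact:.3f}, simulated: {estimate:.3f}")
```

With enough trials the simulated frequency should land close to the exact value, and a pure node such as `[1.0]` gives an impurity of 0.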

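**Remark:** another expression of the Gini index, writing $p_j$ for $p(j \mid t)$, is:

$$\sum_{j=1}^{k} p_j (1 - p_j)$$

This is the same quantity, because the probabilities sum to 1:

$$\sum_{j=1}^{k} p_j (1 - p_j) = \left(\sum_{j=1}^{k} p_j\right) - \left(\sum_{j=1}^{k} p_j^2\right) = 1 - \sum_{j=1}^{k} p_j^2$$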
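As a quick sanity check of this identity, here is a small Python sketch, with function names and example distributions chosen purely for illustration. It evaluates both expressions and confirms that they agree up to floating-point error.

```python
import math


def gini_from_squares(probs):
    """1 - sum_j p_j^2."""
    return 1.0 - sum(p * p for p in probs)


def gini_from_products(probs):
    """sum_j p_j * (1 - p_j)."""
    return sum(p * (1.0 - p) for p in probs)


# Hypothetical distributions; each must sum to 1 for the identity to hold.
examples = [[0.5, 0.3, 0.2], [0.25, 0.25, 0.25, 0.25], [1.0]]
for probs in examples:
    assert math.isclose(
        gini_from_squares(probs), gini_from_products(probs), abs_tol=1e-12
    )
```

The two forms only coincide when the probabilities are normalised, which is exactly the step $\sum_j p_j = 1$ used in the derivation.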

**Attribution**
*Source: Link, Question Author: Picaud Vincent, Answer Author: Picaud Vincent*