A simple & clear explanation of the Gini impurity?

In the context of decision tree splitting, it is not obvious why the Gini impurity
$$i(t) = 1 - \sum_{j=1}^{k} p^2(j|t)$$
is a measure of the impurity of node $t$. Is there an easy explanation of this?

Answer

Imagine an experiment with $k$ possible output categories, where category $j$ occurs with probability $p(j|t)$ for $j = 1, \dots, k$.

Repeat the experiment twice, independently, and make these observations:

  • the probability of obtaining two identical outputs of category $j$ is $p^2(j|t)$
  • the probability of obtaining two identical outputs, regardless of their category, is $\sum_{j=1}^{k} p^2(j|t)$
  • the probability of obtaining two different outputs is thus $1 - \sum_{j=1}^{k} p^2(j|t)$

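That’s it: the Gini impurity is simply the probability that two independent outputs fall in different categories. It is 0 when a single category has probability 1 (a pure node) and reaches its maximum, $1 - 1/k$, when all $k$ categories are equally likely, which is what makes it an “impurity measure”.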
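To make the argument above concrete, here is a minimal Python sketch that computes the impurity from class frequencies and compares it with a simulated estimate of the probability that two independent draws differ. The function names and the example node contents are hypothetical, chosen only for illustration, and $p(j|t)$ is assumed to be estimated by the class frequencies in the node.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum_j p_j^2, with p_j taken as the class frequencies."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def prob_two_draws_differ(labels, trials=100_000, seed=0):
    """Estimate P(two independent draws, with replacement, differ)."""
    rng = random.Random(seed)
    differ = sum(rng.choice(labels) != rng.choice(labels) for _ in range(trials))
    return differ / trials


# Hypothetical node: 5 samples of class "a", 3 of "b", 2 of "c".
node = ["a"] * 5 + ["b"] * 3 + ["c"] * 2

exact = gini_impurity(node)             # 1 - (0.5^2 + 0.3^2 + 0.2^2)
estimate = prob_two_draws_differ(node)  # Monte Carlo estimate of the same quantity

print(f"exact impurity:     {exact:.4f}")
print(f"simulated estimate: {estimate:.4f}")
```

The two printed numbers should agree up to sampling noise, and a node containing a single class should give an impurity of exactly 0.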


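Remark: another expression of the Gini impurity (also called the Gini index), writing $p_j = p(j|t)$, is:
$$\sum_{j=1}^{k} p_j (1 - p_j)$$
This is the same quantity, because the probabilities sum to 1:
$$\sum_{j=1}^{k} p_j (1 - p_j) = \left(\sum_{j=1}^{k} p_j\right) - \left(\sum_{j=1}^{k} p_j^2\right) = 1 - \sum_{j=1}^{k} p_j^2$$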
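As a quick numerical check of the identity above, the following sketch evaluates both expressions on one probability vector. The helper names and the example probabilities are assumptions made for illustration; any non-negative vector that sums to 1 should behave the same way.

```python
import math


def gini_from_squares(p):
    """1 - sum_j p_j^2"""
    return 1.0 - sum(pj ** 2 for pj in p)


def gini_from_products(p):
    """sum_j p_j * (1 - p_j)"""
    return sum(pj * (1.0 - pj) for pj in p)


# Hypothetical class probabilities for a node with k = 3 categories.
p = [0.5, 0.3, 0.2]

a = gini_from_squares(p)
b = gini_from_products(p)

# The two forms agree up to floating-point rounding.
print(a, b, math.isclose(a, b))
```

Both forms come out to about 0.62 for this vector; they only coincide because the $p_j$ sum to 1, so an unnormalized vector would break the equality.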

Attribution
Source: Link, Question Author: Picaud Vincent, Answer Author: Picaud Vincent
