In the context of decision tree splitting, it is not obvious why the Gini impurity
$$i(t) = 1 - \sum_{j=1}^{k} p^2(j \mid t)$$
is a measure of the impurity of node $t$. Is there an easy explanation of this?
Answer
Imagine an experiment with $k$ possible output categories. Category $j$ occurs with probability $p(j \mid t)$, where $j = 1, \dots, k$.
Repeat the experiment twice, independently, and make these observations:
- the probability of obtaining two identical outputs of category $j$ is $p^2(j \mid t)$
- the probability of obtaining two identical outputs, regardless of their category, is $\sum_{j=1}^{k} p^2(j \mid t)$
- the probability of obtaining two different outputs is thus $1 - \sum_{j=1}^{k} p^2(j \mid t)$
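That’s it: the Gini impurity is simply the probability of obtaining two different outputs. It is 0 when a single category has probability 1 (a pure node) and reaches its maximum, $1 - 1/k$, when all $k$ categories are equally likely, which is why it works as an “impurity measure”.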
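The argument above can be checked numerically. The following is a minimal sketch, not part of the original answer: the function names and the example distribution `[0.5, 0.3, 0.2]` are made up for illustration. It draws two independent outputs many times and compares the observed fraction of mismatches with the closed-form value.

```python
import random


def gini_impurity(probs):
    """Return 1 - sum_j p_j^2 for a list of category probabilities."""
    return 1.0 - sum(p * p for p in probs)


def estimate_mismatch_probability(probs, n_trials=200_000, seed=0):
    """Estimate P(two independent draws differ) by simulation."""
    rng = random.Random(seed)
    categories = range(len(probs))
    # Two independent runs of the experiment, n_trials times each.
    first = rng.choices(categories, weights=probs, k=n_trials)
    second = rng.choices(categories, weights=probs, k=n_trials)
    mismatches = sum(a != b for a, b in zip(first, second))
    return mismatches / n_trials


# Hypothetical class distribution at a node (an assumption for this example).
probs = [0.5, 0.3, 0.2]

exact = gini_impurity(probs)  # 1 - (0.25 + 0.09 + 0.04), i.e. about 0.62
estimate = estimate_mismatch_probability(probs)

print(f"closed form: {exact:.4f}, simulated: {estimate:.4f}")
```

With enough trials the simulated value should land close to the closed-form one; the exact figure printed depends on the random seed.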
Remark: writing $p_j = p(j \mid t)$, another expression of the Gini impurity is:
$$\sum_{j=1}^{k} p_j (1 - p_j)$$
This is the same quantity, because the probabilities sum to 1:
$$\sum_{j=1}^{k} p_j (1 - p_j) = \sum_{j=1}^{k} p_j - \sum_{j=1}^{k} p_j^2 = 1 - \sum_{j=1}^{k} p_j^2$$
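As a quick sanity check of this identity, here is a small sketch (the function names and test distributions are illustrative assumptions, not from the original answer) that evaluates both expressions on a few distributions and confirms they agree up to floating-point error.

```python
import math


def gini_one_minus_squares(probs):
    """Gini impurity written as 1 - sum_j p_j^2."""
    return 1.0 - sum(p * p for p in probs)


def gini_p_times_one_minus_p(probs):
    """Gini impurity written as sum_j p_j * (1 - p_j)."""
    return sum(p * (1.0 - p) for p in probs)


# A few hypothetical distributions, each summing to 1.
distributions = ([1.0], [0.5, 0.5], [0.5, 0.3, 0.2], [0.25] * 4)

for probs in distributions:
    # The identity relies on the probabilities summing to 1.
    assert math.isclose(sum(probs), 1.0)
    a = gini_one_minus_squares(probs)
    b = gini_p_times_one_minus_p(probs)
    assert math.isclose(a, b, abs_tol=1e-12)
    print(probs, round(a, 6))
```

Note that the two forms only coincide when the $p_j$ sum to 1; for unnormalized weights they would differ.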
Attribution
Source: Link, Question Author: Picaud Vincent, Answer Author: Picaud Vincent