From wikipedia: https://en.wikipedia.org/wiki/Decision_tree_learning
I am unable to get my head around two of the steps:
The first equation: $f_i(1 – f_i)$. This does not immediately become apparent as the “probability of being chosen times the probability of miscategorization”. Instead it looks to me like “probability of being chosen times the probability of others being chosen” (but not necessarily incorrectly)
The arithmetic of the last simplification eludes me: how to get from $1 – \sum(f_i^2)$ to $\sum(f_if_k)$
I don’t know about the algebra, but you can prove the identity with a probabilistic argument. If I roll two dice with $m$ sides and the probability of side $i$ is $f_i$, then the probability of a double is $\sum f_i^2$. Thus $1-\sum f_i^2$ is the probability that I roll distinct values. But arguing differently, the probability, say, that I get $i$ followed by $j$ is $f_if_j$. Summing over all possibilities, with $i \neq j$, I get the probability of rolling distinct outcomes: $\sum f_if_j$, and the identity is proven.
As for the first point, If you role the $m$ sided die, there is a probability $f_i$ that side $i$ comes up. Suppose I have to guess the value, and I do this by rolling a die of my own with the same weights. The probability that I guess wrong, conditional on value $i$ being true, is $1-f_i$. The probability that I get it wrong, summing over the possible values, is $\sum f_i(1-f_i)$.