# Information gain and mutual information: different or equal?

I’m very confused about the difference between information gain and mutual information. To make it even more confusing, I can find both sources defining them as identical and others explaining their differences:

Information gain and Mutual information are the same:

• Feature Selection: Information Gain VS Mutual Information
• An introduction to information retrieval: “Show that mutual information and information gain are equivalent”, page 285, exercise 13.13.
• “It is thus known as the information gain, or more commonly the mutual information between X and Y” –> CS769 Spring 2010 Advanced Natural Language Processing, “Information Theory”, lecturer: Xiaojin Zhu
• “Information gain is also called expected mutual information” –> “Feature Selection Methods for Text Classification”, Nicolette Nicolosi, http://www.cs.rit.edu/~nan2563/feature_selection.pdf

They are different:

• little bit of confusion

I could still find other sources defending either thesis, but I think these are enough. Can anyone enlighten me about the real difference or equality of these two measures?

## EDIT: other related question

Information gain, mutual information and related measures

There are two types of Mutual Information:

• Pointwise Mutual Information and
• Expected Mutual Information

The pointwise Mutual Information between the values of two random variables can be defined as:
$$pMI(x;y) := \log \frac{p(x,y)}{p(x)p(y)}$$
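As a quick numeric sketch of pMI (the joint distribution below is made up purely for illustration, not taken from anywhere in the post):

```python
import math

# Illustrative joint distribution of two binary variables X and Y
# (assumed values, chosen only for demonstration).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def marginal_x(x):
    return sum(p for (xi, _), p in p_xy.items() if xi == x)

def marginal_y(y):
    return sum(p for (_, yi), p in p_xy.items() if yi == y)

def pmi(x, y):
    """Pointwise mutual information of one outcome pair (x, y)."""
    return math.log(p_xy[(x, y)] / (marginal_x(x) * marginal_y(y)))
```

A positive `pmi(x, y)` means the pair co-occurs more often than it would under independence; here `pmi(0, 0)` = log(0.4 / (0.5 · 0.6)) > 0, while `pmi(0, 1)` is negative.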

The expected Mutual Information between two random variables $$X$$ and $$Y$$ can be defined as the Kullback-Leibler divergence between $$p(X,Y)$$ and $$p(X)p(Y)$$:
$$eMI(X;Y) := \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$$
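This sum translates directly into code. A minimal sketch, assuming a made-up joint distribution and its marginals (the numbers are illustrative, not from the post):

```python
import math

# Illustrative joint distribution and its marginals (assumed numbers,
# chosen only to demonstrate the formula).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.6, 1: 0.4}

def expected_mi(p_xy, p_x, p_y):
    """eMI(X;Y) = sum_{x,y} p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    return sum(p * math.log(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items()
               if p > 0)  # terms with p(x,y) = 0 contribute nothing
```

Because it is a KL divergence, eMI is always non-negative, and it is zero exactly when $$X$$ and $$Y$$ are independent, i.e. when $$p(x,y) = p(x)p(y)$$ everywhere.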

Sometimes you find the definition of Information Gain as $$I(X; Y) := H(Y) - H(Y \mid X)$$ with the entropy $$H(Y)$$ and the conditional entropy $$H(Y \mid X)$$, so

$$\begin{align} I(X; Y) &= H(Y) - H(Y \mid X)\\ &= - \sum_y p(y) \log p(y) + \sum_{x,y} p(x) p(y\mid x) \log p(y\mid x)\\ &= \sum_{x,y} p(x, y) \log{p(y\mid x)} - \sum_{y} \left(\sum_{x}p(x,y)\right) \log p(y)\\ &= \sum_{x,y} p(x, y) \log{p(y\mid x)} - \sum_{x,y}p(x, y) \log p(y)\\ &= \sum_{x,y} p(x, y) \log \frac{p(y\mid x)}{p(y)}\\ &= \sum_{x,y} p(x, y) \log \frac{p(y\mid x)p(x)}{p(y)p(x)}\\ &= \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(y)p(x)}\\ &= eMI(X;Y) \end{align}$$

Note: $$p(y) = \sum_x p(x,y)$$

So expected Mutual Information and Information Gain are the same (under both definitions above).
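The derivation can also be checked numerically. A minimal sketch, using an arbitrary made-up joint distribution (the numbers are assumptions, not from the answer):

```python
import math

# Made-up joint distribution of two binary variables (illustrative only).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.6, 1: 0.4}

# eMI via the KL-divergence form.
emi = sum(p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# Information Gain via H(Y) - H(Y|X).
h_y = -sum(p * math.log(p) for p in p_y.values())
# H(Y|X) = -sum_{x,y} p(x,y) * log p(y|x), with p(y|x) = p(x,y) / p(x)
h_y_given_x = -sum(p * math.log(p / p_x[x]) for (x, y), p in p_xy.items())
ig = h_y - h_y_given_x

assert abs(emi - ig) < 1e-12  # the two quantities agree
```

Any valid joint distribution gives the same agreement, since the derivation above is purely algebraic.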