I’m very confused about the difference between information gain and mutual information. To make it even more confusing, I can find sources defining them as identical and others which explain their differences:

Information gain and mutual information are the same:

- Feature Selection: Information Gain VS Mutual Information
- "An Introduction to Information Retrieval", page 285, exercise 13.13: "Show that mutual information and information gain are equivalent"; "It is thus known as the information gain, or more commonly the mutual information between X and Y"
- CS769 Spring 2010 Advanced Natural Language Processing, "Information Theory", lecturer: Xiaojin Zhu: "Information gain is also called expected mutual information"
- "Feature Selection Methods for Text Classification", Nicolette Nicolosi, http://www.cs.rit.edu/~nan2563/feature_selection.pdf

They are different:

- https://math.stackexchange.com/questions/833713/equality-of-information-gain-and-mutual-information
- Yang, "A Comparative Study on Feature Selection in Text Categorization": the two measures are treated separately, and mutual information is even discarded because it performs very badly compared to IG
- citing Yang: "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", http://www.jmlr.org/papers/volume3/forman03a/forman03a_full.pdf

I could still find other sources defending the opposite thesis, but I think these are enough. Can anyone enlighten me about the real difference / equality of these two measures?

## EDIT: other related question

**Answer**

There are two types of Mutual Information:

*Pointwise* Mutual Information and *Expected* Mutual Information

The *pointwise* Mutual Information between the values of two random variables can be defined as:

$$\mathrm{pMI}(x;y) := \log \frac{p(x,y)}{p(x)\,p(y)}$$
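As a quick sanity check of the definition, here is a minimal Python sketch; the joint distribution, its marginals, and the helper name `pmi` are made-up illustrations, not from any of the sources above:

```python
import math

# A toy joint distribution over two binary variables (hypothetical numbers),
# together with its marginals.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.6, 1: 0.4}

def pmi(x, y):
    """Pointwise mutual information: log p(x,y) / (p(x) p(y))."""
    return math.log(p_xy[(x, y)] / (p_x[x] * p_y[y]))

# (0, 0) co-occurs more often than independence would predict,
# so its pointwise MI is positive.
print(pmi(0, 0))
```

Note that pMI is a number per *value pair* (x, y), which is what distinguishes it from the expected version below.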

The *expected* Mutual Information between two random variables X and Y can be defined as the Kullback-Leibler divergence between p(X,Y) and p(X)p(Y):

$$\mathrm{eMI}(X;Y) := \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
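This definition translates directly into code. The following is a minimal Python sketch (the function name `expected_mi` and the toy distribution are my own, for illustration) that computes eMI from a joint distribution by first accumulating the marginals; for independent variables the KL divergence, and hence eMI, is zero:

```python
import math

def expected_mi(p_xy):
    """Expected MI: sum over (x, y) of p(x,y) * log p(x,y) / (p(x) p(y))."""
    # Accumulate the marginals p(x) and p(y) from the joint distribution.
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    # Terms with p(x,y) = 0 contribute nothing and are skipped.
    return sum(p * math.log(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

# For independent variables, p(x,y) = p(x) p(y), so every log term is 0.
indep = {(x, y): 0.5 * 0.5 for x in (0, 1) for y in (0, 1)}
print(expected_mi(indep))  # 0.0
```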

Sometimes you find the definition of Information Gain as $I(X;Y) := H(Y) - H(Y \mid X)$, with the entropy $H(Y)$ and the conditional entropy $H(Y \mid X)$, so

$$\begin{aligned}
I(X;Y) &= H(Y) - H(Y \mid X) \\
&= -\sum_y p(y) \log p(y) + \sum_{x,y} p(x)\,p(y \mid x) \log p(y \mid x) \\
&= \sum_{x,y} p(x,y) \log p(y \mid x) - \sum_y \Big(\sum_x p(x,y)\Big) \log p(y) \\
&= \sum_{x,y} p(x,y) \log p(y \mid x) - \sum_{x,y} p(x,y) \log p(y) \\
&= \sum_{x,y} p(x,y) \log \frac{p(y \mid x)}{p(y)} \\
&= \sum_{x,y} p(x,y) \log \frac{p(y \mid x)\,p(x)}{p(y)\,p(x)} \\
&= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)\,p(x)} \\
&= \mathrm{eMI}(X;Y)
\end{aligned}$$

Note: $p(y) = \sum_x p(x,y)$

So expected Mutual Information and Information Gain are the same (with both definitions above).
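The derivation above can also be checked numerically. Here is a small Python sketch (the joint distribution is a made-up example) that computes $H(Y) - H(Y \mid X)$ and eMI from the same joint distribution and confirms that they agree:

```python
import math

# A hypothetical joint distribution over two binary variables.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals p(x) and p(y).
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# H(Y) = -sum_y p(y) log p(y)
h_y = -sum(p * math.log(p) for p in p_y.values())

# H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), with p(y|x) = p(x,y) / p(x)
h_y_given_x = -sum(p * math.log(p / p_x[x]) for (x, _), p in p_xy.items())

# Information gain, per the definition I(X;Y) = H(Y) - H(Y|X).
info_gain = h_y - h_y_given_x

# Expected MI, per the KL-divergence definition.
emi = sum(p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# The two values agree (roughly 0.086 nats for this distribution).
print(info_gain, emi)
```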

**Attribution**
*Source: Link, Question Author: jcsun, Answer Author: chris elgoog*