The eigenvectors of the covariance matrix give the directions of maximum variance (the first eigenvector is the direction in which the data varies the most, and so on); this is the basis of principal component analysis (PCA).
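For concreteness, here is a minimal sketch of that procedure: center the data, form the covariance matrix, and eigendecompose it. The toy data is illustrative only.

```python
import numpy as np

# Toy data with one dominant direction of variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])

Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric

order = np.argsort(eigvals)[::-1]     # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("variance along each component:", eigvals)
print("first principal direction:", eigvecs[:, 0])
```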
I was wondering what it would mean to look at the eigenvectors/eigenvalues of a mutual information matrix: would they point in the direction of maximum entropy?
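To make the question concrete, here is a hedged sketch of one reading of it: take the "mutual information matrix" to be the symmetric matrix whose entry (i, j) is an estimate of I(X_i; X_j) between features i and j (that interpretation, the histogram-based plug-in estimator, and the toy data are all my assumptions, not given above), and then look at its eigendecomposition.

```python
import numpy as np

def mutual_info(x, y, bins=10):
    """Plug-in MI estimate (in nats) from a 2D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # skip empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
X[:, 1] += 0.8 * X[:, 0]                  # make features 0 and 1 dependent

d = X.shape[1]
M = np.array([[mutual_info(X[:, i], X[:, j]) for j in range(d)]
              for i in range(d)])

eigvals, eigvecs = np.linalg.eigh(M)      # M is symmetric by construction
print("eigenvalues (descending):", eigvals[::-1])
print("leading eigenvector:", eigvecs[:, -1])
```

Note one structural difference from PCA: the diagonal entries here are (estimates of) the marginal entropies H(X_i) rather than variances, so the analogy to "directions of maximum variance" is not exact.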
While it is not a direct answer (it concerns pointwise mutual information rather than mutual information), have a look at this paper relating word2vec to a singular value decomposition of the PMI matrix:
- O. Levy and Y. Goldberg, "Neural Word Embedding as Implicit Matrix Factorization", NIPS 2014.
> We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. We find that another embedding method, NCE, is implicitly factorizing a similar matrix, where each cell is the (shifted) log conditional probability of a word given its context. We show that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks. When dense low-dimensional vectors are preferred, exact factorization with SVD can achieve solutions that are at least as good as SGNS's solutions for word similarity tasks. On analogy questions SGNS remains superior to SVD. We conjecture that this stems from the weighted nature of SGNS's factorization.
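The Shifted Positive PMI construction the abstract mentions is straightforward to sketch. Below is a minimal version under stated assumptions: `counts` is a word-by-context co-occurrence count matrix you have already built from a corpus, `k` is the negative-sampling constant (the global shift is log k in the paper), and taking `U * sqrt(S)` as the word vectors is one common choice, not the only one. This follows the construction described in the abstract, not the authors' released code.

```python
import numpy as np

def sppmi_svd(counts, k=5, dim=2):
    """Shifted Positive PMI from co-occurrence counts, factorized by SVD."""
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total   # P(w)
    pc = counts.sum(axis=0, keepdims=True) / total   # P(c)
    pwc = counts / total                             # P(w, c)
    with np.errstate(divide="ignore"):
        pmi = np.log(pwc / (pw * pc))                # -inf where counts are 0
    sppmi = np.maximum(pmi - np.log(k), 0.0)         # shift by log k, clip at 0
    U, S, Vt = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])             # word vectors

# Tiny illustrative co-occurrence matrix (3 words x 3 contexts).
counts = np.array([[10., 2., 0.],
                   [3., 8., 1.],
                   [0., 1., 12.]])
print(sppmi_svd(counts, k=2, dim=2))
```

Note that, unlike your covariance question, the PMI matrix here is generally rectangular (words by contexts), which is why the paper uses SVD rather than an eigendecomposition.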