Are there any general guidelines with respect to the input data characteristics, that can be used to decide between applying PCA versus LSA/LSI?
Brief summary of PCA vs. LSA/LSI:
Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI) are similar in the sense that all of them rely fundamentally on the application of the Singular Value Decomposition (SVD) to a matrix.
LSA and LSI are, as far as I can tell, the same thing. LSA differs from PCA not fundamentally, but in terms of the way the matrix entries are pre-processed prior to applying the SVD.
In LSA the preprocessing step typically involves normalizing a count matrix in which columns correspond to documents and rows correspond to terms (words). Entries can be thought of as some kind of (normalized) word-occurrence count for each document.
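As a concrete illustration of that preprocessing step, here is a minimal NumPy sketch of one common LSA pipeline (tf-idf weighting followed by a truncated SVD); the counts and the choice of tf-idf weighting are my own assumptions, not a fixed part of LSA:

```python
import numpy as np

# Hypothetical 4-term x 3-document count matrix (rows = terms, columns = documents).
counts = np.array([
    [2, 0, 1],
    [0, 3, 1],
    [1, 1, 0],
    [0, 0, 2],
], dtype=float)

# One common normalization: tf-idf weighting of the raw counts.
tf = counts / counts.sum(axis=0, keepdims=True)   # term frequency within each document
df = (counts > 0).sum(axis=1)                     # document frequency of each term
idf = np.log(counts.shape[1] / df)                # inverse document frequency
tfidf = tf * idf[:, np.newaxis]

# LSA/LSI: truncated SVD of the weighted matrix, keeping k latent dimensions.
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
terms_latent = U[:, :k] * s[:k]       # term representations in the latent space
docs_latent = Vt[:k, :].T * s[:k]     # document representations in the latent space
```

Other weighting schemes (log-entropy, plain row normalization) slot into the same place; the essential LSA step is the truncated SVD at the end.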
In PCA the preprocessing step involves computing the covariance matrix from the original matrix. The original matrix is conceptually more ‘general’ in nature than in the case of LSA: where PCA is concerned, the columns are usually said to be generic sample vectors and the rows are the individual variables being measured. The covariance matrix is by definition square and symmetric, and in fact it is not necessary to apply the SVD, because the covariance matrix can instead be diagonalized via an eigendecomposition. Notably, the covariance matrix will almost certainly be denser than the LSA/LSI input matrix – a zero entry occurs only where the covariance between two variables is zero, that is, where the variables are uncorrelated.
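To make the PCA route concrete (and to show that the two computations agree), here is a small NumPy sketch on made-up data: diagonalizing the covariance matrix yields the same spectrum as taking the SVD of the centered data matrix, with eigenvalues equal to the squared singular values divided by n − 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))   # 5 variables (rows) x 100 samples (columns)

# PCA preprocessing: center each variable, then form the covariance matrix.
Xc = X - X.mean(axis=1, keepdims=True)
C = (Xc @ Xc.T) / (X.shape[1] - 1)   # 5x5, square and symmetric

# Route 1: diagonalize the symmetric covariance matrix directly.
eigvals, eigvecs = np.linalg.eigh(C)

# Route 2: SVD of the centered data matrix -- no covariance matrix needed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# The two routes agree: eigenvalues of C are s**2 / (n - 1), up to ordering.
assert np.allclose(np.sort(s**2 / (X.shape[1] - 1)), np.sort(eigvals))
```

The centering step is the key difference from the LSA pipeline: LSA applies the SVD to the (weighted) matrix as-is, while PCA first subtracts each variable's mean.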
Finally, one more descriptive point that is made fairly frequently to distinguish the two is that
LSA seeks the best linear subspace in the Frobenius norm, while PCA seeks the best affine linear subspace.
In any case, the differences and similarities of these techniques have been hotly debated in various forums around the internet; clearly there are some salient differences, and clearly the two techniques will produce different results.
Thus I repeat my question: are there any general guidelines, with respect to the characteristics of the input data, that can be used to decide between applying PCA versus LSA/LSI? If I have something resembling a term-document matrix, will LSA/LSI always be the best choice? Might I expect to get better results in some cases by preparing the term-document matrix as for LSA/LSI and then applying PCA to the result, instead of applying the SVD directly?
One difference I noted was that PCA can only give you either the term-term or the document-document similarity (depending on which Gram matrix you form, AA∗ or A∗A), but SVD/LSA can deliver both, since you get eigenvectors of both AA∗ and A∗A at once. Actually, I don’t see a reason to ever use PCA over SVD.
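The point about getting both sides at once can be verified numerically: from a single SVD of a (made-up) term-document matrix A, the columns of U are eigenvectors of AA∗ and the rows of Vt are eigenvectors of A∗A, both with eigenvalues equal to the squared singular values. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))   # hypothetical 6-term x 4-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

term_term = A @ A.T   # AA* -- term-term structure
doc_doc = A.T @ A     # A*A -- document-document structure

# Each left singular vector is an eigenvector of AA*, and each right
# singular vector is an eigenvector of A*A, with eigenvalue s[i]**2.
for i in range(len(s)):
    u = U[:, i]
    v = Vt[i, :]
    assert np.allclose(term_term @ u, s[i] ** 2 * u)
    assert np.allclose(doc_doc @ v, s[i] ** 2 * v)
```

So a single decomposition serves both similarity views, whereas forming one of the Gram matrices up front commits you to one side.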