I understand that PCA is used for dimensionality reduction, for example to plot datasets in 2D or 3D. But I have also seen people apply PCA as a preprocessing step in classification scenarios: they use PCA to reduce the number of features, then take the projections onto some principal components (the eigenvectors of the covariance matrix) as the new features.

My questions:

What effect does this have on classification performance?

When should such a preprocessing step be applied?

I have a dataset with 10 real-valued features and 600 binary features that represent categorical attributes via one-to-many (one-hot) encoding. Would applying PCA here make sense and produce better results?

p.s. if the question is too broad, I would be thankful if you could point me to a paper or tutorial that explains the details of using PCA in this manner.

p.s. after reading a little more, I found that it might be better to use Latent Semantic Analysis to reduce the number of binary features for the categorical attributes. That way I would leave the real-valued features untouched, preprocess only the binary features, then combine the real-valued features with the new features and train my classifier. What do you think?
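A minimal sketch of that pipeline, assuming scikit-learn's `TruncatedSVD` as the LSA step (the dataset here is random placeholder data with the shapes described above; component count is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_real = rng.normal(size=(200, 10))                         # 10 real-valued features
X_bin = rng.integers(0, 2, size=(200, 600)).astype(float)   # 600 one-hot/binary features

# Reduce only the binary block with truncated SVD (the core of LSA),
# leaving the real-valued features untouched.
svd = TruncatedSVD(n_components=50, random_state=0)
X_bin_reduced = svd.fit_transform(X_bin)

# Concatenate the untouched real features with the reduced binary features.
X_combined = np.hstack([X_real, X_bin_reduced])
print(X_combined.shape)  # (200, 60)
```

The combined matrix would then be fed to whatever classifier you choose.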

**Answer**

Using PCA for feature selection (removing non-predictive features) is an extremely expensive way to do it: PCA algorithms are often O(n³) in the number of features. A much better and more efficient approach is to use a measure of dependence between each feature and the class. Mutual Information tends to perform very well here; furthermore, it is the only measure of dependence that (a) fully generalizes and (b) has a solid philosophical foundation, based on the Kullback-Leibler divergence.

For example, we compute (using a maximum-likelihood probability approximation with some smoothing)

MI-above-expected = MI(F, C) − E_{X, N}[MI(X, C)]

where the second term is the expected mutual information between a random feature X and the class, given N examples. We then sort the features by MI-above-expected and keep the top M.
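A rough sketch of top-M selection by mutual information, using scikit-learn's `mutual_info_classif` (note this uses a nearest-neighbor MI estimator rather than the smoothed maximum-likelihood estimate with the expected-MI correction described above, and the dataset is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: 20 features, only 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)  # estimated MI(F, C) per feature
M = 5
top_idx = np.argsort(mi)[::-1][:M]              # indices of the M highest-MI features
X_selected = X[:, top_idx]
print(X_selected.shape)  # (300, 5)
```

Ranking all n features this way costs one MI estimate per feature, which is far cheaper than an O(n³) eigendecomposition.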

The reason one would want to use PCA is the expectation that many of the features are in fact dependent. This is particularly handy for Naive Bayes, where independence is assumed. The datasets I've worked with have always been far too large for PCA, so I don't use it and have to rely on more sophisticated methods. But if your dataset is small and you don't have the time to investigate more sophisticated methods, then by all means go ahead and apply an out-of-the-box PCA.
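An out-of-the-box PCA preprocessing step might look like the following sketch (scikit-learn on its bundled digits dataset; the 95% variance threshold and logistic regression classifier are arbitrary illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Putting PCA inside the pipeline ensures the projection is learned
# on the training folds only, never on the held-out fold.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),   # keep components explaining 95% of the variance
    LogisticRegression(max_iter=2000),
)
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```

Comparing this score against the same pipeline without the PCA step is the simplest way to check whether the reduction helps or hurts on your data.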

*Attribution: Source: Link, Question Author: Jack Twain, Answer Author: samthebest*