PCA on high-dimensional text data before random forest classification?

Does it make sense to do PCA before carrying out a Random Forest Classification?

I’m dealing with high-dimensional text data, and I want to do feature reduction to help avoid the curse of dimensionality, but don’t Random Forests already do some sort of dimension reduction?


Leo Breiman wrote that “dimensionality can be a blessing”. In general, random forests can run on large data sets without problems. How large is your data? Different fields handle high-dimensional inputs differently, depending on subject-matter knowledge. For example, in gene expression studies genes are often discarded based on low variance (without looking at the outcome) in a process sometimes called non-specific filtering. This can reduce the running time of random forests, but it is not required.
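As a rough illustration of non-specific filtering, here is a minimal sketch in plain Python. The function name `filter_low_variance`, the toy matrix, and the threshold of 0.1 are all made up for this example; in practice the threshold would come from subject-matter knowledge. Note that the outcome labels are never used.

```python
from statistics import pvariance


def filter_low_variance(rows, threshold):
    """Keep only the columns whose variance across rows exceeds `threshold`.

    The class labels are never consulted, which is what makes the
    filter "non-specific". (Hypothetical helper, for illustration only.)
    """
    columns = list(zip(*rows))  # transpose: one tuple per feature
    keep = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    filtered = [[row[j] for j in keep] for row in rows]
    return filtered, keep


# Toy data: 4 samples, 3 features. Feature 0 is constant, feature 1 barely
# moves, and feature 2 varies a lot.
rows = [
    [0.0, 1.0, 5.0],
    [0.0, 1.1, 1.0],
    [0.0, 0.9, 9.0],
    [0.0, 1.0, 3.0],
]

filtered, kept = filter_low_variance(rows, threshold=0.1)
print(kept)  # -> [2]
```

Only the high-variance feature survives, so the forest would have fewer columns to consider at each split.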

Sticking with the gene expression example, sometimes analysts use PCA scores to represent gene expression measurements. The idea is to replace similar profiles with one score that is potentially less messy. Random forests can be run either on the original variables or on the PCA scores (a surrogate for the variables). Some have reported better results with this approach, but to my knowledge there are no good comparisons.
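One way to check this on your own data is to cross-validate both variants side by side. The sketch below assumes scikit-learn is available and uses a synthetic data set as a stand-in for real features; the sample size, the number of components (20), and the number of trees are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a wide feature matrix (assumption: not real text data).
X, y = make_classification(
    n_samples=300,
    n_features=200,
    n_informative=20,
    n_redundant=40,
    random_state=0,
)

# Variant 1: random forest on the original variables.
rf_raw = RandomForestClassifier(n_estimators=100, random_state=0)

# Variant 2: random forest on PCA scores. Putting PCA inside the pipeline
# means it is refit on each training fold, so the held-out fold never
# influences the projection.
rf_pca = make_pipeline(
    PCA(n_components=20, random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

scores_raw = cross_val_score(rf_raw, X, y, cv=5)
scores_pca = cross_val_score(rf_pca, X, y, cv=5)

print(f"original variables: mean accuracy {scores_raw.mean():.3f}")
print(f"PCA scores:         mean accuracy {scores_pca.mean():.3f}")
```

Which variant scores higher depends on the data, so the numbers from a synthetic set like this say nothing general. For sparse bag-of-words matrices, scikit-learn's `TruncatedSVD` is a commonly used substitute for `PCA` because it works on sparse input without centering it.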

In sum, there is no need to do PCA before running RF. But you can. The interpretation could change depending on your goals. If all you want to do is predict, the interpretation may be less important.

Source: Link, Question Author: Maus, Answer Author: Sycorax
