In a Kaggle digit-recognition challenge, I saw someone apply PCA before training a decision tree and other models.

I thought PCA was just for compressing data, but he used it to improve his score.

How can PCA improve the score in this case? Is it because there is less overfitting?

**Answer**

Dadi Perlmutter once said: “What is the difference between theory and practice? In theory they are the same while in practice they are different”. This is one of those cases.

Methods like neural networks are often trained with methods derived from gradient descent. In theory, given an infinite number of iterations and restarts, the algorithm would converge to the same result regardless of the coordinate system. In practice, neural networks suffer from the “curse of dimensionality”, so using PCA to reduce the dimension of the data can improve both the speed of convergence and the quality of the results. The PCA-informed transformation of the data (centering, rotating and scaling) contributes to this as well.
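To make the convergence point concrete, here is a minimal NumPy sketch (not from the original answer). It runs plain gradient descent on a linear least-squares problem twice: once on raw centered features and once on the same features after PCA rotation and scaling (whitening). The toy data, the helper names `pca_whiten` and `gd_steps`, the step-size rule and the tolerance are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two strongly correlated features on very different scales.
Z = rng.normal(size=(500, 2))
X = Z @ np.array([[10.0, 9.0], [0.0, 0.5]])
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=500)

# Center features and target so a linear model without an intercept suffices.
X = X - X.mean(axis=0)
y = y - y.mean()


def pca_whiten(X, eps=1e-12):
    """Rotate centered data onto its principal axes and scale each axis to unit variance."""
    cov = X.T @ X / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return (X @ eigvecs) / np.sqrt(eigvals + eps)


def gd_steps(X, y, tol=1e-6, max_steps=200_000):
    """Gradient descent on mean squared error.

    Returns the iteration at which the gradient norm first drops below tol
    (or max_steps if it never does).
    """
    hessian = 2 * X.T @ X / len(X)
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()  # step size 1/L, a standard stable choice
    w = np.zeros(X.shape[1])
    for step in range(1, max_steps + 1):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        if np.linalg.norm(grad) < tol:
            return step
        w -= lr * grad
    return max_steps


raw_steps = gd_steps(X, y)
whitened_steps = gd_steps(pca_whiten(X), y)
print(f"raw features:      {raw_steps} steps")
print(f"whitened features: {whitened_steps} steps")
```

On this toy problem the whitened run should need far fewer steps, because whitening makes the loss surface close to spherical, so every direction converges at a similar rate. Real networks are not linear least squares, so treat this as intuition rather than a guarantee.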

In theory PCA makes no difference, but in practice it speeds up training, simplifies the neural structure required to represent the data, and yields systems that better characterize the “intermediate structure” of the data instead of having to account for multiple scales. The result is a more accurate model.

My guess is that analogous reasons apply to random forests, gradient boosted trees, or other similar creatures. (Link)
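Whether PCA actually helps a tree-based model is an empirical question, so the simplest check is to compare both setups under cross-validation. The sketch below assumes scikit-learn and its bundled 8x8 digits dataset (not the Kaggle data from the question). `n_components=20` is an arbitrary illustrative value, not a tuned one.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Small 8x8 digit images bundled with scikit-learn (1797 samples, 64 features).
X, y = load_digits(return_X_y=True)

tree_only = DecisionTreeClassifier(random_state=0)

# Putting PCA inside the pipeline means it is refit on each training fold,
# so no information from the held-out fold leaks into the projection.
pca_then_tree = make_pipeline(
    PCA(n_components=20, random_state=0),  # 20 is an assumption, not a tuned value
    DecisionTreeClassifier(random_state=0),
)

score_tree = cross_val_score(tree_only, X, y, cv=5).mean()
score_pca = cross_val_score(pca_then_tree, X, y, cv=5).mean()

print(f"decision tree alone: {score_tree:.3f}")
print(f"PCA + decision tree: {score_pca:.3f}")
```

Varying `n_components` and the model is the natural next step. Nothing here guarantees that the PCA variant wins on any given dataset.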

**Attribution**

*Source: Link, Question Author: Jean, Answer Author: EngrStudent*