I have a regression dataset with roughly 400 features and about 300 samples. I tried Random Forest Regression (RFR) on it and used either the out-of-bag (OOB) score or the k-fold CV score to judge its performance. The behavior I’m trying to make sense of is this: if I use RFR directly, I don’t get good performance no matter how many trees I use or how I tune the parameters. If I put a PCA step before RFR and grid-search the number of principal components (PCs), the pipeline gives a somewhat decent score at around 8 or 9 PCs. The score rises and then falls around this “optimal PC number” as I sweep the number of PCs.
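For concreteness, the PCA-then-RFR flow with a sweep over the number of PCs can be sketched with scikit-learn roughly as follows. This is only a minimal sketch: the synthetic `X` and `y`, the 8 latent factors, the grid of component counts, and the `StandardScaler` step are all assumptions for illustration, not my actual data or settings. PCA sits inside the pipeline so that it is refit on each training fold rather than on the full dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (an assumption): 300 samples and
# 400 collinear features driven by 8 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 8))
X = latent @ rng.normal(size=(8, 400)) + 0.5 * rng.normal(size=(300, 400))
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.5 * rng.normal(size=300)

# Scaling and PCA are pipeline steps, so each CV fold fits them on its
# own training split only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("rfr", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# Sweep the number of principal components and score with 5-fold CV.
search = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [2, 4, 8, 16, 32]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
search.fit(X, y)

# Mean CV score for each candidate number of components.
for n, score in zip(search.cv_results_["param_pca__n_components"],
                    search.cv_results_["mean_test_score"]):
    print(n, round(score, 3))
print("best:", search.best_params_)
```

If the chosen number of PCs is itself tuned on the same folds used to report the score, the reported score is likely to be somewhat optimistic; a nested CV or a held-out set would give a cleaner estimate.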
I’m puzzled because when I ran the same pipeline on a couple of toy datasets, adding or removing PCA did not change RFR performance much. One concern is that my dataset is very noisy, and most of the regression methods I have tried so far perform poorly, except for this PCA-RFR flow. So I’m not sure whether this is a garbage-in-garbage-out situation where PCA-RFR just somehow overfits my dataset. On the other hand, my features are quite collinear and I don’t have much data to train my model, so it kind of makes sense that PCA preprocessing could de-noise the dataset a bit and could also reduce overfitting of the training set by working with a smaller set of “reduced features”. RFR is fairly new to me, though, so I don’t know whether there is any theory behind all this.
If anyone has seen this before and has a good explanation, or knows of a reference paper on PCA-RFR behavior, please let me know. I’d be very grateful.
Using Random Forest on a dataset like the one you described has two major problems:
Random Forest does not perform well when features are monotonic transformations of other features (this makes the trees of the forest less independent of each other).
The same happens when you have more features than samples: Random Forest will probably overfit the dataset, and you will get poor out-of-bag performance.
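Both conditions can be checked directly. The sketch below is only illustrative and runs on synthetic data that is assumed to resemble the situation described (300 samples, 400 features built from 8 hypothetical latent factors). It measures how much variance the first few principal components capture, as a rough indicator of collinearity, and reads the out-of-bag score of a forest fit on the raw features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in (an assumption): more features than samples, with
# strong collinearity because 8 latent factors drive all 400 columns.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 8))
X = latent @ rng.normal(size=(8, 400)) + 0.5 * rng.normal(size=(300, 400))
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.5 * rng.normal(size=300)

# Collinearity check: cumulative share of variance explained by the
# leading principal components.
cum_var = np.cumsum(PCA().fit(X).explained_variance_ratio_)
print("variance explained by first 8 PCs:", round(cum_var[7], 3))

# Out-of-bag R^2 of a forest trained on the raw features
# (bootstrap sampling is on by default, which oob_score requires).
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB R^2 on raw features:", round(rf.oob_score_, 3))
```

When a handful of components carry most of the variance, the raw features are largely redundant, which is the case where reducing them before fitting the forest is most likely to matter.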
When using PCA, you get rid of these two problems that are lowering the performance of Random Forest:
- you reduce the number of features.
- you get rid of collinear features. (all collinear features will end up in a single PCA