I’m doing regression using Random Forests for predicting prices based on several attributes. Code is written in Python using Scikit-learn.
How do you decide whether you should transform your variables using log before using them to fit the regression model? Is it necessary when using an ensemble approach such as Random Forest?
The way Random Forests are built is invariant to monotonic transformations of the independent variables. Each split depends only on the ordering of a variable's values, and a monotonic transformation such as log preserves that ordering, so the splits will be completely analogous. If you are just aiming for accuracy, you will not see any improvement from transforming the predictors. In fact, since Random Forests are able to find complex non-linear relations (why are you calling this linear regression?) and variable interactions on the fly, if you transform your independent variables you may smooth out the information that allows the algorithm to do this properly.
Sometimes Random Forests are not treated as a black box and are used for inference instead. For example, you can interpret the variable importance measures that they provide, or calculate some sort of marginal effect of an independent variable on the dependent variable. This is usually visualized with partial dependence plots. I’m pretty sure these plots are highly influenced by the scale of the variables, which is a problem when trying to obtain information of a more descriptive nature from Random Forests. In this case it might help to transform (standardize) your variables, which could make the partial dependence plots comparable. I'm not completely sure about this and will have to think about it.
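To make the idea of a partial dependence curve concrete, here is a hand-rolled sketch. The helper `partial_dependence_curve` is a hypothetical name written for this illustration, not a library function, and the data are synthetic. For each grid value it forces one predictor to that value in every row and averages the forest's predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def partial_dependence_curve(model, X, feature, grid):
    """Average prediction when column `feature` is set to each grid value."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the predictor in every row
        curve.append(model.predict(X_mod).mean())
    return np.array(curve)


rng = np.random.default_rng(1)

# Synthetic data (illustrative only): the target grows with log(x0).
X = rng.uniform(1.0, 1000.0, size=(400, 2))
y = np.log(X[:, 0]) + 0.002 * X[:, 1] + rng.normal(scale=0.1, size=400)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Grid on the predictor's own scale, taken from its empirical quantiles.
grid = np.quantile(X[:, 0], np.linspace(0.05, 0.95, 10))
pd_curve = partial_dependence_curve(model, X, feature=0, grid=grid)

for g, p in zip(grid, pd_curve):
    print(f"x0 = {g:8.1f}   average prediction = {p:.3f}")
```

The horizontal axis of such a curve is the grid, which lives in the predictor's units. That is where the scale of a variable enters the plot, even though the fitted forest itself only uses the ordering of the values.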
Transforming the dependent variable is a different matter. Not long ago I tried to predict count data using a Random Forest. Regressing on the square root and on the natural log of the dependent variable helped a bit, but not much, and not enough to let me keep the model.
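A sketch of how regressing on the log of the dependent variable could be set up, assuming scikit-learn's `TransformedTargetRegressor`. The Poisson counts below are made up, and `log1p`/`expm1` are used instead of a plain log only so that zero counts are handled. This is not the model described above, just one way to run the same kind of comparison.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic count data (illustrative only).
X = rng.uniform(0.0, 3.0, size=(500, 2))
y = rng.poisson(lam=np.exp(0.8 * X[:, 0] + 0.3 * X[:, 1]))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Forest fitted on the raw counts.
plain = RandomForestRegressor(n_estimators=100, random_state=0)

# Forest fitted on log1p(y); predictions are mapped back with expm1.
logged = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=100, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)

plain.fit(X_train, y_train)
logged.fit(X_train, y_train)

pred_plain = plain.predict(X_test)
pred_logged = logged.predict(X_test)

mae_plain = mean_absolute_error(y_test, pred_plain)
mae_logged = mean_absolute_error(y_test, pred_logged)
print(f"MAE raw target: {mae_plain:.3f}   MAE log1p target: {mae_logged:.3f}")
```

Unlike transforming the predictors, transforming the target changes what each tree averages within its leaves, so the two models can genuinely differ. Whether the transformed version is better depends on the data and on the error metric you care about.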
Some packages with which you may use random forests for inference: