I’m training a Random Forest classification model to discriminate between 6 categories. My transactional data has 60k+ observations and 35 variables. Here’s an example of approximately what it looks like.
|user_id|acquisition_date|x_var_1|x_var_2| y_var  |
|-------|----------------|-------|-------|--------|
|111    | 2013-04-01     | 12    | US    | group1 |
|222    | 2013-04-12     | 6     | PNG   | group1 |
|333    | 2013-05-05     | 30    | DE    | group2 |
|444    | 2013-05-10     | 78    | US    | group3 |
|555    | 2013-06-15     | 15    | BR    | group1 |
|666    | 2013-06-15     | 237   | FR    | group6 |
Once the model is created, I’d like to score observations from the last few weeks.
Because the system has changed over time, the more recent observations more closely resemble the environment of the observations I’d like to predict. Hence, I want to create a weight variable so that the Random Forest places more importance on recent observations.
Does anyone know if the randomForest package in R is able to handle weights per observation?
Also, can you please suggest what is a good method for creating the weight variable? For example, as my data is from 2013, I was thinking that I can take the month number from the date as weight. Does anyone see a problem with this method?
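To make the idea concrete, here is a minimal sketch in base R of the month-number weighting I have in mind. The data frame `df` and the column name `weight` are just illustrative, and the dates are copied from the example table above; it assumes all observations fall within a single calendar year.

```r
# Toy data mirroring the example table (illustrative only)
df <- data.frame(
  user_id = c(111, 222, 333, 444, 555, 666),
  acquisition_date = as.Date(c("2013-04-01", "2013-04-12", "2013-05-05",
                               "2013-05-10", "2013-06-15", "2013-06-15"))
)

# Use the month number of the acquisition date directly as the weight,
# so later months get larger weights
df$weight <- as.integer(format(df$acquisition_date, "%m"))

print(df)
```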
Many thanks in advance!
The ranger package in R, which is relatively new, will do this. The ranger implementation of random forests has a case.weights argument that takes a vector of individual case / observation weights.
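As a rough sketch of how this might look, the example below fits a ranger forest on simulated data shaped like the table in the question. The data, the column names, and the month-number weights are assumptions for illustration only, not a recommendation for how to choose the weights. In ranger, case.weights affect how likely each observation is to be drawn into the bootstrap sample for each tree, so observations with larger weights are selected more often.

```r
library(ranger)

set.seed(42)
n <- 200

# Simulated stand-in for the real transactional data (hypothetical values)
all_days <- seq(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day")
train <- data.frame(
  acquisition_date = sample(all_days, n, replace = TRUE),
  x_var_1 = rpois(n, 20),
  x_var_2 = factor(sample(c("US", "DE", "BR", "FR"), n, replace = TRUE)),
  y_var   = factor(sample(paste0("group", 1:3), n, replace = TRUE))
)

# One non-negative weight per observation; here the month number, as proposed
w <- as.integer(format(train$acquisition_date, "%m"))

# Observations with larger weights are sampled with higher probability
fit <- ranger(
  y_var ~ x_var_1 + x_var_2,
  data = train,
  num.trees = 200,
  case.weights = w,
  seed = 42
)

# Score new observations (here, just the first few training rows)
pred <- predict(fit, data = head(train))$predictions
print(pred)
```

The weight vector must have one entry per row of the training data, in the same row order.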