Random forest on multi-level/hierarchical-structured data

I am quite new to machine learning, CART-techniques and the like, and I hope my naivete isn’t too obvious.

How does Random Forest handle multi-level/hierarchical data structures (for example when cross-level interaction is of interest)?

That is, data sets with units of analysis at several hierarchical levels (e.g., students nested within schools, with data about both the students and the schools).

Just as an example, consider a multi-level data set with individuals on the first level (e.g., with data on voting behavior, demographics etc.) nested within countries at the second level (with country-level data; e.g., population):

ID  voted  age  female  country  population
 1      1   19       1        1       53.01
 2      1   23       0        1       53.01
 3      0   43       1        1       53.01
 4      1   27       1        1       53.01
 5      0   67       0        1       53.01
 6      1   34       1        2       47.54
 7      0   54       1        2       47.54
 8      0   22       1        2       47.54
 9      0   78       0        2       47.54
10      1   52       0        2       47.54

Let's say that voted is the response (dependent) variable and the others are predictor (independent) variables. In cases like this, the marginal effect (partial dependence) of a higher-level variable (e.g., population) at different values of the individual-level variables could be very interesting. For a case like this one, a glm is of course more appropriate, but with many variables, interactions, missing values, or very large data sets, a glm is not so reliable.

Subquestions: Can Random Forest explicitly handle this type of data structure in some way? If used regardless, what kind of bias does it introduce? If Random Forest is not appropriate, is there any other ensemble-type method that is?

(The question "Random forest on grouped data" is perhaps similar, but doesn't really answer this.)

Answer

In a single classification tree, these groups are coded the same as any other categorical variable, usually either as binary indicator (dummy) variables or as a single integer code. There are arguments for either. In a random forest with binary coding, only a random subset of the predictors is considered as the trees are grown, so some group indicators will be included and others excluded for any given tree. You may have an indicator for country_2 but not country_3. If you leave the group variable as an integer, then the arbitrary ordering of the codes can affect the outcome as well. What does it mean for country > 5 and country < 12? How does that change if you randomly relabel the countries with new integers?
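The relabeling question can be made concrete with a minimal sketch. It assumes four hypothetical countries (A to D) and two arbitrary integer labelings, neither of which comes from the question's data, and lists which groupings a single threshold split of the form `code <= t` can produce under each labeling.

```python
def threshold_partitions(labels):
    """Return the left-hand groups reachable by one split `code <= t`."""
    # Order the countries by their integer code, then cut between neighbours.
    ordered = sorted(labels, key=labels.get)
    return {frozenset(ordered[:i]) for i in range(1, len(ordered))}

# Two hypothetical, equally arbitrary integer codings of the same countries.
labeling_a = {"A": 1, "B": 2, "C": 3, "D": 4}
labeling_b = {"A": 3, "B": 1, "C": 4, "D": 2}

parts_a = threshold_partitions(labeling_a)
parts_b = threshold_partitions(labeling_b)

for name, parts in (("labeling_a", parts_a), ("labeling_b", parts_b)):
    print(name, sorted(sorted(p) for p in parts))
```

Under the first labeling a single split can separate {A, B} from {C, D}; under the second it cannot, so the trees that can be grown depend on codes that carry no meaning.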

At each step in growing a tree, the algorithm looks for the split that optimizes the splitting criterion. If there are large differences between groups, then the grouping variable will be important, but if it is only moderately important and you prune the tree, then the variable may essentially be excluded.

Like most other machine learning algorithms, CART and random forests do not necessarily account for dependency between observations within groups the way you would expect in a hierarchical regression model. If there is dependency between observations, it should be captured by the random forest algorithm through the generation of many trees that use the grouping variable. However, if other variables discriminate better, then the grouping variable may be ignored.
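As a rough illustration of how a grouping variable can lose out to a more discriminating predictor, here is a minimal sketch using the ten rows from the question. It assumes Gini impurity as the splitting criterion (a common choice in CART, but not the only one), and the age threshold of 34 is hand-picked for illustration rather than found by a search.

```python
# Rows from the question: (voted, age, female, country).
rows = [
    (1, 19, 1, 1), (1, 23, 0, 1), (0, 43, 1, 1), (1, 27, 1, 1), (0, 67, 0, 1),
    (1, 34, 1, 2), (0, 54, 1, 2), (0, 22, 1, 2), (0, 78, 0, 2), (1, 52, 0, 2),
]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def impurity_decrease(rows, goes_left):
    """Drop in weighted Gini impurity when rows are split by goes_left(row)."""
    parent = [r[0] for r in rows]
    left = [r[0] for r in rows if goes_left(r)]
    right = [r[0] for r in rows if not goes_left(r)]
    n = len(rows)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

gain_country = impurity_decrease(rows, lambda r: r[3] == 1)  # country 1 vs 2
gain_female = impurity_decrease(rows, lambda r: r[2] == 1)   # female vs not
gain_age = impurity_decrease(rows, lambda r: r[1] <= 34)     # assumed cut-off

print(gain_country, gain_female, gain_age)
```

On these ten rows the age split reduces impurity far more than the country split (about 0.18 versus 0.02), so a tree choosing greedily among these three candidates would split on age first and might never use country.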

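In your case, country and population are perfectly collinear: population is constant within each country, so it carries exactly the same information as the country identifier. There is no information gained by using both variables in your model. So you can think about how a random forest model would treat these variables in your data.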
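The redundancy can be checked directly on the example data. The sketch below uses only the country and population columns from the question; the population cut-off of 50 is an assumption chosen to fall between the two observed values.

```python
from collections import defaultdict

# (country, population) pairs from the question's ten rows.
pairs = [(1, 53.01)] * 5 + [(2, 47.54)] * 5

pops_by_country = defaultdict(set)
countries_by_pop = defaultdict(set)
for country, population in pairs:
    pops_by_country[country].add(population)
    countries_by_pop[population].add(country)

# Each country maps to exactly one population value, and vice versa.
one_to_one = (
    all(len(v) == 1 for v in pops_by_country.values())
    and all(len(v) == 1 for v in countries_by_pop.values())
)

# A split on country and a split on population send every row the same way.
split_on_country = [country == 1 for country, _ in pairs]
split_on_population = [population > 50 for _, population in pairs]
same_partition = split_on_country == split_on_population

print(one_to_one, same_partition)
```

Because the two splits partition the rows identically, a tree gains nothing from having both variables available; which one it ends up using is essentially arbitrary.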

Attribution
Source: Link, Question Author: Mikael Poul Johannesson, Answer Author: Ellis Valentiner
