Does rpart use multivariate splits by default?

I know that R’s rpart function keeps the data it would need to implement multivariate splits, but I don’t know whether it actually performs them. I’ve searched online and read the rpart docs, but I can’t find anything saying that it can or does. Does anyone know for sure?

Answer

rpart only provides univariate splits. Based on your question, I suspect you are not entirely familiar with the difference between univariate and multivariate partitioning methods. I have done my best to explain it below, to provide some references for further reading, and to suggest some R packages that implement these methods.

rpart builds classification and regression trees by recursive partitioning. With partitioning methods you must define the points within your data at which a split is to be made. At each node, rpart finds the variable and the cut point that most reduce the lack of fit (the residual sum of squares for a regression tree, or an impurity measure such as the Gini index for a classification tree). Because each split happens along one variable at a time, these are univariate splits. For example, the first node might split on Age > 35, the second on Income > 25,000, and the third on cities west of the Mississippi. The second and third nodes are split on smaller subsets of the data: in the second node, the income criterion is the best split only for people older than 35, and it does not apply to observations outside that node. The same holds for the cities criterion. A multivariate split, by contrast, is typically defined as a simultaneous partitioning along multiple axes (hence multivariate).

One could continue splitting until there is a node for each observation in the dataset. rpart stops earlier by requiring a minimum node size before a split is attempted (minsplit), a minimum terminal node size (minbucket), and a complexity parameter cp, the minimum factor by which a split must improve the overall fit (for regression trees, the minimum increase in R-squared) for the fitting to continue.
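As a rough illustration of the univariate behaviour described above, here is a minimal sketch using the kyphosis data set that ships with rpart. The minimum node size, minimum bucket size, and cp settings are passed through rpart.control() as minsplit, minbucket, and cp; the values shown are rpart's documented defaults, written out only to make them visible. The choice of data set and formula is mine, not something from the question.

```r
library(rpart)

# kyphosis ships with rpart: a factor response (Kyphosis) and three
# numeric predictors (Age, Number, Start).
fit <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,
  method  = "class",  # classification tree; use "anova" for regression
  control = rpart.control(minsplit = 20, minbucket = 7, cp = 0.01)
)

# Each printed line is one node, with the single-variable rule that
# defines it.
print(fit)

# fit$frame has one row per node; the 'var' column names the one
# variable used to split that node ("<leaf>" marks terminal nodes).
split_vars <- as.character(fit$frame$var)
print(table(split_vars))
```

Every entry in split_vars is either a single predictor name or "<leaf>", which is what "univariate split" means in practice: no node is ever defined by a joint condition on two variables at once.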

A multivariate method, such as the Patient Rule Induction Method (the prim package in R), would split simultaneously by selecting, for example, all observations where Income > 22,000, Age > 32, and the city is west of Atlanta. The fit can differ because it is calculated jointly rather than one variable at a time: the fit of these three criteria is computed on all observations that meet all three at once, rather than by iteratively partitioning on univariate splits (as with rpart).
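To make the idea of a box defined by several simultaneous conditions more concrete, here is a heavily simplified sketch of PRIM-style "peeling" in base R. It is not the prim package's implementation and leaves out PRIM's pasting and cross-validation steps. The function peel_box, its arguments (alpha, min_support, min_gain), and the simulated age/income data are all hypothetical names and values chosen for illustration.

```r
set.seed(1)

# Simulated data (an assumption for illustration): the response is high
# only where BOTH conditions hold at once.
n   <- 500
dat <- data.frame(age    = runif(n, 18, 70),
                  income = runif(n, 10000, 60000))
dat$y <- as.numeric(dat$age > 32 & dat$income > 22000) + rnorm(n, sd = 0.1)

# Greedy peeling: repeatedly trim a fraction `alpha` of the remaining
# points from one edge of one variable, choosing the trim that raises the
# mean response inside the box the most. Stop when the best gain falls
# below `min_gain` or when the box would hold less than `min_support` of
# the data.
peel_box <- function(x, y, alpha = 0.05, min_support = 0.3, min_gain = 0.01) {
  inside <- rep(TRUE, nrow(x))
  repeat {
    best <- NULL
    for (v in names(x)) {
      xv <- x[[v]][inside]
      candidates <- list(
        inside & x[[v]] >= quantile(xv, alpha, names = FALSE),     # trim low edge
        inside & x[[v]] <= quantile(xv, 1 - alpha, names = FALSE)  # trim high edge
      )
      for (keep in candidates) {
        if (mean(keep) < min_support) next    # box would be too small
        if (sum(keep) == sum(inside)) next    # nothing was removed
        m <- mean(y[keep])
        if (is.null(best) || m > best$mean) best <- list(keep = keep, mean = m)
      }
    }
    if (is.null(best) || best$mean - mean(y[inside]) < min_gain) break
    inside <- best$keep
  }
  limits <- sapply(x[inside, , drop = FALSE], range)
  rownames(limits) <- c("lower", "upper")
  list(inside = inside, limits = limits, mean = mean(y[inside]))
}

res <- peel_box(dat[c("age", "income")], dat$y)
print(res$limits)  # one box, bounded on both variables simultaneously
print(res$mean)    # mean response inside the box
```

The result is a single region described by joint limits on age and income, evaluated on all observations that satisfy every limit together, in contrast to a tree's nested sequence of one-variable rules.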

Opinions differ on the effectiveness of univariate versus multivariate partitioning methods. In practice, I have generally seen that most people prefer univariate partitioning (such as rpart) for explanatory purposes. On its own it is used for prediction only when the structure of the problem is very well defined and the variation among the variables is fairly constant, which is why these trees are often used in medicine. For prediction, univariate tree models are typically combined with ensemble learners (e.g. a random forest). People who use multivariate partitioning or clustering (which is very closely related to multivariate partitioning) often do so for complex problems that univariate methods fit very poorly, mainly for prediction or to group observations into categories.

I highly recommend Julian Faraway’s book Extending the Linear Model with R. Chapter 13 is dedicated entirely to trees (all univariate). If you want to go further into multivariate methods, Elements of Statistical Learning by Hastie et al. provides an excellent overview of many of them, including PRIM (although Friedman at Stanford has his original article on the method posted on his website), as well as clustering methods.

As for R packages that implement these methods, I believe you’re already using the rpart package, and I’ve mentioned the prim package above. There are various built-in clustering routines, and I am quite fond of the party package mentioned by another person in this thread, because of its use of conditional inference in the tree-building process. The optpart package lets you perform multivariate partitioning, and the mvpart package (also mentioned by someone else) fits rpart-style trees to multivariate responses. When I feel rpart and party are not adequate for my modeling purposes, however, I personally prefer partDSA, which lets you combine nodes further down in your tree to help prevent similar observations from being partitioned apart.

Note: In my rpart example in paragraph 2, I describe the partitioning in terms of node numbers. If you drew that tree, the partitioning would proceed to the left when the split rule is true. rpart’s own plots follow the same convention: the condition printed at a split describes the observations sent to the left branch.

Attribution
Source : Link , Question Author : chubbsondubs , Answer Author : Adam
