I’m fairly new to data mining/machine learning and have been reading about ways to combine multiple models, and multiple runs of the same model, to improve predictions.
My impression from reading a couple of papers (which are often interesting and great on theory and Greek letters, but short on code and actual examples) is that it’s supposed to go like this:
I take a model (RF, etc.) and get a list of classifier outputs (probabilities between 0 and 1). My question is: how do I combine these lists of predictions? Do I run the same models on my training set so that the number of columns going into the final model is the same, or is there some other trick?
It would be great if any suggestions/examples included R code.
NOTE: This is for a data set with 100k rows in the training set, 70k rows in the test set, and 10 columns.
It actually boils down to one of the “3B” techniques: bagging, boosting or blending.
In bagging, you train a lot of classifiers on different subsets of the objects and combine their answers by averaging for regression and by voting for classification (there are other options for more complex situations, but I’ll skip them). The vote proportion/variance can be interpreted as an error approximation, since the individual classifiers are usually considered independent. RF is in fact a bagging ensemble; a minimal hand-rolled sketch follows.
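Since you asked for R: here is a minimal bagging sketch using rpart trees. It assumes a training frame train with a binary factor outcome y and a test frame test with the same predictors (these names are illustrative, not from your data):

    library(rpart)

    B <- 25
    preds <- replicate(B, {
      idx <- sample(nrow(train), replace = TRUE)        # bootstrap sample of objects
      fit <- rpart(y ~ ., data = train[idx, ], method = "class")
      predict(fit, newdata = test, type = "prob")[, 2]  # P(positive class)
    })
    bagged <- rowMeans(preds)  # averaging the votes gives the bagged probability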
Boosting is a wider family of methods; their main point is that you build the next classifier on the residuals of the former, in this way (in theory) gradually increasing accuracy by highlighting more and more subtle interactions. The predictions are thus usually combined by summing them up, something like calculating the value of a function at x by summing the terms of its Taylor series at x.
The most popular versions are (Stochastic) Gradient Boosting (with a nice mathematical foundation) and AdaBoost (well known, and in fact a specific case of GB). From a holistic perspective, a decision tree can be viewed as a boosting of trivial pivot classifiers.
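In R, the gbm package implements Stochastic Gradient Boosting directly. A sketch under the same assumed train/test frames, here with y coded 0/1 as the bernoulli distribution requires:

    library(gbm)

    fit <- gbm(y ~ ., data = train, distribution = "bernoulli",
               n.trees = 500, interaction.depth = 3, shrinkage = 0.05,
               bag.fraction = 0.5)  # bag.fraction < 1 is what makes it "stochastic"

    # The prediction sums the contributions of all trees and maps the
    # result back to a probability:
    p <- predict(fit, newdata = test, n.trees = 500, type = "response")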
Blending is the idea of nesting classifiers, i.e. running one classifier on an information system made of the predictions of other classifiers. As such, it is a very flexible method and certainly not a single defined algorithm; it may require a lot of objects, since in most cases the “blender” classifier must be trained on a set of objects which were not used to build the partial classifiers, to avoid embarrassing overfitting.
The predictions of the partial classifiers are combined by assembling them into a new information system, one column per partial classifier, built the same way for both the training and the test objects (which is exactly why your column counts must match); the blender is then trained on, and predicts from, that system.
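A sketch of this in R, again assuming the illustrative train/test frames with a binary factor y. Here the partial classifiers are a random forest and a single tree, and the blender is plain logistic regression:

    library(randomForest)
    library(rpart)

    half <- sample(nrow(train), floor(nrow(train) / 2))
    base <- train[half, ]   # fits the partial classifiers
    hold <- train[-half, ]  # held out to train the blender

    rf <- randomForest(y ~ ., data = base)
    tr <- rpart(y ~ ., data = base, method = "class")

    # The new information system: one column per partial classifier
    blend_train <- data.frame(
      p_rf = predict(rf, newdata = hold, type = "prob")[, 2],
      p_tr = predict(tr, newdata = hold, type = "prob")[, 2],
      y    = hold$y
    )
    blender <- glm(y ~ p_rf + p_tr, data = blend_train, family = binomial)

    # At prediction time, build the same columns from the test set
    blend_test <- data.frame(
      p_rf = predict(rf, newdata = test, type = "prob")[, 2],
      p_tr = predict(tr, newdata = test, type = "prob")[, 2]
    )
    p <- predict(blender, newdata = blend_test, type = "response")

Training the blender on hold rather than base is the overfitting guard mentioned above.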