AdaBoost is an ensemble method that combines many weak learners to form a strong one. All of the examples of AdaBoost that I have read use decision stumps/trees as weak learners. Can I use different weak learners in AdaBoost? For example, how would I implement AdaBoost (or boosting generally) to boost a logistic regression model?
One main difference between classification trees and logistic regression is that the former outputs classes (-1, 1) while logistic regression outputs probabilities. One idea is to choose the best feature X from a set of features, pick a threshold (0.5?) to convert the probabilities to classes, and then use a weighted logistic regression to find the next feature, and so on.
But I imagine there exists a general algorithm for boosting weak learners other than decision stumps, one that outputs probabilities. I believed that LogitBoost was the answer to my question, but I tried to read the "Additive Logistic Regression" paper and got stuck in the middle.
Don’t confuse the handling of the predictors (via base learners, e.g. stumps) with the handling of the loss function in boosting. Although AdaBoost can be thought of as finding combinations of base learners to minimize misclassification error, the “Additive Logistic Regression” paper you cite shows that it can also be formulated as minimizing an exponential loss function. This insight opened up the boosting approach to a wide class of machine-learning problems that minimize differentiable loss functions, via gradient boosting. The residuals that are fit at each step are pseudo-residuals calculated from the gradient of the loss function. Even if the predictors are modeled as binary stumps, the output of the model thus need not be a binary choice.
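To make the pseudo-residual idea concrete, here is a minimal from-scratch sketch of gradient boosting with log-loss: the negative gradient of the log-loss with respect to the current log-odds predictions is `y - p`, and each round fits a regression stump to those pseudo-residuals. (A real implementation would also do a Newton step or line search for the leaf values; fitting the raw residuals by least squares is a simplification.)

```python
# Gradient boosting for the logistic (log-loss) case, simplified.
# Pseudo-residuals = negative gradient of log-loss w.r.t. log-odds F,
# which works out to y - p for y in {0, 1}.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=500, random_state=0)  # y in {0, 1}

F = np.zeros(len(y))           # current log-odds predictions
learning_rate = 0.1
for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-F))   # current probabilities
    residuals = y - p               # pseudo-residuals (negative gradient)
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    F += learning_rate * stump.predict(X)

# The base learners are binary stumps, yet the model outputs probabilities.
p_final = 1.0 / (1.0 + np.exp(-F))
print(((p_final > 0.5) == y).mean())
```

Note that nothing here thresholds the weak learners’ outputs: the stumps fit continuous residuals, and the ensemble accumulates log-odds.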
As another answer states, linear base learners might not work for boosting, but linear base learners are not required for “boosted regression” in either the standard or the logistic sense. Decidedly non-linear stumps can be combined as slow base learners to minimize appropriate loss functions. It’s still called “boosted regression” even though it is far from a standard regression model linear in the coefficients of the predictors. The loss function can be functionally the same for linear models and “boosted regression” models with stumps or trees as predictors. Chapter 8 of ISLR makes this pretty clear.
So if you want a logistic-regression equivalent to boosted regression, focus on the loss function rather than on the base learners. That’s what the LogitBoost approach in the paper you cite does: minimize a log-loss rather than the exponential loss implicit in AdaBoost. The Wikipedia AdaBoost page describes this difference.
Many participants on this site would argue that a log-odds/probability-based prediction is highly preferable to a strict yes/no classification, as the former more readily allows for different tradeoffs between the costs of false-positive and false-negative predictions. As the answer to your related question indicates, it is possible to obtain estimated probabilities from the strong classifier derived from AdaBoost, but LogitBoost may well give better performance.
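For instance, sklearn's `AdaBoostClassifier` exposes both hard classes and estimated probabilities (the probabilities are derived from the weighted votes of the base learners, so take them with the caveats discussed in the linked question):

```python
# Hard classes vs. estimated probabilities from a fitted AdaBoost model.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(clf.predict(X[:3]))        # hard class labels
print(clf.predict_proba(X[:3]))  # estimated class probabilities, rows sum to 1
```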
Implementations of gradient boosting for classification can provide information on the underlying probabilities. For example, this page on gradient boosting shows how the sklearn code allows a choice between a deviance (log) loss for logistic regression and an exponential loss for AdaBoost, and documents functions to predict probabilities from the gradient-boosted model.