How does gradient boosting calculate probability estimates?

I have been trying to understand gradient boosting reading various blogs, websites and trying to find my answer by looking through for example the XGBoost source code. However, I cannot seem to find an understandable explanation of how gradient boosting algorithms produce probability estimates. So, how do they calculate the probabilities?

Answer

TL;DR: The log-odds for a sample is the sum of the weights of its terminal leafs. The probability of the sample belonging to class 1 is the inverse-logit transformation of the sum.


Analogously to logistic regression, the logistic function computes probabilities that are linear on the logit scale:

$$
z = Xw \\
\mathbb{P}(y=1|X) = \frac{1}{1 + \exp(-z)}
$$

Unlike logistic regression, the “features” in $X$ are constructed as the terminal nodes of an ensemble of decision trees using the boosting procedure. Each row of $X$ collects the terminal leafs for each sample; the row is a $T$-hot binary vector, for $T$ the number of trees. (Each XGBoost tree is generated according to a particular algorithm, but that’s not relevant here.)

There are $n$ columns in $X$, one column for each terminal node. There is no expression for the total number of terminal nodes, because the number of nodes can vary between trees (and usually does, in my experience).

Each leaf in the tree has an associated “weight.” That weight is recorded in $w$. To be conformable with $X$, there are $n$ elements in $w$. The weights themselves are derived from the gradient boosting procedure; see: In XGboost are weights estimated for each sample and then averaged

Attribution
Source : Link , Question Author : Icyeval , Answer Author : Sycorax

Leave a Comment