Why does XGBoost have a learning rate?

Original Question

Having used XGBoost a fair bit, clearly changing the learning rate dramatically affects the algorithm’s performance. That said, I really can’t understand the theoretical justification for it. It makes sense in “vanilla” gradient boosting, when you don’t make use of second derivatives. Analogously, one does not need a learning rate if using Newton-Raphson to perform optimisation by finding the zeros of the derivative of the cost function.

I thought it might have to do with ensuring that the updates one makes at every step are small, so that the gradient expansion to second order is valid, but it seems to me that one could achieve the same thing more effectively by regularising?

Also, the XGBoost docs have a theoretical introduction to XGBoost and don’t mention a learning rate anywhere (https://xgboost.readthedocs.io/en/latest/tutorials/model.html)

Is it as simple as "it is experimentally observed to improve performance", and if so, is it possible to rationalise it post hoc?

Update: Almost a year on, I thought I'd update my thinking on this and somewhat refine my question.

While it might be the case that the need for a learning rate was ascertained experimentally, it seems likely to me that the reason it is necessary is that XGBoost assumes that the total loss L of a classifier consisting of an existing classifier Ft(x) plus a new classifier ft+1(x) can be written as a Taylor expansion of L about Ft(x), which requires ft+1(x) to represent a "small enough" correction to Ft(x) that we don't need to expand to too high an order.

My suspicion for a while has been that using lots of regularisation should take care of this, so why use a learning rate at all? Another approach could be to say that the tree ft+1(x), which splits the space into a number of distinct regions (terminal nodes) {Rj}, outputs a constant ϵwj in the jth region. By choosing a sufficiently small ϵ, we can ensure that ϵwj will be sufficiently small for any partitioning and any j.

It turns out, however, that if you follow the derivation in the XGBoost docs but take this approach and use no regularisation, the weight wj you should assign to the region Rj is given by

$$ w_j = -\frac{1}{\epsilon} \cdot \frac{\sum_{i \in R_j} \partial_{\hat{y}_i} \ell(y_i, \hat{y}_i)}{\sum_{i \in R_j} \partial^2_{\hat{y}_i} \ell(y_i, \hat{y}_i)} $$

in which $L[F_t(x) + f_{t+1}(x)] = \sum_{i=1}^{N} \ell(y_i, \hat{y}_i) = \sum_{i=1}^{N} \ell(y_i, F_t(x_i) + f_{t+1}(x_i))$, and the derivatives are evaluated at $\hat{y}_i = F_t(x_i)$.

In other words, if you state that the output of each tree at each leaf will be a constant wj multiplied by a very small number ϵ, small enough to ensure that the product is always small, wj will simply compensate, so the smaller you make ϵ, the larger you make wj, and the product remains unchanged. Crucially, the product will not necessarily be “small enough” for the Taylor series to quickly converge and justify the second order expansion. However, if a little bit of regularisation is used, enough to stop wj becoming infinite and thus ensure the product is always small, then you’re good.
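This compensation can be checked numerically. Below is a minimal sketch (made-up gradients and Hessians for one leaf, not XGBoost itself): minimising the unregularised second-order objective over w shows that the leaf output ϵwj is identical no matter how small ϵ is made.

```python
import numpy as np

# Hypothetical gradients and Hessians for the instances in one leaf region R_j
# (e.g. squared-error loss: g_i = F_t(x_i) - y_i, h_i = 1).
g = np.array([0.8, -0.3, 1.1, 0.5])
h = np.ones_like(g)

for eps in [1.0, 0.1, 0.01]:
    # Minimise sum_i [g_i*(eps*w) + 0.5*h_i*(eps*w)^2] over w, no regularisation:
    w = -g.sum() / (eps * h.sum())
    print(f"eps={eps:5.2f}  w_j={w:9.3f}  leaf output eps*w_j={eps * w:.4f}")
```

The printed leaf output is -Σg/Σh in every case: shrinking ϵ merely inflates wj by the same factor.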

In essence, you have two approaches:

  1. Set λ to be "very large"; this will ensure that wj is small, and thus that the expansion is valid
  2. Use a learning rate parameter ϵ, together with some small amount of regularisation, to ensure that wj cannot be arbitrarily large

They sound the same at a phenomenological level, but let's investigate the wj they imply. Using approach 1, with no learning rate, we obtain (as in the XGBoost docs, linked above)

$$ w_j = -\frac{\sum_{i \in R_j} \partial_{\hat{y}_i} \ell(y_i, \hat{y}_i)}{\sum_{i \in R_j} \partial^2_{\hat{y}_i} \ell(y_i, \hat{y}_i) + \lambda} $$
whereas if we use a learning rate as well, we obtain

$$ w_j = -\frac{\epsilon \sum_{i \in R_j} \partial_{\hat{y}_i} \ell(y_i, \hat{y}_i)}{\epsilon^2 \sum_{i \in R_j} \partial^2_{\hat{y}_i} \ell(y_i, \hat{y}_i) + \lambda} $$
They look very similar, and in both cases, as you up the amount of regularisation by increasing λ, the curvature term becomes less relevant. In the case when you have a learning rate, you can get this effect either by increasing λ or decreasing ϵ.
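To make the comparison concrete, here is a small numerical sketch with hypothetical per-leaf sums (not XGBoost itself). Approach 1 uses the leaf value -Σg/(Σh+λ); approach 2 uses the contribution ϵwj obtained by minimising the second-order objective in ϵw with a λ penalty on w:

```python
g_sum, h_sum = 2.1, 4.0  # hypothetical per-leaf sums of gradients and Hessians

def leaf_value_no_lr(lam):
    # Approach 1: optimal leaf value -sum(g) / (sum(h) + lambda)
    return -g_sum / (h_sum + lam)

def leaf_value_with_lr(eps, lam):
    # Approach 2: the tree contributes eps*w_j, with
    # w_j = -eps*sum(g) / (eps^2*sum(h) + lambda)
    return -eps**2 * g_sum / (eps**2 * h_sum + lam)

for lam in [0.0, 1.0, 10.0]:
    print(f"lambda={lam:5.1f}  "
          f"approach 1: {leaf_value_no_lr(lam):+.4f}  "
          f"approach 2 (eps=0.3): {leaf_value_with_lr(0.3, lam):+.4f}")
```

With ϵ = 1 and λ = 0 the two coincide; increasing λ (or decreasing ϵ in approach 2) shrinks the leaf contribution and makes the curvature term matter less, as described above.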

No matter which way I think about the problem, the two approaches seem conceptually the same, but they do give slightly different solutions. Furthermore, in practice, the learning rate is perhaps the most important hyperparameter to tune in XGBoost, although I haven't seen anybody explore whether similarly good results could be obtained by tuning the regularisation parameter instead. In particular, is there something I'm missing that jumps out from these two equations?

Another Update: Another Year On

Thanks to Andreas for his answer below, which got me onto this.

Because the loss function is assumed to be approximated by a function quadratic in wj, which is valid if wj is small, it will only have one minimum (assuming we're doing loss minimisation). Thus the loss evaluated at ϵwj will be greater than the loss evaluated at wj, but less than the loss evaluated at wj=0; in other words, by updating your prediction by ϵwj, you are guaranteed to decrease your training loss. If ϵ is very small, this process happens very slowly, but if ϵ is too large, the Taylor series might not be valid. The key point here is that it's not about finding the optimal wj; it's about finding a wj that guarantees the training loss decreases at every iteration.
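This can be sanity-checked on the quadratic itself. A tiny sketch, using hypothetical per-leaf sums: the second-order approximation is an upward-opening parabola through zero at w = 0, so evaluating it at ϵw* for any ϵ in (0, 1] gives a value strictly between the minimum and zero.

```python
g_sum, h_sum = 2.1, 4.0  # hypothetical per-leaf sums of gradients and Hessians

def quad_loss(w):
    # Second-order (Taylor) approximation of the change in loss
    # when the leaf outputs the value w
    return g_sum * w + 0.5 * h_sum * w**2

w_star = -g_sum / h_sum  # minimiser of the quadratic
for eps in [0.0, 0.25, 0.5, 1.0]:
    print(f"eps={eps:4.2f}  approx. loss change = {quad_loss(eps * w_star):+.4f}")
```

The printed values decrease monotonically from 0 (at ϵ = 0) to the minimum (at ϵ = 1), so any shrunken step still lowers the approximated training loss.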

I think the logic must go something like this, but it can't quite be this. I agree that if we know wj, then ϵwj will also decrease the training loss, but the logic seems circular to me: if we actually knew wj, then while we could multiply by ϵ, why would we?

Conversely, if we want the optimal wj subject to the assumption that wj is sufficiently small, it doesn't seem correct to find the optimal wj assuming that wj is small, discover that it isn't small, and then multiply it by a small number to make it small.


"In particular, is there something I'm missing that jumps out from these two equations?"

From what I’ve looked at in Friedman’s paper, the ‘learning rate’ ϵ (there, called ‘shrinkage’ and denoted by ν) is applied after choosing those weights wj which minimise the cost function. That is, we determine the boost’s optimal weights, wj first, and only then do we consider multiplying by ϵ.

What would this mean?

This would mean that neither of the equations in the question that feature both ϵ and wj is used in the XGBoost algorithm.

Also, λ is still necessary in order to guarantee the validity of the Taylor expansion, and it has a non-uniform effect on the wj, its effect depending on the partial derivatives of the loss, as you wrote before:

$$ w_j = -\frac{\sum_{i \in R_j} \partial_{\hat{y}_i} \ell(y_i, \hat{y}_i)}{\sum_{i \in R_j} \partial^2_{\hat{y}_i} \ell(y_i, \hat{y}_i) + \lambda} $$

The learning rate doesn’t come in until after this point, when, having determined the optimal weights of the new tree {wj}Tj=1, we decide that, actually, we don’t want to add what we’ve just deemed to be the ‘optimal boost’ straight-up, but instead, update our additive predictor Ft by adding a scaled version of ft+1: scaling each weight wj uniformly by ϵ, and thus scaling the contribution of the whole of ft+1 by ϵ, too.
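The order of operations described here can be sketched in a toy boosting loop; this is an illustration using squared-error loss and a fixed binning of x standing in for a fitted tree, not the actual XGBoost implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=200)

eps = 0.1               # learning rate (Friedman's shrinkage nu)
F = np.zeros_like(y)    # additive predictor F_t, started at 0
bins = np.digitize(x, np.linspace(-1.0, 1.0, 9))  # fixed toy partition {R_j}

for t in range(200):
    # Squared-error loss: g_i = F(x_i) - y_i, h_i = 1
    g, h = F - y, np.ones_like(y)
    f = np.zeros_like(y)
    for j in np.unique(bins):
        in_leaf = bins == j
        # 1) choose the *optimal* (unscaled) leaf weight first
        f[in_leaf] = -g[in_leaf].sum() / h[in_leaf].sum()
    # 2) only then scale the whole boost uniformly by eps when updating F
    F += eps * f

print("training MSE:", np.mean((F - y) ** 2))
```

Each boost is fit to optimality on its own terms, and the shrinkage enters only at the update step, exactly the separation described above.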

From where I'm sitting, there is some (weak-ish) analogy with the learning rate in gradient descent optimisation: gently aggregating the predictors in order to iterate towards what we believe a general and descriptive predictor to be, while maintaining control over how fast we get there. In contrast, a high learning rate means that we use up all of our predictive power relatively quickly. If we do so too quickly, with too few trees, then any subsequent boost may need to make large corrections, causing the loss to plateau at a relatively high value, after which the algorithm terminates within a few steps.

Keeping a lower learning rate would aid generalisability, because we rely less upon the predictions of the new boosting tree and instead permit subsequent boosts to have more predictive power. It will mean that we need more boosts and that training will take longer to terminate, in line with the empirical results shown in @Sycorax's answer.

In summary:

My understanding is that:

  1. λ is used when regularising the weights {wj} and to justify the 2nd order truncation of the loss function’s Taylor expansion, enabling us to find the ‘optimal’ weights {wj}. This has a non-uniform effect on each of the weights wj.

  2. ϵ is used only after determination of the optimal weights wj and applied by scaling all of the weights uniformly to give ϵwj.

Source: Link, Question Author: gazza89, Answer Author: montol
