The follow-the-regularized-leader proximal (FTRL-Proximal) algorithm uses this update step:

$$x_{t+1} = \arg\min_x \left( g_{1:t} \cdot x + \frac{1}{2} \sum_{s=1}^{t} \sigma_s \|x - x_s\|_2^2 + \lambda_1 \|x\|_1 \right)$$

where $g_{1:t} = \sum_{s=1}^{t} g_s$.
We are on round $t+1$, having already seen $t$ data points.
$g_s$ is the gradient computed on the $s$-th sample.
$\sigma_s$ is a non-increasing learning-rate parameter, defined so that $\sum_{s=1}^{t} \sigma_s = \sqrt{t}$.
And finally, $\lambda_1$ is a regularization term.
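To make the objective concrete, here is a one-dimensional numeric sketch (all numbers are made up): it minimizes the three-term objective by brute force on a grid, and checks the result against the soft-thresholding closed form that such a quadratic-plus-L1 objective admits.

```python
import numpy as np

# Hypothetical 1-D example of the FTRL-Proximal objective
#   f(x) = g_{1:t} * x + 0.5 * sum_s sigma_s * (x - x_s)^2 + lambda_1 * |x|
# minimized by brute force, then compared with its closed form.

g = np.array([0.8, -0.3, 1.1])        # gradients g_s from the first t=3 rounds (made up)
x_past = np.array([0.0, -0.4, -0.7])  # past iterates x_s (made up)
t = len(g)
# sigma_s chosen so that the cumulative sum over s=1..t is sqrt(t)
sigma = np.sqrt(np.arange(1, t + 1)) - np.sqrt(np.arange(t))
lam1 = 0.5

def objective(x):
    return g.sum() * x + 0.5 * np.sum(sigma * (x - x_past) ** 2) + lam1 * np.abs(x)

# Brute-force minimizer on a fine grid
grid = np.linspace(-3.0, 3.0, 600_001)
vals = (g.sum() * grid
        + 0.5 * (sigma[:, None] * (grid[None, :] - x_past[:, None]) ** 2).sum(axis=0)
        + lam1 * np.abs(grid))
x_grid = grid[np.argmin(vals)]

# Closed form: soft-threshold z = g_{1:t} - sum_s sigma_s * x_s by lambda_1
z = g.sum() - np.sum(sigma * x_past)
x_closed = 0.0 if abs(z) <= lam1 else -(z - np.sign(z) * lam1) / sigma.sum()

print(x_grid, x_closed)  # the two minimizers agree up to grid resolution
```

Note how the L1 term can snap the minimizer exactly to zero when the accumulated signal $z$ is small: that is the sparsity-inducing part of the objective.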
Can you give a geometric/physical/other simple intuition for what we are doing with the first two terms? Does the first one represent some kind of momentum? Does the second one require that our new location be different from our previous locations?
Please be patient if this seems to you like an attempt at over-simplifying a heavy theory…
The paper shows that the simple gradient descent update rule can be written in a very similar way to the above rule.
The intuitive update rule of FOBOS (a gradient-descent variant) is:

$$x_{t+1} = \arg\min_x \left( g_t \cdot x + \frac{1}{2\mu_t} \|x - x_t\|_2^2 \right)$$
- $g_t$ is the gradient for the previous sample $t$ – we want to move in the direction that decreases the loss of our hypothesis on that sample.
- However, we don’t want to change our hypothesis $x_t$ too much (for fear of predicting badly on examples we have already seen). $\mu_t$ is a step size for this sample, and it should make each step more conservative.
We can find where the derivative is 0 and get an explicit update rule:

$$x_{t+1} = x_t - \mu_t g_t$$
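A quick numeric sanity check of that closed form (with made-up numbers): the point $x_t - \mu_t g_t$ really does minimize the quadratic objective, beating any nearby perturbation.

```python
import numpy as np

# Sketch: the objective  g_t . x + (1 / (2 * mu_t)) * ||x - x_t||^2
# is minimized exactly at  x_{t+1} = x_t - mu_t * g_t.

rng = np.random.default_rng(0)
x_t = rng.normal(size=4)   # current hypothesis (made up)
g_t = rng.normal(size=4)   # gradient on sample t (made up)
mu_t = 0.3                 # step size (made up)

def objective(x):
    return g_t @ x + np.sum((x - x_t) ** 2) / (2 * mu_t)

x_next = x_t - mu_t * g_t  # closed form from setting the derivative to 0

# The closed-form point beats random perturbations of itself
for _ in range(1000):
    assert objective(x_next) <= objective(x_next + 0.1 * rng.normal(size=4))
```

The smaller $\mu_t$ is, the stronger the quadratic pull toward $x_t$, so the step becomes more conservative, matching the intuition above.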
The paper goes on to show that the same intuitive update rule above can also be written as:

$$x_{t+1} = \arg\min_x \left( g_{1:t} \cdot x + \frac{1}{2} \sum_{s=1}^{t} \sigma_s \|x - x_s\|_2^2 \right)$$

where $\sigma_s$ is chosen so that $\sum_{s=1}^{t} \sigma_s = 1/\mu_t$.
This is pretty similar to the FTRL-Proximal formulation.
In fact, the gradient part (1st term) and the proximal strong-convexity part (2nd term) are the same, and these were the interesting parts for me.
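The equivalence can be verified numerically. In this sketch (gradient sequence made up), the rewritten FTRL-style objective – solved in closed form, since it is a plain quadratic – reproduces the gradient-descent iterates $x_{t+1} = x_t - \mu_t g_t$ exactly, once the $\sigma_s$ are chosen so their cumulative sum equals $1/\mu_t$.

```python
import numpy as np

# Check: argmin_x ( g_{1:t} . x + 0.5 * sum_s sigma_s * ||x - x_s||^2 )
# with sigma_{1:t} = 1/mu_t yields the same iterates as
# plain gradient descent  x_{t+1} = x_t - mu_t * g_t.

rng = np.random.default_rng(1)
T = 20
grads = rng.normal(size=(T, 3))          # one gradient vector per round (made up)
mu = 1.0 / np.sqrt(np.arange(1, T + 1))  # mu_t = 1/sqrt(t)
cum = 1.0 / mu                           # sigma_{1:t} = sqrt(t)
sigma = np.diff(cum, prepend=0.0)        # per-round sigma_s

x_gd = np.zeros(3)
xs = [np.zeros(3)]                       # x_1 = 0 for both runs
for t in range(T):
    # FTRL closed form: the objective is quadratic, so set its gradient to zero
    x_ftrl = (sum(s * x for s, x in zip(sigma[: t + 1], xs))
              - grads[: t + 1].sum(axis=0)) / cum[t]
    x_gd = x_gd - mu[t] * grads[t]       # plain gradient descent step
    assert np.allclose(x_ftrl, x_gd)     # identical iterates every round
    xs.append(x_ftrl)
```

So the only genuinely new ingredient in FTRL-Proximal relative to gradient descent is the $\lambda_1 \|x\|_1$ term, which the quadratic rewriting makes easy to bolt on.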