# How can it be trapped in a saddle point?

I am currently a bit puzzled by how mini-batch gradient descent can be trapped in a saddle point.

The solution might be so trivial that I don't get it.

You get a new mini-batch at every step, and the error is computed on that batch, so the cost function is only static for a single batch. That means the gradient should also change with each mini-batch. But according to the following, a vanilla implementation should still have issues with saddle points:

> Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [19] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

I would think that SGD in particular would have a clear advantage against saddle points, as it fluctuates on its way to convergence. The fluctuations, the random sampling, and the cost function being different for each mini-batch should be reason enough for it not to become trapped in one.

For full-batch gradient descent it makes sense that it can be trapped in a saddle point, as the error function is constant.

I am a bit confused about the other two cases.

Take a look at the image below from Off Convex. In a convex function (leftmost image), there is only one local minimum, which is also the global minimum. But in a non-convex function (rightmost image), there may be multiple local minima, and the point joining two local minima is often a saddle point. If you are approaching from a higher point, the gradient is comparatively flat, and you risk getting stuck there, especially if you are moving in only one direction.

Now the thing is, whether you are optimizing using mini-batch or stochastic gradient descent, the underlying non-convex function is the same, and the gradient is a property of this function. When doing mini-batch, you consider many samples at a time and take a gradient step averaged over all of them. This reduces variance. But if the averaged gradient direction still points toward the saddle point, you still risk getting stuck there. The analogy is: if you take two steps forward and one step back, averaged over those, you ultimately take one step forward.
If you perform SGD instead, you take all the steps one after the other, but if you’re still moving in a single direction, you can reach the saddle point and find that the gradient on all sides is fairly flat and the step size is too small to go over this flat part. This doesn’t have anything to do with whether you considered a bunch of points at once, or one by one in random order.
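To make this concrete, here is a minimal sketch in plain Python, using a hypothetical toy objective $f(x, y) = x^2 - y^2$ (which has a saddle point at the origin: it curves up along $x$ and down along $y$). If the iterate approaches the saddle exactly along the $x$-axis, the $y$-component of the gradient is always zero, and plain gradient descent converges straight to the saddle:

```python
# Toy objective f(x, y) = x**2 - y**2 with a saddle point at (0, 0).
def grad(x, y):
    return 2.0 * x, -2.0 * y  # (df/dx, df/dy)

x, y = 1.0, 0.0   # start exactly on the x-axis: no component along y
lr = 0.1
for _ in range(200):
    gx, gy = grad(x, y)
    x -= lr * gx
    y -= lr * gy

# y never moves off 0, so the iterate converges to the saddle (0, 0),
# where the gradient is close to zero in every direction.
print(x, y)
```

Full-batch, mini-batch, or single-sample updates all behave identically here, because nothing ever produces a component along the descending $y$ direction.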

Take a look at the visualization here. Even with SGD, if the fluctuations occur only along one dimension, with the steps getting smaller and smaller, it would converge at the saddle point. In this case, the mini-batch method would just reduce the amount of fluctuation but would not be able to change the gradient’s direction.

SGD can sometimes break out of simple saddle points, if the fluctuations are along other directions, and if the step size is large enough for it to go over the flatness. But sometimes the saddle regions can be fairly complex, such as in the image below.

The way methods like momentum, Adagrad, Adam, etc. are able to break out of this is by considering the past gradients. Consider momentum,

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$$

which adds a fraction of the previous update vector, $v_{t-1}$. If you have just been going back and forth in one direction, essentially changing signs, the momentum term ends up dampening your progress. Whereas if there has consistently been progress in one direction, it builds up and carries you down that way.
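A minimal sketch of this dampening/accumulation behaviour in plain Python, with illustrative (hypothetical) constants $\gamma = 0.9$ and $\eta = 0.1$:

```python
gamma, eta = 0.9, 0.1  # momentum coefficient and learning rate (illustrative values)

def run_momentum(gradients):
    v = 0.0
    for g in gradients:
        v = gamma * v + eta * g   # v_t = gamma * v_{t-1} + eta * grad
    return v

# Oscillating gradients (back and forth, changing sign): contributions cancel
# and the update stays small.
v_oscillating = run_momentum([+1.0, -1.0] * 25)

# Consistent gradients (same direction every step): the update builds up
# toward eta / (1 - gamma) = 1.0.
v_consistent = run_momentum([+1.0] * 50)

print(abs(v_oscillating), v_consistent)
```

With a consistent gradient, the geometric series builds the velocity up to roughly $\eta / (1 - \gamma)$, a tenfold amplification here, which is what carries the iterate across the flat region around a saddle.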