How does minibatch gradient descent update the weights for each example in a batch?

If we process, say, 10 examples in a batch, I understand we can sum the loss for each example, but how does backpropagation work with regard to updating the weights for each example?

For example:

  • Example 1 –> loss = 2
  • Example 2 –> loss = -2

This results in an average loss of 0 (E = 0), so how would this update each weight and converge? Is it simply by the randomization of the batches that we “hopefully” converge sooner or later? Also, doesn’t this only compute the gradient for the first set of weights for the last example processed?

Answer

Gradient descent doesn’t quite work the way you suggested, but a similar problem can occur.

The weight update is not driven by the value of the average loss over the batch; it is driven by the average of the gradients of the loss. A gradient is the derivative of the loss with respect to a weight. In a neural network, the gradient for one weight depends on the inputs of that specific example and also on many other weights in the model.

If your model has 5 weights and you have a mini-batch size of 2, then you might get this:

Example 1: loss = 2, gradients = (1.5, -2.0, 1.1, 0.4, -0.9)

Example 2: loss = 3, gradients = (1.2, 2.3, -1.1, -0.8, -0.7)

The gradients in this mini-batch are then averaged weight by weight, giving (1.35, 0.15, 0, -0.2, -0.8).

The benefit of averaging over several examples is that the variation in the gradient is lower, so the learning is more consistent and less dependent on the specifics of one example. Notice that the average gradient for the third weight is 0: this weight won’t change in this update, but its gradient will likely be non-zero for the next mini-batch, which uses different examples and is computed with the updated weights.
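The averaging step can be checked with a short sketch in plain Python. The two gradient vectors are the ones from the example; the starting weights and the learning rate are hypothetical values chosen only to show what a single update would look like.

```python
# Per-example gradients from the example (one entry per weight).
grads_example_1 = [1.5, -2.0, 1.1, 0.4, -0.9]
grads_example_2 = [1.2, 2.3, -1.1, -0.8, -0.7]
batch = [grads_example_1, grads_example_2]

# Average the gradients weight by weight across the mini-batch.
avg_grads = [sum(column) / len(batch) for column in zip(*batch)]

# Hypothetical starting weights and learning rate (not from the original post).
weights = [0.0, 0.0, 0.0, 0.0, 0.0]
learning_rate = 0.1

# One gradient-descent step: each weight is updated once per mini-batch,
# using its averaged gradient.
new_weights = [w - learning_rate * g for w, g in zip(weights, avg_grads)]

print("average gradients:", [round(g, 2) for g in avg_grads])
print("updated weights:  ", [round(w, 3) for w in new_weights])
```

Every weight receives exactly one update per mini-batch, and the third weight stays where it was because its averaged gradient is 0.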

Edit in response to comments:

In my example above the average of the gradients is computed. For a mini-batch of size k, we calculate the loss L_i for each example and aim to get the average gradient of the loss with respect to a weight w_j.

The way I wrote it in my example I averaged each gradient like: \frac{\partial L}{\partial w_j} = \frac{1}{k} \sum_{i=1}^{k} \frac{\partial L_i}{\partial w_j}

The tutorial code you linked to in the comments uses TensorFlow to minimize the average loss.

TensorFlow aims to minimize \frac{1}{k} \sum_{i=1}^{k} L_i

To minimize this, it computes the gradient of the average loss with respect to each weight and uses gradient descent to update the weights:

\frac{\partial L}{\partial w_j} = \frac{\partial }{\partial w_j} \frac{1}{k} \sum_{i=1}^{k} L_i

Because differentiation is linear, it can be brought inside the sum, so this is the same as the expression from the approach in my example:

\frac{\partial }{\partial w_j} \frac{1}{k} \sum_{i=1}^{k} L_i = \frac{1}{k} \sum_{i=1}^{k} \frac{\partial L_i}{\partial w_j}
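A small numerical check of this identity, as a sketch only: the data, the linear model, and the squared-error loss below are all hypothetical choices made for illustration, not taken from the linked tutorial. The gradient of the average loss is estimated with central finite differences and compared with the average of the analytic per-example gradients.

```python
# Hypothetical toy setup: two examples, linear model pred = w[0]*x[0] + w[1]*x[1].
xs = [(1.0, 2.0), (3.0, -1.0)]
ys = [1.0, 2.0]
w = [0.5, -0.3]
k = len(xs)

def predict(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def loss_i(w, x, y):
    """Squared-error loss L_i for a single example."""
    return (predict(w, x) - y) ** 2

def grad_i(w, x, y):
    """Analytic gradient of L_i with respect to each weight w_j."""
    error = predict(w, x) - y
    return [2.0 * error * xj for xj in x]

def mean_loss(w):
    """Average loss over the mini-batch: (1/k) * sum_i L_i."""
    return sum(loss_i(w, x, y) for x, y in zip(xs, ys)) / k

# Route 1: average the per-example gradients.
per_example = [grad_i(w, x, y) for x, y in zip(xs, ys)]
avg_of_grads = [sum(column) / k for column in zip(*per_example)]

# Route 2: differentiate the average loss (central finite differences).
eps = 1e-6
grad_of_avg = []
for j in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[j] += eps
    w_minus[j] -= eps
    grad_of_avg.append((mean_loss(w_plus) - mean_loss(w_minus)) / (2 * eps))

print("average of gradients:    ", avg_of_grads)
print("gradient of average loss:", grad_of_avg)
```

The two routes agree up to floating-point error, which is why minimizing the average loss and averaging the per-example gradients lead to the same weight update.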

Attribution
Source: Link, Question Author: carboncomputed, Answer Author: Hugh
