I am computing losses one observation at a time with a neural network implemented in PyTorch, and I am confused about a small detail of SGD. Each time I compute a loss and call loss.backward(), the gradients accumulate.
If I do this for 100 observations and then run optimizer.step(), should I average the gradients?

This is what I am doing as of now:

def compute_loss(training_data):
    for data in training_data:
        loss = F.mse_loss(data[0], data[1])
        loss.backward()

def optimize(sample):
    compute_loss(sample)
    optimizer.step()


Should it rather be:

def compute_loss(training_data):
    for data in training_data:
        loss = F.mse_loss(data[0], data[1])
        loss.backward(torch.Tensor([1.0/len(training_data)]))


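The following assumes a loss function $$f$$ that’s expressed as a sum, not an average. Expressing the loss as an average means that the scaling $$\frac{1}{n}$$ is “baked in” and no further action is needed. In particular, note that F.mse_loss uses reduction="mean" by default, so when it is called once on a whole minibatch, the result is already an average and no further modification is necessary to achieve an average of gradients. In that case, rescaling the gradients on top of reduction="mean" does not accomplish the desired result and amounts to multiplying the learning rate by $$\frac{1}{n}$$. When F.mse_loss is instead called on one observation at a time, as in the question, the mean is taken only over the elements of that observation, and repeated loss.backward() calls add their gradients together, so the accumulated gradient is a sum over observations.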
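To see what is averaged and what is summed, the following minimal sketch compares three ways of obtaining gradients for the same data. The linear model, the random data, and the helper name `grads` are assumptions made for illustration and are not taken from the question.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy setup: 100 observations, 3 features, 1 target value each.
n = 100
inputs = torch.randn(n, 3)
targets = torch.randn(n, 1)
model = torch.nn.Linear(3, 1)

def grads():
    # Copy the current gradients so later backward calls cannot change them.
    return [p.grad.clone() for p in model.parameters()]

# (a) One backward call per observation: gradients accumulate (are summed).
model.zero_grad()
for x, y in zip(inputs, targets):
    F.mse_loss(model(x), y).backward()
accumulated = grads()

# (b) Same loop, but each per-observation loss is scaled by 1/n first.
model.zero_grad()
for x, y in zip(inputs, targets):
    (F.mse_loss(model(x), y) / n).backward()
accumulated_scaled = grads()

# (c) A single call on the whole minibatch with the default reduction="mean".
model.zero_grad()
F.mse_loss(model(inputs), targets).backward()
batched = grads()

# (a) should equal n times (c); (b) should equal (c).
sum_matches = all(
    torch.allclose(a, n * b, rtol=1e-4, atol=1e-4)
    for a, b in zip(accumulated, batched)
)
mean_matches = all(
    torch.allclose(a, b, rtol=1e-4, atol=1e-4)
    for a, b in zip(accumulated_scaled, batched)
)
print(sum_matches, mean_matches)
```

Under these assumptions, both comparisons are expected to hold up to floating-point tolerance: the per-observation loop yields the sum of gradients, and dividing each loss by `n` recovers the batch mean.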

Suppose that $$G = \sum_{i=1}^n \nabla f(x_i)$$ is the sum of the gradients for some minibatch with $$n$$ samples. The SGD update with learning rate (step size) $$r$$ is
$$x^{(t+1)} = x^{(t)} - r G.$$

Now suppose that you use the mean of the gradients instead. This will change the update. If we use learning rate $$\tilde{r}$$, we have
$$x^{(t+1)} = x^{(t)} - \frac{\tilde{r}}{n} G.$$
These expressions can be made equal by re-scaling either $$r$$ or $$\tilde{r}$$: they coincide exactly when $$\tilde{r} = n r$$. So in that sense, the distinction between the mean and the sum is unimportant because $$r$$ is chosen by the researcher in either case, and choosing a good $$r$$ for the sum has an equivalent, rescaled $$\tilde{r}$$ for the mean.
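The rescaling argument can be checked numerically. The sketch below assumes a hypothetical scalar loss, 0.5 * (x - a)**2 per sample, chosen only because its gradient is easy to write by hand. It runs the sum-based update with step size `r` and the mean-based update with step size `n * r`.

```python
# Hypothetical per-sample loss f(x; a) = 0.5 * (x - a)**2; its gradient in x is (x - a).
samples = [1.0, 2.0, 4.0, 7.0]
n = len(samples)

def grad(x, a):
    return x - a

def sgd_sum(x, r, steps):
    # Update with the SUM of per-sample gradients: x <- x - r * G
    for _ in range(steps):
        G = sum(grad(x, a) for a in samples)
        x = x - r * G
    return x

def sgd_mean(x, r_tilde, steps):
    # Update with the MEAN of per-sample gradients: x <- x - (r_tilde / n) * G
    for _ in range(steps):
        G = sum(grad(x, a) for a in samples)
        x = x - (r_tilde / n) * G
    return x

r = 0.05
x_sum = sgd_sum(0.0, r, steps=20)
x_mean = sgd_mean(0.0, n * r, steps=20)  # rescaled learning rate
print(abs(x_sum - x_mean) < 1e-12)
```

With the learning rate rescaled by `n`, the two runs follow the same trajectory, so the choice between sum and mean only changes which numerical value of the learning rate is "good".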

One reason to prefer using the mean, though, is that this de-couples the learning rate and the minibatch size, so that changing the number of samples in the minibatch will not implicitly change the learning rate.
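As a rough illustration of this decoupling, the sketch below reuses the same kind of hypothetical scalar loss, 0.5 * (x - a)**2, and compares the size of a single update for a small batch and for a batch eight times larger drawn from the same values. The function name `step_sizes` is made up for this example.

```python
def step_sizes(batch, x=0.0, r=0.1):
    # Per-sample gradients of the hypothetical loss 0.5 * (x - a)**2.
    grads = [x - a for a in batch]
    sum_step = r * sum(grads)                # update size when summing
    mean_step = r * sum(grads) / len(batch)  # update size when averaging
    return sum_step, mean_step

small = [1.0, 3.0]
large = small * 8  # same values, 8 times as many samples

sum_small, mean_small = step_sizes(small)
sum_large, mean_large = step_sizes(large)

print(sum_small, sum_large)    # the summed step grows with the batch size
print(mean_small, mean_large)  # the averaged step stays the same
```

With the sum, the same `r` produces a step eight times larger on the bigger batch; with the mean, the step does not depend on how many samples are in the batch.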

Note that it’s standard to use the mean over the minibatch, rather than over the entire training set. However, the same re-scaling argument applies here, too: if you’re tuning the learning rate for a fixed-size data set, you’ll find a learning rate that works well, and this learning rate can be re-scaled to suit a gradient descent that uses the sum in place of the mean.