How do gradients propagate in an unrolled recurrent neural network?

I’m trying to understand how RNNs can be used to predict sequences by working through a simple example. Here is my simple network, consisting of one input, one hidden neuron, and one output:

The hidden neuron uses the sigmoid activation function, and the output is taken to be a simple linear output. So, I think the network works as follows: if the hidden unit starts in state $s$, and we are processing a data point that is a sequence of length $3$, $(x_1, x_2, x_3)$, then:

At time 1, the predicted value, $p^1$, is

$$p^1 = u\,\sigma(w s + v x^1)$$

At time 2, we have

$$p^2 = u\,\sigma\big(w\,\sigma(w s + v x^1) + v x^2\big)$$

At time 3, we have

$$p^3 = u\,\sigma\Big(w\,\sigma\big(w\,\sigma(w s + v x^1) + v x^2\big) + v x^3\Big)$$

So far so good?
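To make the recursion concrete, the three time steps above can be written as a short Python loop. This is only a sketch: the weight names $u$, $v$, $w$ follow the network description, and all numeric values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values only: v is input->hidden, w is hidden->hidden,
# u is hidden->output, and s is the initial hidden state.
u, v, w = 0.5, 0.3, 0.2
s = 0.0
xs = [1.0, 2.0, 3.0]           # the sequence (x1, x2, x3)

y = s                          # y^0 = s
ps = []
for x in xs:
    z = w * y + v * x          # input to the hidden neuron at time t
    y = sigmoid(z)             # hidden activation y^t
    ps.append(u * y)           # linear output p^t = u * y^t

print(ps)                      # predictions (p^1, p^2, p^3)
```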

The “unrolled” RNN looks like this:

If we use a sum of square error term for the objective function, then how is it defined? On the whole sequence? In which case we would have something like $E=(p^1-x^1)^2+(p^2-x^2)^2+(p^3-x^3)^2$?

Are weights updated only once the entire sequence has been looked at (in this case, the 3-point sequence)?

As for the gradient with respect to the weights, we need to calculate $dE/dw$, $dE/dv$ and $dE/du$, which I will attempt to do simply by examining the 3 equations for $p^i$ above, if everything else looks correct. Besides doing it that way, this doesn’t look like vanilla back-propagation to me, because the same parameters appear in different layers of the network. How do we adjust for that?

If anyone can help guide me through this toy example, I would be very appreciative.

I think you need target values. So for the sequence $(x_1, x_2, x_3)$, you’d need corresponding targets $(t_1, t_2, t_3)$. Since you seem to want to predict the next term of the original input sequence, you’d need:

$$t_1 = x_2, \qquad t_2 = x_3, \qquad t_3 = x_4$$

You’d need to define $x_4$, so if you had an input sequence of length $N$ to train the RNN with, you’d only be able to use the first $N-1$ terms as input values and the last $N-1$ terms as target values.
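In code, that splitting is just a pair of slices. A minimal Python sketch (the sequence values are made up):

```python
# A made-up length-N sequence; with next-step targets, only the first
# N-1 terms can serve as inputs and the last N-1 terms as targets.
seq = [0.1, 0.5, 0.2, 0.9, 0.4]          # N = 5

inputs  = seq[:-1]                        # x_1 .. x_{N-1}
targets = seq[1:]                         # t_i = x_{i+1}

print(inputs)    # [0.1, 0.5, 0.2, 0.9]
print(targets)   # [0.5, 0.2, 0.9, 0.4]
```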

> If we use a sum of square error term for the objective function, then how is it defined?

As far as I’m aware, you’re right – the error is the sum across the whole sequence. This is because the weights $u$, $v$ and $w$ are the same across the unfolded RNN.

So,

$$E = \sum_{t=1}^{3} E^t, \qquad \text{where } E^t = \left(p^t - t_t\right)^2$$

> Are weights updated only once the entire sequence has been looked at (in this case, the 3-point sequence)?

Yes, I believe so, if you are using back propagation through time: the gradient contributions are accumulated across all the time steps, and the weights are updated once at the end of the sequence.

As for the differentials, you won’t want to expand the whole expression out for $E$ and differentiate it when it comes to larger RNNs. So, some notation can make it neater:

• Let $z^t$ denote the input to the hidden neuron at time $t$, i.e. $z^t = wy^{t-1} + vx^t$ (so $z^1 = ws + vx^1$)
• Let $y^t$ denote the output of the hidden neuron at time $t$, i.e. $y^t = \sigma(z^t)$ (so $y^1 = \sigma(ws + vx^1)$)
• Let $y^0 = s$
• Let $\delta^t = \frac{\partial E}{\partial z^t}$

Then, the derivatives are:

$$\frac{\partial E}{\partial u} = \sum_{t=1}^{T} 2\left(p^t - t_t\right) y^t, \qquad \frac{\partial E}{\partial v} = \sum_{t=1}^{T} \delta^t x^t, \qquad \frac{\partial E}{\partial w} = \sum_{t=1}^{T} \delta^t y^{t-1}$$

Where $t \in [1,\ T]$ for a sequence of length $T$, and:

$$\delta^T = 2\left(p^T - t_T\right) u\,\sigma'(z^T), \qquad \delta^t = \left(2\left(p^t - t_t\right) u + \delta^{t+1} w\right) \sigma'(z^t) \ \text{ for } t < T$$

This recurrent relation comes from realising that the $t^{th}$ hidden activity not only affects the error at the $t^{th}$ output, $E^t$, but it also affects the rest of the error further down the RNN, $E - E^t$:

$$\delta^t = \frac{\partial E^t}{\partial z^t} + \frac{\partial \left(E - E^t\right)}{\partial z^{t+1}} \frac{\partial z^{t+1}}{\partial z^t} = 2\left(p^t - t_t\right) u\,\sigma'(z^t) + \delta^{t+1}\, w\, \sigma'(z^t)$$

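To sanity-check the recurrence, here is a sketch of BPTT for this toy RNN in plain Python, compared against finite-difference gradients. This is only an illustration: the function names `forward`, `loss` and `bptt` and all numeric values are made up, and it uses $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(u, v, w, s, xs):
    """Return hidden inputs z^t, activations y^t (with y^0 = s), outputs p^t."""
    zs, ys, ps = [], [s], []
    for x in xs:
        z = w * ys[-1] + v * x
        zs.append(z)
        ys.append(sigmoid(z))
        ps.append(u * ys[-1])
    return zs, ys, ps

def loss(u, v, w, s, xs, ts):
    """Sum-of-squares error over the whole sequence."""
    _, _, ps = forward(u, v, w, s, xs)
    return sum((p - t) ** 2 for p, t in zip(ps, ts))

def bptt(u, v, w, s, xs, ts):
    """dE/du, dE/dv, dE/dw via the delta^t recurrence."""
    T = len(xs)
    zs, ys, ps = forward(u, v, w, s, xs)
    # ys[t + 1] is y^t, because ys[0] holds y^0 = s
    dE_du = sum(2 * (ps[t] - ts[t]) * ys[t + 1] for t in range(T))
    deltas = [0.0] * T                               # delta^t = dE/dz^t
    for t in reversed(range(T)):
        sig_prime = ys[t + 1] * (1 - ys[t + 1])      # sigma'(z^t)
        deltas[t] = 2 * (ps[t] - ts[t]) * u * sig_prime
        if t < T - 1:                                # error from later steps
            deltas[t] += deltas[t + 1] * w * sig_prime
    dE_dv = sum(deltas[t] * xs[t] for t in range(T))
    dE_dw = sum(deltas[t] * ys[t] for t in range(T))  # ys[t] is y^{t-1}
    return dE_du, dE_dv, dE_dw

# Made-up numbers for a quick check against central differences.
u, v, w, s = 0.5, 0.3, 0.2, 0.0
xs, ts = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]
g_u, g_v, g_w = bptt(u, v, w, s, xs, ts)

eps = 1e-6
n_u = (loss(u + eps, v, w, s, xs, ts) - loss(u - eps, v, w, s, xs, ts)) / (2 * eps)
n_v = (loss(u, v + eps, w, s, xs, ts) - loss(u, v - eps, w, s, xs, ts)) / (2 * eps)
n_w = (loss(u, v, w + eps, s, xs, ts) - loss(u, v, w - eps, s, xs, ts)) / (2 * eps)
print((g_u, g_v, g_w), (n_u, n_v, n_w))
```

The analytic and numerical gradients should agree to several decimal places, which is a useful check when implementing BPTT by hand.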
> Besides doing it that way, this doesn’t look like vanilla back-propagation to me, because the same parameters appear in different layers of the network. How do we adjust for that?

This method is called back propagation through time (BPTT), and is similar to back propagation in the sense that it uses repeated application of the chain rule. The adjustment for the shared parameters is simply to sum each weight’s gradient contributions over every time step at which it appears – that is what the sums over $t$ in the derivatives above do.

A more detailed but complicated worked example for an RNN can be found in Chapter 3.2 of ‘Supervised Sequence Labelling with Recurrent Neural Networks’ by Alex Graves – really interesting read!