How do gradients propagate in an unrolled recurrent neural network?

I’m trying to understand how RNNs can be used to predict sequences by working through a simple example. Here is my simple network, consisting of one input, one hidden neuron, and one output:

[figure: a network with one input unit, one hidden neuron, and one output unit]

The hidden neuron applies the sigmoid function, and the output is taken to be a simple linear unit. So, I think the network works as follows: if the hidden unit starts in state $s$, and we are processing a data point that is a sequence of length 3, $(x_1, x_2, x_3)$, then:

At time 1, the predicted value, $p_1$, is

$$p_1 = u\,\sigma(ws + vx_1)$$

At time 2, we have

$$p_2 = u\,\sigma\big(w\,\sigma(ws + vx_1) + vx_2\big)$$

At time 3, we have

$$p_3 = u\,\sigma\Big(w\,\sigma\big(w\,\sigma(ws + vx_1) + vx_2\big) + vx_3\Big)$$

So far so good?
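To sanity-check this, here is a small Python sketch (the weights and inputs are made up) that computes the predictions both via the step-by-step recurrence and via the fully nested expression for $p_3$:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# arbitrary made-up weights, initial state and inputs, just for checking
u, v, w, s = 0.5, -0.3, 0.8, 0.1
x = [0.2, -0.4, 0.7]  # (x1, x2, x3)

# step-by-step recurrence: y_t = sigmoid(w*y_{t-1} + v*x_t), p_t = u*y_t
y = s
preds = []
for xt in x:
    y = sigmoid(w * y + v * xt)
    preds.append(u * y)

# fully nested expression for p3, written out as in the equations above
p3 = u * sigmoid(w * sigmoid(w * sigmoid(w * s + v * x[0]) + v * x[1]) + v * x[2])

print(preds[2], p3)  # the two computations agree
```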

The “unrolled” RNN looks like this:

[figure: the RNN unrolled over the three time steps]

If we use a sum of square error term for the objective function, then how is it defined? On the whole sequence? In which case we would have something like $E = (p_1 - x_1)^2 + (p_2 - x_2)^2 + (p_3 - x_3)^2$?

Are weights updated only once the entire sequence was looked at (in this case, the 3-point sequence)?

As for the gradient with respect to the weights, we need to calculate $\partial E/\partial w$, $\partial E/\partial v$, $\partial E/\partial u$, which I will attempt to do simply by examining the 3 equations for $p_i$ above, if everything else looks correct. Besides doing it that way, this doesn’t look like vanilla back-propagation to me, because the same parameters appear in different layers of the network. How do we adjust for that?

If anyone can help guide me through this toy example, I would be very appreciative.

Answer

I think you need target values. So for the sequence $(x_1, x_2, x_3)$, you’d need corresponding targets $(t_1, t_2, t_3)$. Since you seem to want to predict the next term of the original input sequence, you’d need:
$$t_1 = x_2, \quad t_2 = x_3, \quad t_3 = x_4$$

You’d need to define $x_4$, so if you had an input sequence of length $N$ to train the RNN with, you’d only be able to use the first $N-1$ terms as input values and the last $N-1$ terms as target values.
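For instance, a minimal sketch of this slicing, using a made-up sequence:

```python
# a made-up training sequence of length N = 5
seq = [0.1, 0.4, -0.2, 0.9, 0.3]

inputs = seq[:-1]   # first N-1 terms: x_1, ..., x_{N-1}
targets = seq[1:]   # last N-1 terms:  t_k = x_{k+1}

print(list(zip(inputs, targets)))  # each input paired with the next term as its target
```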

If we use a sum of square error term for the objective function, then how is it defined?

As far as I’m aware, you’re right – the error is summed across the whole sequence. This is because the weights $u$, $v$ and $w$ are shared across the unfolded RNN.

So,
$$E = \sum_t E_t = \sum_t (t_t - p_t)^2$$

Are weights updated only once the entire sequence was looked at (in this case, the 3-point sequence)?

Yes, if using back propagation through time then I believe so.

As for the derivatives, you won’t want to expand the whole expression for $E$ and differentiate it when it comes to larger RNNs. So, some notation can make it neater:

  • Let $z_t$ denote the input to the hidden neuron at time $t$ (i.e. $z_1 = ws + vx_1$)
  • Let $y_t$ denote the output of the hidden neuron at time $t$ (i.e. $y_1 = \sigma(ws + vx_1)$)
  • Let $y_0 = s$
  • Let $\delta_t = \frac{\partial E}{\partial z_t}$

Then, the derivatives are:

$$\frac{\partial E}{\partial u} = \sum_t \frac{\partial E_t}{\partial p_t}\,y_t, \qquad \frac{\partial E}{\partial v} = \sum_t \delta_t x_t, \qquad \frac{\partial E}{\partial w} = \sum_t \delta_t y_{t-1}$$

Where $t \in [1, T]$ for a sequence of length $T$, and:

$$\delta_t = \sigma'(z_t)\left(u\,\frac{\partial E_t}{\partial p_t} + \delta_{t+1} w\right)$$

with $\frac{\partial E_t}{\partial p_t} = -2(t_t - p_t)$ for the squared error above, and $\delta_{T+1} = 0$.

This recurrent relation comes from realising that the $t$th hidden activity not only affects the error at the $t$th output, $E_t$, but also affects the rest of the error further down the RNN, $E - E_t$:

$$\begin{aligned}
\frac{\partial E}{\partial z_t} &= \frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial z_t} + \frac{\partial (E - E_t)}{\partial z_{t+1}}\frac{\partial z_{t+1}}{\partial y_t}\frac{\partial y_t}{\partial z_t} \\
&= \frac{\partial y_t}{\partial z_t}\left(\frac{\partial E_t}{\partial y_t} + \frac{\partial (E - E_t)}{\partial z_{t+1}}\frac{\partial z_{t+1}}{\partial y_t}\right) \\
&= \sigma'(z_t)\left(u\,\frac{\partial E_t}{\partial p_t} + \frac{\partial (E - E_t)}{\partial z_{t+1}}\,w\right) \\
\delta_t = \frac{\partial E}{\partial z_t} &= \sigma'(z_t)\left(u\,\frac{\partial E_t}{\partial p_t} + \delta_{t+1} w\right)
\end{aligned}$$

using $\frac{\partial E_t}{\partial y_t} = u\,\frac{\partial E_t}{\partial p_t}$, $\frac{\partial z_{t+1}}{\partial y_t} = w$, and $\frac{\partial (E - E_t)}{\partial z_{t+1}} = \frac{\partial E}{\partial z_{t+1}} = \delta_{t+1}$ (since $E_t$ does not depend on $z_{t+1}$).
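To make this concrete, here is a short Python sketch (weights, initial state and data are all made up) that runs the forward pass, applies the $\delta_t$ recursion backwards with the squared-error factor $\partial E_t/\partial p_t = -2(t_t - p_t)$, and checks the three gradients against central finite differences:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss_and_grads(u, v, w, s, xs, ts):
    """Forward pass, then BPTT via the delta_t recursion."""
    T = len(xs)
    ys = [s]                      # ys[t] is y_{t-1}; y_0 = s
    for t in range(T):
        ys.append(sigmoid(w * ys[-1] + v * xs[t]))
    ps = [u * ys[t + 1] for t in range(T)]
    E = sum((ts[t] - ps[t]) ** 2 for t in range(T))

    # backward: delta_t = sigma'(z_t) * (u * dE_t/dp_t + w * delta_{t+1})
    dE_du = dE_dv = dE_dw = 0.0
    delta_next = 0.0              # delta_{T+1} = 0
    for t in reversed(range(T)):
        e_t = -2.0 * (ts[t] - ps[t])                # dE_t/dp_t
        sig_prime = ys[t + 1] * (1.0 - ys[t + 1])   # sigma'(z_t)
        delta = sig_prime * (u * e_t + w * delta_next)
        dE_du += e_t * ys[t + 1]                    # sum_t (dE_t/dp_t) y_t
        dE_dv += delta * xs[t]                      # sum_t delta_t x_t
        dE_dw += delta * ys[t]                      # sum_t delta_t y_{t-1}
        delta_next = delta
    return E, dE_du, dE_dv, dE_dw

u, v, w, s = 0.5, -0.3, 0.8, 0.1
xs, ts = [0.2, -0.4, 0.7], [-0.4, 0.7, 0.1]

E, gu, gv, gw = loss_and_grads(u, v, w, s, xs, ts)

# central finite-difference check of each gradient
eps = 1e-6
for name, g, (du, dv, dw) in [("u", gu, (eps, 0, 0)),
                              ("v", gv, (0, eps, 0)),
                              ("w", gw, (0, 0, eps))]:
    Ep = loss_and_grads(u + du, v + dv, w + dw, s, xs, ts)[0]
    Em = loss_and_grads(u - du, v - dv, w - dw, s, xs, ts)[0]
    print(name, abs(g - (Ep - Em) / (2 * eps)) < 1e-6)
```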

Besides doing it that way, this doesn’t look like vanilla back-propagation to me, because the same parameters appear in different layers of the network. How do we adjust for that?

This method is called back propagation through time (BPTT), and is similar to back propagation in the sense that it uses repeated application of the chain rule.

A more detailed but complicated worked example for an RNN can be found in Chapter 3.2 of ‘Supervised Sequence Labelling with Recurrent Neural Networks’ by Alex Graves – really interesting read!

Attribution
Source: Link, Question Author: Fequish, Answer Author: dok
