Why do RNNs have a tendency to suffer from vanishing/exploding gradient?

Why do recurrent neural networks (RNNs) have a tendency to suffer from vanishing/exploding gradient?

For a definition of vanishing/exploding gradients, see Pascanu et al. (2013), On the difficulty of training recurrent neural networks, Section 2.

Answer

TL;DR

The main reasons are the following two traits of backpropagation through time (BPTT):

  1. An unrolled RNN tends to be a very deep network.
  2. In an unrolled RNN the gradient in an early layer is a product that (also) contains many instances of the same term.

Long Version

To train an RNN, people usually use backpropagation through time (BPTT): you choose a number of time steps $N$ and unroll your network so that it becomes a feedforward network made of $N$ duplicates of the original network, with each duplicate representing the original network at a different time step.

[Figure: unrolling an RNN (image source: Wikipedia)]

So BPTT is just unrolling your RNN, and then using backpropagation to calculate the gradient (as one would do to train a normal feedforward network).
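
To make this concrete, here is a minimal NumPy sketch of BPTT for a hypothetical scalar RNN with a tanh activation (the names $w$, $u$, $N$ and the loss $L = h_N$ are illustrative choices of mine, not part of the original answer). The point to notice is that every factor of the backpropagated gradient contains the same recurrent weight $w$:

```python
import numpy as np

# Hypothetical scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t)
# (w, u, N and the loss L = h_N are illustrative choices, not from the answer above)
np.random.seed(0)
N = 30                         # number of unrolled time steps
w, u = 1.5, 0.5                # recurrent and input weights
xs = np.random.randn(N)        # input sequence

# Forward pass: unrolling turns the RNN into N "layers" that all share the weight w
hs = [0.0]
for x in xs:
    hs.append(np.tanh(w * hs[-1] + u * x))

# BPTT for the loss L = h_N: dL/dh_0 = prod_{t=1}^{N} dh_t/dh_{t-1},
# and every factor dh_t/dh_{t-1} = w * (1 - h_t^2) contains the SAME weight w.
grad = 1.0
for t in range(N, 0, -1):
    grad *= w * (1.0 - hs[t] ** 2)
print(f"|dL/dh_0| after backpropagating through {N} steps: {abs(grad):.3e}")
```

Depending on $w$ and on how strongly the tanh saturates, this magnitude typically ends up either tiny or huge, which is exactly the vanishing/exploding behaviour discussed below.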

Cause 1: The unrolled network is usually very deep

Because our feedforward network was created by unrolling, it is $N$ times as deep as the original RNN. Thus the unrolled network is often very deep.

In deep feedforward neural networks, backpropagation has “the unstable gradient problem”, as Michael Nielsen explains in the chapter Why are deep neural networks hard to train? (in his book Neural Networks and Deep Learning):

[…] the gradient in early layers is the product of terms from all the later layers. When there are many layers, that’s an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out.

I.e. the earlier the layer, the longer the product becomes, and the more unstable the gradient becomes. (For a more rigorous explanation, see this answer.)

Cause 2: The product that gives the gradient contains many instances of the same term

The product that gives the gradient includes the weights of every later layer.
So in a normal feedforward neural network, this product for the $d^{\text{th}}$-to-last layer might look like: $$w_1\cdot\alpha_{1}\cdot w_2\cdot\alpha_{2}\cdot\ \cdots\ \cdot w_d\cdot\alpha_{d}$$
Nielsen explains that this product (in absolute value) tends to be either very big or very small for a large $d$.
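
As a quick numerical illustration of Nielsen's point (the distributions below are my own illustrative choices), multiply $d$ independent factors $w_i\cdot\alpha_i$ and look at the magnitude of the result:

```python
import numpy as np

# Hypothetical feedforward case: d independent factors w_i * alpha_i
# (the distributions for w_i and alpha_i are illustrative choices)
np.random.seed(0)
d = 50
w = np.random.randn(d)                  # a different weight in every layer
alpha = np.random.uniform(0.0, 1.0, d)  # activation derivative in each layer
product = np.prod(w * alpha)
print(f"product of {d} distinct w_i * alpha_i factors: {product:.3e}")
# For large d the magnitude of this product is almost never close to 1:
# the factors would all have to "balance out", which rarely happens.
```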

But in an unrolled RNN, this product would look like: $$w\cdot\alpha_{1}\cdot w\cdot\alpha_{2}\cdot\ \cdots\ \cdot w\cdot\alpha_{d}$$
as the unrolled network is composed of duplicates of the same network.

Whether we are dealing with numbers or matrices, the appearance of the same term $d$ times makes the product much more unstable: distinct weights have at least some chance of “balancing out”, but when every factor contains the same $w$, the product behaves roughly like $w^d$ (times the activation derivatives), so it shrinks or grows exponentially unless $|w\cdot\alpha|$ stays almost exactly at $1$.

And so the product (in absolute value) tends to be either exponentially small or exponentially big (for a large $d$).

In other words, the fact that the unrolled RNN is composed of duplicates of the same network makes the unrolled network’s “unstable gradient problem” more severe than in a normal deep feedforward network.
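
To see the difference numerically, here is the same kind of experiment as in the previous snippet, but with one shared weight $w$ in every one of the $d$ factors (again, the values are illustrative choices of mine):

```python
import numpy as np

# Hypothetical RNN-style case: the SAME weight w in every one of the d factors
# (the values of d, w and the alpha range are illustrative choices)
np.random.seed(0)
d = 50
alpha = np.random.uniform(0.9, 1.0, d)  # activation derivatives close to 1

for w in (0.8, 1.0, 1.2):               # three illustrative shared weights
    product = np.prod(w * alpha)
    print(f"w = {w}: |product of {d} identical-w factors| = {abs(product):.3e}")
# With distinct weights, some factors can offset others; with one shared w,
# the product scales roughly like |w|**d, so it vanishes or explodes
# exponentially in d unless |w * alpha| sits almost exactly at 1.
```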

Attribution
Source: Link, Question Author: user70394, Answer Author: Community
