Recurrent Neural Network (RNN) topology: why always fully-connected?

I’ve started reading about Recurrent Neural Networks (RNNs) and Long Short Term Memory (LSTM) …(…oh, not enough rep points here to list references…)

One thing I don’t get: It always seems that neurons in each instance of a hidden layer get “fully connected” with every neuron in the previous instance of the hidden layer, rather than just being connected to the instance of their former self/selves (and maybe a couple others).

Is the fully-connectedness really necessary? Seems like you could save a lot of storage & execution time, and ‘lookback’ farther in time, if it isn’t necessary.

Here’s a diagram of my question…

rnn image: fully connected vs partially connected

I think this amounts to asking if it’s ok to keep only the diagonal (or near-diagonal) elements in the “W^hh” matrix of ‘synapses’ between the recurring hidden layer. I tried running this using a working RNN code (based on Andrew Trask’s demo of binary addition) — i.e., set all the non-diagonal terms to zero — and it performed terribly, but keeping terms near the diagonal, i.e. a banded linear system 3 elements wide — seemed to work as good as the fully-connected version. Even when I increased the sizes of inputs & hidden layer…. So…did I just get lucky?

I found a paper by Lai Wan Chan where he demonstrates that for linear activation functions, it’s always possible to reduce a network to “Jordan canonical form” (i.e. the diagonal and nearby elements). But no such proof seems available for sigmoids & other nonlinear activations.

I’ve also noticed that references to “partially-connected” RNNs just seem mostly to disappear after about 2003, and the treatments I’ve read from the past few years all seem to assume fully-connectedness. So…why is that?

EDIT: 3 years after this question was posted, NVIDIA released this paper, arXiv:1905.12340: “Rethinking Full Connectivity in Recurrent Neural
Networks”
, showing that sparser connections are usually just as accurate and much faster than fully-connected networks. The second diagram above corresponds to the “Diagonal RNN” in the arXiv paper. NVIDIA, I’d be happy to collaborate on future efforts… 😉

Answer

One reason might be due to the mathematical convenience. The vanilla recurrent neural network (Elman-type) can be formularized as:

ht=f(xt,ht1), where f() can be written as σ(Wxt+Uht1).

The above equation corresponds to your first picture. Of course you can make the recurrent matrix U sparse to restrict the connections, but that does not affect the core idea of the RNN.

BTW: There exists two kinds of memories in a RNN. One is the input-to-hidden weight W, which mainly stores the information from the input. The other is the hidden-to-hidden matrix U, which is used to store the histories. Since we do not know which parts of histories will affect our current prediction, the most reasonable way might be to allow all possible connetions and let the network learn for themself.

Attribution
Source : Link , Question Author : sh37211 , Answer Author : fishiwhj

Leave a Comment