It is said that RNNs suffer from vanishing gradients on long sequences, and that LSTMs are a solution because they can keep both long-term and short-term memory. But I cannot understand two basic things:

1- Both outputs of an LSTM cell (the cell state and the hidden state) are calculated from the previous cell state, the previous hidden state, and the input. Such a recursion should give both the cell state and the hidden state long memories. What is the difference?

2- How can we say that LSTMs reduce the chance of vanishing gradients? If the gates allow long memory, vanishing gradients will also happen. If they don't, and therefore block long memory chains, how can we say we have long-memory operation?

**Answer**

Regarding question (2), vanishing/exploding gradients happen in LSTMs too.

In vanilla RNNs, the gradient contains a factor raised to the power T, where T is the number of steps over which you backpropagate [1]. This means that factors greater than 1 explode and factors less than 1 shrink very fast.
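To see the exponential behavior numerically, here is a minimal sketch (plain Python, illustrative factor values only, not an actual RNN Jacobian):

```python
# Illustrative only: a vanilla-RNN gradient contains a factor raised to T,
# the number of backpropagation steps. Even modest factors collapse or blow up.
T = 100  # backpropagation horizon

shrink = 0.9 ** T   # factor < 1: the gradient vanishes
explode = 1.1 ** T  # factor > 1: the gradient explodes

print(f"0.9^{T} = {shrink:.3e}")   # ~2.7e-05
print(f"1.1^{T} = {explode:.3e}")  # ~1.4e+04
```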

Gradients in LSTMs, on the other hand, do not contain a term raised to the power T [2]. The reason is that the cell state is updated additively, $c_t = f_t \odot c_{t-1} + i_t \odot g_t$, so backpropagation through the cell state multiplies by the per-step forget gates $f_t$ rather than repeatedly by the same weight matrix. The gradient can still shrink or explode, but at a lower rate than in a vanilla RNN.

[1]: Pascanu, R., Mikolov, T., Bengio, Y., On the difficulty of training Recurrent Neural Networks, Feb. 2013 – https://arxiv.org/pdf/1211.5063.pdf

[2]: Bayer, Justin Simon, Learning Sequence Representations, PhD dissertation, Technische Universität München, 2015 – cited in https://stats.stackexchange.com/a/263956/191233

**Attribution**
*Source: Link, Question Author: Shahriar49, Answer Author: joaoaccarvalho*