DeepMind’s 2015 paper on deep reinforcement learning states that “Previous attempts to combine RL with neural networks had largely failed due to unstable learning”. The paper then lists some causes of this instability, based on correlations across the observations.
Please could somebody explain what this means? Is it a form of overfitting, where the neural network learns some structure which is present in training, but may not be present at testing? Or does it mean something else?
The paper can be found here: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
And the section I am trying to understand is:
Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values and the target values.
We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution. Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
The main problem is that, as in many other fields, DNNs can be hard to train. Here, one issue is the correlation of the input data: if you think about a video game (they actually use games to test their algorithms), screenshots taken one step after another are highly correlated, because the game evolves “continuously”. For NNs this can be a problem: doing many iterations of gradient descent on similar, correlated inputs may cause the network to overfit them and/or fall into a local minimum. This is why they use experience replay: they store a series of “snapshots” of the game, shuffle them, and sample from them later for training. That way, the training data are no longer correlated.
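To make the idea concrete, here is a minimal sketch of an experience-replay buffer (hypothetical names and transition format; the paper stores transitions of state, action, reward, next state):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay sketch (illustrative, not DeepMind's code)."""

    def __init__(self, capacity=10000):
        # Bounded deque: the oldest snapshots fall off the front when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling across the whole buffer breaks the
        # temporal correlation between consecutive screenshots.
        return random.sample(self.buffer, batch_size)


# Toy usage: store 50 fake transitions, then draw a shuffled minibatch.
buf = ReplayBuffer(capacity=100)
for t in range(50):
    buf.store(t, t % 4, 1.0, t + 1, False)
batch = buf.sample(8)
```

The key design point is that the minibatch is drawn at random from many past time steps, so consecutive gradient updates no longer see consecutive (and thus correlated) frames.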
Then, they note that during training the Q values predicted by the NN can change the ongoing policy, making the agent prefer only a subset of actions and causing it to collect data that is correlated for the same reasons as before. This is also why the targets are a problem: if the network that computes the target values is the same one being updated, every gradient step shifts the targets too. So they train the Q-network against target values that are only updated periodically, which reduces the correlation between the action-values and the targets while the agent keeps exploring the game.