Is deep Q-learning inherently unstable?

I’m reading Sutton and Barto’s Reinforcement Learning: An Introduction, and in Chapter 11 they present the “deadly triad”:

  1. Function approximation
  2. Bootstrapping
  3. Off-policy training

They state that an algorithm combining all three of these is unstable and liable to diverge during training. My thought is: doesn’t deep Q-learning hit all three? It certainly uses function approximation in the form of a deep neural network; it uses bootstrapping, since it is a form of temporal-difference learning whose updates are based on estimated future Q-values; and it is trained off-policy, because its update target takes the maximum over next-state Q-values (implicitly evaluating a greedy target policy) while the behavior policy that actually generates the data (e.g. ε-greedy) is generally not greedy.
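To make the bootstrapping and off-policy points concrete, here is a toy sketch of a single Q-learning update (all names and numbers are illustrative, and a tabular array stands in for the deep network):

```python
import numpy as np

# Toy Q-learning update for one transition (s, a, r, s').
rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
Q = rng.normal(size=(n_states, n_actions))  # stand-in for a function approximator

gamma = 0.99
s, a, r, s_next = 0, 1, 1.0, 2

# Bootstrapping: the target depends on the current Q estimates at s'.
# Off-policy: the max implies a greedy target policy, even though the data
# may have come from an epsilon-greedy behavior policy.
td_target = r + gamma * Q[s_next].max()
td_error = td_target - Q[s, a]
Q[s, a] += 0.1 * td_error  # move the estimate toward the bootstrapped target
```

The max in the target is exactly what makes the update off-policy: it evaluates the greedy policy regardless of which action the behavior policy would have chosen.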

It seems to me, then, that deep Q-learning should be inherently unstable. Is this true, or is my understanding wrong somewhere? If it is in fact inherently unstable, a follow-up question: is it unstable in practice? That is, is there a wide class of problems for which deep Q-learning is unstable, or is it generally fine for the vast majority of problems, with only a small set on which it might diverge?


Given that tricks such as replay memory, gradient clipping, reward clipping, carefully chosen rollout strategies, and a target network are often necessary to achieve reasonable performance, and that even then training can be unstable, then yes, it does seem to be true in practice.
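Two of those tricks, replay memory and a target network, can be sketched as follows. This is a minimal illustration with made-up transitions and a tabular array in place of the networks, not the DQN implementation itself:

```python
import random
from collections import deque
import numpy as np

gamma, lr, sync_every = 0.99, 0.1, 50
Q = np.zeros((4, 2))          # "online network" (tabular stand-in)
Q_target = Q.copy()           # frozen target network
buffer = deque(maxlen=1000)   # replay memory

rng = random.Random(0)
for step in range(200):
    # Fake transition for illustration; a real agent would act in an environment.
    s, a = rng.randrange(4), rng.randrange(2)
    r, s_next = rng.random(), rng.randrange(4)
    buffer.append((s, a, r, s_next))

    # Sample a minibatch from replay memory to break temporal correlations.
    batch = rng.sample(list(buffer), k=min(8, len(buffer)))
    for (s, a, r, s_next) in batch:
        # Bootstrap against the *frozen* target network, not the online one,
        # so the regression target does not move with every update.
        td_target = r + gamma * Q_target[s_next].max()
        Q[s, a] += lr * (td_target - Q[s, a])

    # Periodically sync the target network with the online network.
    if step % sync_every == 0:
        Q_target = Q.copy()
```

Replay sampling decorrelates consecutive updates, and the periodically synced target network keeps the bootstrap target fixed between syncs; both address (but do not eliminate) the instability the triad predicts.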

That doesn’t mean it doesn’t work in practice: DeepMind’s Atari paper showed that it is indeed possible, with the help of the aforementioned tricks. However, it is fairly challenging and typically requires tens of millions of steps to train properly.

Source: Link, Question Author: enumaris, Answer Author: shimao