Why does experience replay require an off-policy algorithm?

In the paper introducing DQN, “Playing Atari with Deep Reinforcement Learning”, the authors mention:

Note that when learning by experience replay, it is necessary to learn off-policy
(because our current parameters are different to those used to generate the sample), which motivates
the choice of Q-learning.

I don’t quite understand what this means. What if we used SARSA and also stored the action a' that we are to take in s' in our replay memory, then sampled batches from it and updated Q as we do in DQN? And can actor-critic methods (A3C, specifically) use experience replay? If not, why not?
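To make the distinction concrete, here is a minimal tabular sketch (the numbers and the stored transition are illustrative, not from the paper) contrasting the two targets on a replayed transition. The SARSA target depends on the stored a', which an older policy chose; the Q-learning target takes the max over the current Q-values, so it never consults the old policy:

```python
import numpy as np

# Hypothetical Q-table and a stored transition (s, a, r, s_next, a_next),
# where a_next was chosen by the policy at the time of recording.
gamma = 0.9
Q = np.array([[1.0, 5.0],   # Q-values for state 0
              [2.0, 3.0]])  # Q-values for state 1
s, a, r, s_next, a_next = 0, 0, 1.0, 1, 0  # old policy picked a_next = 0

# SARSA target: uses the stored next action (on-policy at recording time).
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-learning target: uses the greedy action under the CURRENT Q (off-policy).
q_target = r + gamma * Q[s_next].max()
```

Since the current Q now prefers action 1 in state 1, the two targets disagree (2.8 vs 3.7): SARSA is evaluating a choice the current policy would no longer make.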


On-policy methods, like SARSA, expect that the action taken in every state is chosen according to the agent's current policy, which usually tends to exploit rewards.

By doing so, the policy improves as we update it based on the most recent rewards. Here in particular, the update adjusts the parameters of the NN that predicts the value of a given state/action.

But if we update our policy based on stored transitions, as in experience replay, we are actually evaluating actions chosen by a policy that is no longer the current one, since it has evolved over time; the learning is therefore no longer on-policy.
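A minimal replay-buffer sketch makes the mismatch visible (the names here are illustrative, not from the DQN paper): transitions accumulate over the whole run, so a sampled minibatch freely mixes experience generated by many earlier versions of the policy.

```python
import random
from collections import deque

# Fixed-capacity replay memory; old transitions stay until evicted.
buffer = deque(maxlen=10_000)

def store(transition):
    # transition = (s, a, r, s_next, done), generated by whatever
    # policy the agent had at that moment.
    buffer.append(transition)

def sample_batch(batch_size):
    # Uniform sampling: the batch may contain transitions produced
    # by a policy many updates older than the current one.
    return random.sample(buffer, batch_size)

# Simulate 100 steps of interaction, then sample a minibatch.
for t in range(100):
    store((t, t % 2, 1.0, t + 1, False))
batch = sample_batch(32)
```

This is exactly why the update rule must be valid for data from a different (older) behavior policy, i.e. off-policy.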

The Q-values are estimated from the future rewards you would collect from a state by following the current agent policy.

However, that assumption no longer holds once you replay old transitions, since they were generated by a different policy. So the authors use a common off-policy method, Q-learning, whose behavior policy explores with an epsilon-greedy approach.
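Putting the two pieces together, here is a short tabular sketch (sizes and hyperparameters are illustrative) of why Q-learning tolerates this: the behavior policy is epsilon-greedy, but the update's target is always the greedy max over the current Q, regardless of how the logged action was actually chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def epsilon_greedy(s):
    # Behavior policy: explores with probability epsilon.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(Q[s].argmax())

def q_update(s, a, r, s_next):
    # Target policy is greedy (the max), independent of how `a` was
    # chosen, so the update remains valid for transitions produced by
    # any behavior policy, including a stale one from the replay buffer.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# One replayed transition, however it was originally generated:
q_update(0, 0, 1.0, 1)
```

A SARSA update in the same spot would need the a' that the current policy would pick in s', which a stored a' cannot supply.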

Source: Link, Question Author: DarkZero, Answer Author: Gulzar
