I have a question concerning actor-critic methods in reinforcement learning.
In these slides (https://hadovanhasselt.files.wordpress.com/2016/01/pg1.pdf) different types of actor-critics are explained. Advantage actor-critic and TD actor-critic are both mentioned on the last slide.
But when I look at the slide "Estimating the advantage function (2)", it says that the advantage function can be approximated by the TD error. The update rule then includes the TD error in the same way as in TD actor-critic.
So are advantage actor-critic and TD actor-critic actually the same? Or is there a difference I don't see?
The advantage can be approximated by the TD error: since E[δ | S, A] = Q(S, A) − V(S) = A(S, A), the TD error is an unbiased sample of the advantage. This is especially helpful if you want to update θ online, after each transition.
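To make the online case concrete, here is a minimal sketch of a one-step actor-critic that uses the TD error as the advantage estimate. The environment (two states, action 0 always yields reward 1, action 1 yields reward 0), the tabular features, and all step sizes are hypothetical choices for illustration, not from the slides.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy problem: 2 states, 2 actions, tabular parameters.
n_states, n_actions = 2, 2
rng = np.random.default_rng(0)

theta = np.zeros((n_states, n_actions))  # policy parameters (softmax per state)
w = np.zeros(n_states)                   # state-value weights
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.05

def step(s, a):
    # Action 0 leads to state 0 with reward 1; action 1 to state 1 with reward 0.
    return (0, 1.0) if a == 0 else (1, 0.0)

s = 0
for _ in range(2000):
    probs = softmax(theta[s])
    a = rng.choice(n_actions, p=probs)
    s_next, r = step(s, a)

    # TD error used as the advantage estimate: delta ~ A(s, a)
    delta = r + gamma * w[s_next] - w[s]

    w[s] += alpha_w * delta                      # critic update
    grad_log = -probs
    grad_log[a] += 1.0                           # grad of log pi(a|s) for softmax
    theta[s] += alpha_theta * delta * grad_log   # actor update after each transition
    s = s_next
```

Because δ is computed from a single transition, θ can be updated immediately, at the cost of a noisier gradient estimate than a batch advantage.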
For batch approaches, you can estimate Qw(A,S), e.g. by means of fitted Q-iteration, and then derive V(S) from it. With these you have the full advantage function, and the gradient update of the policy can be much more stable, because the estimate is closer to the true advantage function rather than a single-transition sample of it.
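As a small sketch of that batch computation: given Q-values (e.g. from fitted Q-iteration) and the current policy, V(S) is the policy-weighted average of Q, and the advantage follows by subtraction. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical learned Q-values and current policy probabilities, Q[s, a] and pi[s, a].
Q = np.array([[1.0, 0.5],
              [0.2, 0.8]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

V = (pi * Q).sum(axis=1)   # V(s) = sum_a pi(a|s) * Q(s, a)
A = Q - V[:, None]         # advantage: A(s, a) = Q(s, a) - V(s)
```

A useful sanity check is that the expected advantage under the policy is zero in every state, which is exactly why subtracting V(S) acts as a variance-reducing baseline.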