What is the difference between policy-based, on-policy, value-based, off-policy, model-free and model-based?

I’m trying to clear things up for myself; there are a lot of different categorizations within RL. Some people talk about:

  • On-policy & Off-Policy
  • Model-based & Model-free
  • Model-based, Policy-based & Value-based (+ Actor-Critic= Policy-based+Value-based)

It seems like there is some overlap, which led me to the following understanding:



  • Policy-based = On-policy:
    • Deterministic
    • Stochastic
  • Value-based = Off-Policy
  • Actor-Critic = Value-based(Actor) + Policy-based(Critic)

Is this understanding right or are they all completely different categorizations?


Here is a quick summary on the Reinforcement Learning taxonomy:

On-policy vs. Off-Policy

This division is based on whether you update your Q values based on actions taken according to your current policy or not. Let’s say your current policy is a completely random policy. You’re in state s and take an action a that leads you to state s'. Will you update Q(s,a) based on the best possible action you can take in s', or based on the action your current policy (here, a random one) actually picks in s'? The first method is called off-policy and the second on-policy. E.g. Q-learning does the former and SARSA does the latter.
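To make the difference concrete, here is a minimal sketch of the two update targets for a tabular Q function. The function names, the discount `gamma`, the step size `alpha` and the toy table `Q` are all illustrative assumptions, not part of any particular library.

```python
import numpy as np

def q_learning_target(Q, reward, next_state, gamma=0.9):
    # Off-policy: bootstrap from the best action available in s',
    # no matter which action the behaviour policy takes there.
    return reward + gamma * np.max(Q[next_state])

def sarsa_target(Q, reward, next_state, next_action, gamma=0.9):
    # On-policy: bootstrap from the action the current policy actually chose in s'.
    return reward + gamma * Q[next_state, next_action]

def td_update(Q, state, action, target, alpha=0.5):
    # Move Q(s, a) a fraction alpha of the way toward the target.
    Q[state, action] += alpha * (target - Q[state, action])
    return Q

# Hypothetical table with 2 states and 2 actions; only s' = 1 has non-zero values.
Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]

off_policy = q_learning_target(Q, reward=0.0, next_state=1)              # uses max(1.0, 3.0)
on_policy = sarsa_target(Q, reward=0.0, next_state=1, next_action=0)     # uses Q[1, 0]
```

If the random policy happens to pick action 0 in s', SARSA bootstraps from the lower value Q[1, 0], while Q-learning always bootstraps from the maximum. The rest of the update (`td_update`) is identical for both.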

Policy-based vs. Value-based

In policy-based methods we explicitly build a representation of a policy (a mapping π: s → a) and keep it in memory during learning.

In value-based methods we don’t store any explicit policy, only a value function. The policy is implicit here and can be derived directly from the value function (pick the action with the best value).

Actor-critic is a mix of the two: the actor is an explicit policy, and the critic is a learned value function used to evaluate it.
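As a rough illustration of what "explicit" versus "implicit" policy means, the sketch below contrasts a policy read off a Q table with a policy stored as its own parameters. The names `greedy_policy_from_values`, `softmax_policy` and `theta` are hypothetical, and a softmax over per-state action preferences is only one possible parameterization.

```python
import numpy as np

def greedy_policy_from_values(Q):
    # Value-based: nothing but Q is stored; the policy is read off it
    # by taking the highest-valued action in each state.
    return np.argmax(Q, axis=1)

def softmax_policy(theta):
    # Policy-based: theta holds explicit policy parameters
    # (one preference per state-action pair), turned into probabilities.
    z = theta - theta.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

Q = np.array([[0.1, 0.5],
              [2.0, -1.0]])
implicit_policy = greedy_policy_from_values(Q)   # one action index per state

theta = np.zeros((2, 2))                         # untrained preferences
explicit_policy = softmax_policy(theta)          # one probability row per state
```

In an actor-critic setup, something like `theta` would play the role of the actor, and a value table or network would play the role of the critic.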

Model-based vs. Model-free

The problem we’re often dealing with in RL is that whenever you are in state s and take an action a, you might not necessarily know the next state s' that you’ll end up in (the environment influences the agent).

In the model-based approach you either have access to the model (environment), so you know the probability distribution over the states you can end up in, or you first try to build a model (often an approximation) yourself. This can be useful because it allows you to plan (you can “think” several moves ahead without actually performing any actions).
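Planning with a known model can be sketched with value iteration, which is one classic planning method among several. The arrays `P` and `R` and the tiny two-state environment are made up for illustration. The key point is that the loop reads transition probabilities directly and never takes a step in the environment.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=100):
    # P[s, a, s2]: probability of reaching s2 after taking action a in state s.
    # R[s, a]:     expected immediate reward for taking action a in state s.
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # expected return of each (s, a), shape (S, A)
        V = Q.max(axis=1)         # act greedily with respect to the model
    return V, Q.argmax(axis=1)

# Hypothetical model: in state 0, action 0 stays put and action 1 moves to state 1;
# state 1 is absorbing and pays reward 1 on every step.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])

V, policy = value_iteration(P, R)
```

Here the planner works out that moving to state 1 is the better choice in state 0 purely by "thinking" with `P` and `R`.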

In the model-free approach you’re not given a model and you’re not trying to explicitly figure out how it works. You just collect some experience and then derive a (hopefully) optimal policy from it.
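For contrast, a model-free learner only ever sees sampled transitions. The sketch below uses tabular Q-learning on a hypothetical three-state chain. The `step` function, the hyperparameters and the purely random behaviour policy are assumptions chosen to keep the example short. The agent calls `step` as a black box and never inspects how it works.

```python
import numpy as np

def step(state, action):
    # Hypothetical environment: a chain 0 - 1 - 2, where state 2 is terminal.
    # Action 1 moves right, action 0 moves left (bounded at state 0).
    next_state = state + 1 if action == 1 else max(state - 1, 0)
    done = next_state == 2
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((3, 2))
    for _ in range(episodes):
        state = 0
        for _ in range(100):                      # cap on episode length
            action = int(rng.integers(2))         # random behaviour policy
            next_state, reward, done = step(state, action)
            # Learn from the sampled transition only; no model is built or queried.
            target = reward + (0.0 if done else gamma * Q[next_state].max())
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
            if done:
                break
    return Q

Q = q_learning()
greedy = Q.argmax(axis=1)   # policy derived from the collected experience
```

Because Q-learning is off-policy, even completely random exploration is enough here for the greedy policy to learn to move right in states 0 and 1.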

Source: Link, Question Author: Dave Ouds, Answer Author: Tomasz Bartkowiak
