# What is the difference between policy-based, on-policy, value-based, off-policy, model-free and model-based?

I’m trying to clear things out for myself, there are a lot of different categorizations within RL. Some people talk about:

• On-policy & Off-Policy
• Model-based & Model-free
• Model-based, Policy-based & Value-based (+ Actor-Critic= Policy-based+Value-based)

It seems like there is some overlap, which led me to the next understanding:

• Model-based
• Model-free:
  • Policy-based = On-policy:
    • Deterministic
    • Stochastic
  • Value-based = Off-Policy
  • Actor-Critic = Value-based (Actor) + Policy-based (Critic)

Is this understanding right or are they all completely different categorizations?

Here is a quick summary of the Reinforcement Learning taxonomy:

### On-policy vs. Off-Policy

This division is based on whether you update your $Q$ values using actions taken according to your current policy or not. Let’s say your current policy is completely random. You’re in state $s$ and take an action $a$ that leads you to state $s'$. Will you update $Q(s, a)$ based on the best possible action you can take in $s'$, or based on an action chosen by your current policy (a random action)? The first method is called off-policy and the second on-policy. E.g. Q-learning does the first and SARSA does the latter.
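The distinction is easiest to see in the two update rules side by side. A minimal tabular sketch (the state/action counts and hyperparameters below are hypothetical, purely for illustration):

```python
import numpy as np

# Tabular Q-values for a toy problem: 3 states x 2 actions (hypothetical sizes).
Q = np.zeros((3, 2))
alpha, gamma = 0.5, 0.9  # learning rate and discount factor

def q_learning_update(s, a, r, s_next):
    # Off-policy (Q-learning): bootstrap from the BEST action in s',
    # regardless of which action the behaviour policy actually takes there.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy (SARSA): bootstrap from the action a' that the current
    # policy actually chose in s'.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```

The only difference is the bootstrapping target: `max` over next actions (off-policy) versus the next action the policy really took (on-policy).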

### Policy-based vs. Value-based

In Policy-based methods we explicitly build a representation of a policy (a mapping $\pi: s \to a$) and keep it in memory during learning.

In Value-based methods we don’t store any explicit policy, only a value function. The policy here is implicit and can be derived directly from the value function (pick the action with the best value).

Actor-Critic is a mix of the two: the actor maintains an explicit policy (policy-based), while the critic estimates a value function (value-based) used to evaluate the actor’s actions.

### Model-based vs. Model-free

The problem we’re often dealing with in RL is that whenever you are in state $s$ and take an action $a$ you might not necessarily know the next state $s'$ that you’ll end up in (the environment influences the agent).

In the Model-based approach you either have access to the model (environment), so you know the probability distribution over the states you end up in, or you first try to build a model (often an approximation) yourself. This can be useful because it allows you to do planning (you can “think” about moves ahead without actually performing any actions).
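Planning with a known model can be sketched in a few lines. Below is value iteration on a tiny two-state MDP (all transition probabilities and rewards are invented for illustration): because we hold the model `P`, `R`, we can sweep over imagined transitions without ever acting in the real environment.

```python
import numpy as np

# A tiny known model (hypothetical numbers):
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])  # 2 states x 2 actions x 2 next states
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Planning: repeatedly apply the Bellman optimality backup using the
# model alone -- no real actions are ever taken.
V = np.zeros(2)
for _ in range(200):
    V = np.max(R + gamma * (P @ V), axis=1)
```

After convergence, `V` satisfies the Bellman optimality equation for this model; the greedy policy with respect to `V` is then the plan.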

In Model-free you’re not given a model and you don’t try to figure out explicitly how it works. You just collect some experience and then derive a (hopefully) optimal policy from it.