Reinforcement Learning: An Introduction.
Second edition, in progress. Richard S. Sutton and Andrew G. Barto, (c) 2012, pp. 67-68.
Solving a reinforcement learning task means, roughly, finding a policy
that achieves a lot of reward over the long run. For finite MDPs, we
can precisely define an optimal policy in the following way. Value
functions define a partial ordering over policies. A policy π is
defined to be better than or equal to a policy π′ if its expected
return is greater than or equal to that of π′, for all states. In other
words, π≥π′ if and only if vπ(s)≥vπ′(s), for all s∈S. There is
always at least one policy that is better than or equal to all other
policies. This is an optimal policy.
Why is there always at least one policy that is better than or equal to all other policies?
Just past the quoted part, the same paragraph actually tells you what this policy is: it is the one that takes a best action in every state. This works because of the Markov property: the reward and transition at a state depend only on that state and the action taken there, not on which actions the policy chooses elsewhere. So we can maximize the policy state by state: acting greedily with respect to the optimal value function v*(s) = maxπ vπ(s) in every state yields a single policy that attains v*(s) everywhere at once, and since v*(s) ≥ vπ(s) for every policy π and every state s by definition, this greedy policy is better than or equal to all others.
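The claim can be checked directly on a small example. The sketch below builds a tiny hypothetical 2-state, 2-action MDP (the transition matrices and rewards are made up for illustration), solves the Bellman equation v = R_π + γ P_π v exactly for each of the four deterministic policies, and then verifies that at least one policy's value function dominates every other's in every state simultaneously:

```python
import itertools
import numpy as np

# Hypothetical MDP for illustration (not from the book):
# P[a][s, s'] = probability of moving s -> s' under action a,
# R[a][s]     = expected immediate reward for taking a in s.
gamma = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 2.0])}
n_states = 2

def policy_value(pi):
    """Solve v = R_pi + gamma * P_pi v exactly for a deterministic policy pi."""
    P_pi = np.array([P[pi[s]][s] for s in range(n_states)])
    R_pi = np.array([R[pi[s]][s] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Enumerate all deterministic policies (action per state) and their values.
policies = list(itertools.product([0, 1], repeat=n_states))
values = {pi: policy_value(pi) for pi in policies}

# A policy is optimal if its value is >= every other policy's value
# in EVERY state (the partial ordering from the quoted passage).
optimal = [pi for pi in policies
           if all(np.all(values[pi] >= values[q] - 1e-9) for q in policies)]

print("optimal policies:", optimal)
```

If the partial ordering could fail to have a top element, `optimal` could come back empty; for a finite MDP it never does, which is exactly the point of the quoted paragraph.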