Policy evaluation and policy improvement

To find the optimal policy, you first need to find the optimal value function. An iterative procedure that does this is called policy evaluation: it creates a sequence, $V_0, V_1, V_2, \ldots$, that iteratively improves the value estimate for a policy, $\pi$, using the state-transition probabilities of the model, the expected value of the next state, and the immediate reward. That is, it creates a sequence of improving value functions using the Bellman equation:

$$V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \bigl[ r(s, a, s') + \gamma V_k(s') \bigr] \tag{8}$$
This sequence converges to the value function of the policy, $V_\pi$, as $k \to \infty$. Figure 3.3 shows the update of $V(s)$ using the successive state values:

Figure 3.3. The update of $V(s)$ using formula (8)

The value function in (8) can be updated only if the state-transition function, p, and the reward function, r, are known for every state and action, that is, only if the model of the environment is completely known.

Note that the outer summation over actions in (8) is needed for stochastic policies, because the policy assigns a probability to each action. For simplicity, from now on we'll consider only deterministic policies.
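As a minimal sketch of iterative policy evaluation (not the book's own code), the update (8) for a deterministic policy can be applied in a loop until the value estimates stop changing. Here the model is assumed to be stored as NumPy arrays: P with the transition probabilities p(s' | s, a) and R with the expected immediate rewards r(s, a); the names policy_evaluation, theta, and the array layout are illustrative assumptions.

```python
import numpy as np

def policy_evaluation(policy, P, R, gamma=0.99, theta=1e-6):
    """Iterative policy evaluation for a deterministic policy (update (8)).

    policy : (n_states,) int array, action chosen in each state.
    P      : (n_states, n_actions, n_states) array, p(s' | s, a).
    R      : (n_states, n_actions) array, expected immediate reward r(s, a).
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)              # V_0: arbitrary initial estimate
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]
            # Bellman expectation update: immediate reward plus the
            # discounted expected value of the successor states.
            v_new = R[s, a] + gamma * P[s, a] @ V
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:               # last sweep changed V by less than theta
            return V
```

Each pass over the states is one application of (8); stopping when the largest change falls below a small threshold approximates the limit $k \to \infty$.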

Once the value function has been improved, it can be used to find a better policy. This procedure is called policy improvement, and it consists of finding a new policy, $\pi'$, as follows:

$$\pi'(s) = \operatorname*{arg\,max}_{a} \sum_{s'} p(s' \mid s, a) \bigl[ r(s, a, s') + \gamma V_\pi(s') \bigr]$$
It creates a new policy, $\pi'$, from the value function, $V_\pi$, of the original policy, $\pi$. As can be formally demonstrated, the new policy, $\pi'$, is always at least as good as $\pi$, and $\pi'$ is optimal if and only if $\pi$ is optimal. The combination of policy evaluation and policy improvement gives rise to two algorithms for computing the optimal policy. One is called policy iteration and the other is called value iteration. Both use policy evaluation to monotonically improve the value function and policy improvement to estimate the new policy. The only difference is that policy iteration executes the two phases cyclically, while value iteration combines them in a single update.
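A rough sketch of greedy policy improvement, under the same assumed NumPy model arrays P and R as in the previous snippet, together with a minimal policy iteration loop that alternates the two phases (the function names and array layout are illustrative, not the book's code):

```python
import numpy as np

def policy_improvement(V, P, R, gamma=0.99):
    """Greedy policy improvement: one-step lookahead on V."""
    # Action values Q(s, a) = r(s, a) + gamma * sum_s' p(s' | s, a) * V(s')
    Q = R + gamma * np.einsum('san,n->sa', P, V)
    return np.argmax(Q, axis=1)          # deterministic improved policy

def policy_iteration(P, R, gamma=0.99):
    """Alternate policy evaluation and policy improvement until the
    policy stops changing, reusing policy_evaluation from the sketch above."""
    policy = np.zeros(P.shape[0], dtype=int)
    while True:
        V = policy_evaluation(policy, P, R, gamma)
        new_policy = policy_improvement(V, P, R, gamma)
        if np.array_equal(new_policy, policy):
            return policy, V             # policy is stable, hence optimal
        policy = new_policy
```

When the improvement step returns the same policy it received, no action can do better than the current one in any state, which is exactly the optimality condition mentioned above.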
