Policy iteration cycles between policy evaluation, which updates the value function V^π under the current policy π using formula (8), and policy improvement, which computes an improved policy π′ from that value function using formula (9). Eventually, after finitely many cycles, the algorithm converges to an optimal policy π*.
The pseudocode is as follows:
Initialize V(s) and π(s) arbitrarily for every state s
while π is not stable:
    > policy evaluation
    while V is not stable:
        for each state s:
            update V(s) under the current policy π using formula (8)
    > policy improvement
    for each state s:
        set π(s) greedily with respect to V using formula (9)
After an initialization phase, the outer loop alternates between policy evaluation and policy improvement until a stable policy is found. On each of these iterations, policy evaluation evaluates the policy produced by the preceding policy improvement step, which in turn uses the value function estimated during evaluation.
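The loop structure above can be sketched concretely. The following is a minimal NumPy sketch, not the text's reference implementation: it assumes a finite MDP given as a transition tensor P[s, a, s′] and an expected-reward table R[s, a] (names chosen here for illustration), with the in-place Bellman updates standing in for formulas (8) and (9).

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular policy iteration.

    P: transitions, shape (S, A, S); P[s, a, s2] = Pr(s2 | s, a)
    R: expected rewards, shape (S, A)
    Returns the value function V and a deterministic policy pi.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)  # start from an arbitrary policy
    while True:
        # Policy evaluation: sweep states until V stops changing (formula (8))
        while True:
            delta = 0.0
            for s in range(n_states):
                v = R[s, pi[s]] + gamma * P[s, pi[s]] @ V
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: act greedily w.r.t. the evaluated V (formula (9))
        stable = True
        for s in range(n_states):
            q = R[s] + gamma * P[s] @ V   # action values for state s
            best = int(np.argmax(q))
            if best != pi[s]:
                stable = False
            pi[s] = best
        if stable:                         # outer loop: policy is stable
            return V, pi

# Toy two-state MDP (illustrative): action 0 stays put, action 1 switches
# states; staying in state 1 yields reward 1, everything else yields 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0   # stay
P[0, 1, 1] = P[1, 1, 0] = 1.0   # switch
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
V, pi = policy_iteration(P, R, gamma=0.9)
```

For this toy MDP the greedy policy is to switch out of state 0 and stay in state 1, so pi = [1, 0], with V ≈ [9, 10] at γ = 0.9 (V(1) = 1/(1 − γ) = 10 and V(0) = γ·V(1) = 9).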