Policy iteration

Policy iteration cycles between policy evaluation, which updates $V_\pi$ under the current policy, $\pi$, using formula (8), and policy improvement (9), which computes $\pi'$ using the improved value function, $V_\pi$. Eventually, after $n$ cycles, the algorithm converges to an optimal policy, $\pi^*$.

The pseudocode is as follows:

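Initialize $V_\pi(s)$ and $\pi(s)$ for every state $s$

while $\pi$ is not stable:

    > policy evaluation
    while $V_\pi$ is not stable:
        for each state s:
            $V_\pi(s) = \sum_{s', r} p(s', r \mid s, \pi(s)) \, [r + \gamma V_\pi(s')]$

    > policy improvement
    for each state s:
        $\pi(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \, [r + \gamma V_\pi(s')]$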
After an initialization phase, the outer loop alternates between policy evaluation and policy improvement until a stable policy is found. On each iteration, policy evaluation estimates the value function of the policy produced by the preceding policy improvement step, and policy improvement then uses that estimated value function to select a better policy.
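To make the two nested loops concrete, here is a minimal Python sketch of the procedure on a toy problem. The environment is an assumption made purely for illustration: a four-state corridor in which moving right from state 2 reaches a terminal state and pays a reward of 1. The names `build_model`, `action_value`, `GAMMA`, and the `(probability, next_state, reward, done)` tuple layout are likewise hypothetical choices, not something the text prescribes.

```python
# Toy MDP (an assumption for illustration): a corridor with states 0..3.
# Actions: 0 = left, 1 = right. Entering state 3 pays reward 1 and ends the episode.
N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9


def build_model():
    """Return P[s][a] = list of (probability, next_state, reward, done)."""
    P = {}
    for s in range(N_STATES):
        P[s] = {}
        for a in range(N_ACTIONS):
            if s == N_STATES - 1:  # terminal state: absorbing, no reward
                P[s][a] = [(1.0, s, 0.0, True)]
                continue
            nxt = max(s - 1, 0) if a == 0 else s + 1
            done = nxt == N_STATES - 1
            P[s][a] = [(1.0, nxt, 1.0 if done else 0.0, done)]
    return P


def action_value(P, V, s, a):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(p * (r + GAMMA * (0.0 if done else V[s2]))
               for p, s2, r, done in P[s][a])


def policy_evaluation(P, V, policy, eps=1e-8):
    """Sweep over all states until V stops changing under the fixed policy."""
    while True:
        delta = 0.0
        for s in range(N_STATES):
            old = V[s]
            V[s] = action_value(P, V, s, policy[s])
            delta = max(delta, abs(old - V[s]))
        if delta < eps:
            return V


def policy_improvement(P, V, policy):
    """Make the policy greedy with respect to V; return True if nothing changed."""
    stable = True
    for s in range(N_STATES):
        old = policy[s]
        policy[s] = max(range(N_ACTIONS), key=lambda a: action_value(P, V, s, a))
        if policy[s] != old:
            stable = False
    return stable


def policy_iteration(P):
    """Alternate evaluation and improvement until the policy is stable."""
    V = [0.0] * N_STATES       # initial value estimate
    policy = [0] * N_STATES    # initial policy: always move left
    cycles = 0
    while True:
        policy_evaluation(P, V, policy)
        cycles += 1
        if policy_improvement(P, V, policy):
            return V, policy, cycles


V, policy, cycles = policy_iteration(build_model())
print(policy, [round(v, 3) for v in V], cycles)
```

In this toy setting the stable policy moves right in every non-terminal state, and the values grow by a factor of `GAMMA` per step closer to the goal. The threshold `eps` stands in for "the value function is stable" in the pseudocode; a larger value trades accuracy of the evaluation for fewer sweeps.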
