Value iteration applied to FrozenLake

We can now apply value iteration to the FrozenLake game in order to compare the two DP algorithms and to see whether they converge to the same policy and value function.

Let's define eval_state_action as before to estimate the value of a state-action pair:

def eval_state_action(V, s, a, gamma=0.99):
    return np.sum([p * (rew + gamma*V[next_s]) for p, next_s, rew, _ in env.P[s][a]])
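The list comprehension unpacks env.P[s][a], which in Gym's FrozenLake is the transition model: a list of (probability, next_state, reward, done) tuples for the given state-action pair. If you want to inspect this structure on its own, here is a minimal sketch (the environment is created again later in this section):

import gym

# create and unwrap FrozenLake to expose the transition model env.P
env = gym.make('FrozenLake-v0')
env = env.unwrapped

# env.P[s][a] -> list of (probability, next_state, reward, done) tuples
print(env.P[0][0])
# with the default slippery dynamics, taking action 0 (LEFT) in state 0
# yields three equally likely outcomes, for example:
# [(0.333.., 0, 0.0, False), (0.333.., 0, 0.0, False), (0.333.., 4, 0.0, False)]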

Then, we create the main body of the value iteration algorithm:

def value_iteration(eps=0.0001):
    V = np.zeros(nS)
    it = 0
    while True:
        delta = 0
        # update the value for each state
        for s in range(nS):
            old_v = V[s]
            V[s] = np.max([eval_state_action(V, s, a) for a in range(nA)]) # equation 3.10
            delta = max(delta, np.abs(old_v - V[s]))
        # if stable, break the cycle
        if delta < eps:
            break
        else:
            print('Iter:', it, ' delta:', np.round(delta, 5))
            it += 1
    return V

The algorithm loops until the value function stabilizes (as determined by the threshold eps), and on each iteration it updates the value of every state using equation (3.10).
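The update applied to each state is the Bellman optimality backup; written in standard notation, with p(s' | s, a) the transition probabilities, r the reward, and gamma the discount factor, it reads:

V(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V(s') \right]

This is exactly what the combination of eval_state_action and np.max computes in the loop above.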

As with policy iteration, run_episodes runs a few games to test the policy. The only difference is that here the policy is derived on the fly while run_episodes is executed, whereas for policy iteration we had already defined the action for every state beforehand:

def run_episodes(env, V, num_games=100):
    tot_rew = 0
    state = env.reset()

    for _ in range(num_games):
        done = False

        while not done:
            # choose the best action using the value function
            action = np.argmax([eval_state_action(V, state, a) for a in range(nA)]) #(11)
            next_state, reward, done, _ = env.step(action)
            state = next_state
            tot_rew += reward
            if done:
                state = env.reset()

    print('Won %i of %i games!' % (tot_rew, num_games))

Finally, we can create the environment, unwrap it, run the value iteration, and execute some test games:

env = gym.make('FrozenLake-v0')
env = env.unwrapped

nA = env.action_space.n
nS = env.observation_space.n

V = value_iteration(eps=0.0001)
run_episodes(env, V, 100)
print(V.reshape((4,4)))

The output will be similar to the following:

Iter: 0 delta: 0.33333
Iter: 1 delta: 0.1463
Iter: 2 delta: 0.10854
...
Iter: 128 delta: 0.00011
Iter: 129 delta: 0.00011
Iter: 130 delta: 0.0001
Won 86 of 100 games!
[[0.54083394 0.49722378 0.46884941 0.45487071]
 [0.55739213 0.         0.35755091 0.        ]
 [0.5909355  0.64245898 0.61466487 0.        ]
 [0.         0.74129273 0.86262154 0.        ]]

The value iteration algorithm converges after 130 iterations. The resulting value function and policy are the same as those obtained with the policy iteration algorithm.
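If you want to check this explicitly, you can extract the greedy policy from the converged value function. A minimal sketch, assuming nS, nA, and eval_state_action are defined as above (extract_policy is not part of the original code):

def extract_policy(V):
    # greedy policy: for each state, pick the action with the highest expected value
    policy = np.zeros(nS, dtype=int)
    for s in range(nS):
        policy[s] = np.argmax([eval_state_action(V, s, a) for a in range(nA)])
    return policy

For example, print(extract_policy(V).reshape((4,4))) shows the chosen action for each cell of the grid, which can be compared directly with the policy returned by policy iteration.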
