Value iteration applied to FrozenLake

We can now apply value iteration to the FrozenLake game in order to compare the two DP algorithms and to see whether they converge to the same policy and value function.

Let's define eval_state_action as before to estimate the value of a state-action pair:

def eval_state_action(V, s, a, gamma=0.99):
    return np.sum([p * (rew + gamma*V[next_s]) for p, next_s, rew, _ in env.P[s][a]])
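The list comprehension unpacks env.P[s][a], which in Gym's FrozenLake is the transition model: a list of (probability, next_state, reward, done) tuples for the given state-action pair. If you want to inspect this structure on its own, here is a minimal sketch (the environment is created again later in this section):

import gym

# create and unwrap FrozenLake to expose the transition model env.P
env = gym.make('FrozenLake-v0')
env = env.unwrapped

# env.P[s][a] -> list of (probability, next_state, reward, done) tuples
print(env.P[0][0])
# with the default slippery dynamics, taking action 0 (LEFT) in state 0
# yields three equally likely outcomes, for example:
# [(0.333.., 0, 0.0, False), (0.333.., 0, 0.0, False), (0.333.., 4, 0.0, False)]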

Then, we create the main body of the value iteration algorithm:

def value_iteration(eps=0.0001):
    V = np.zeros(nS)
    it = 0
    while True:
        delta = 0
        # update the value for each state
        for s in range(nS):
            old_v = V[s]
            V[s] = np.max([eval_state_action(V, s, a) for a in range(nA)]) # equation 3.10
            delta = max(delta, np.abs(old_v - V[s]))
        # if stable, break the cycle
        if delta < eps:
            break
        else:
            print('Iter:', it, ' delta:', np.round(delta, 5))
            it += 1
    return V

The algorithm loops until the value function stabilizes (as determined by the threshold eps), and on each iteration it updates the value of every state using equation (3.10).
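The update applied to each state is the Bellman optimality backup; written in standard notation, with p(s' | s, a) the transition probabilities, r the reward, and gamma the discount factor, it reads:

V(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V(s') \right]

This is exactly what the combination of eval_state_action and np.max computes in the loop above.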

As with policy iteration, run_episodes runs a few games to test the policy. The only difference is that here the policy is derived on the fly while run_episodes is executed, whereas for policy iteration we had already defined the action for every state beforehand:

def run_episodes(env, V, num_games=100):
    tot_rew = 0
    state = env.reset()

    for _ in range(num_games):
        done = False

        while not done:
            # choose the best action using the value function
            action = np.argmax([eval_state_action(V, state, a) for a in range(nA)]) #(11)
            next_state, reward, done, _ = env.step(action)
            state = next_state
            tot_rew += reward
            if done:
                state = env.reset()

    print('Won %i of %i games!' % (tot_rew, num_games))

Finally, we can create the environment, unwrap it, run the value iteration, and execute some test games:

env = gym.make('FrozenLake-v0')
env = env.unwrapped

nA = env.action_space.n
nS = env.observation_space.n

V = value_iteration(eps=0.0001)
run_episodes(env, V, 100)
print(V.reshape((4,4)))

The output will be similar to the following:

Iter: 0 delta: 0.33333
Iter: 1 delta: 0.1463
Iter: 2 delta: 0.10854
...
Iter: 128 delta: 0.00011
Iter: 129 delta: 0.00011
Iter: 130 delta: 0.0001
Won 86 of 100 games!
[[0.54083394 0.49722378 0.46884941 0.45487071]
 [0.55739213 0.         0.35755091 0.        ]
 [0.5909355  0.64245898 0.61466487 0.        ]
 [0.         0.74129273 0.86262154 0.        ]]

The value iteration algorithm converges after 130 iterations. The resulting value function and policy are the same as those obtained with the policy iteration algorithm.
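If you want to check this explicitly, you can extract the greedy policy from the converged value function. A minimal sketch, assuming nS, nA, and eval_state_action are defined as above (extract_policy is not part of the original code):

def extract_policy(V):
    # greedy policy: for each state, pick the action with the highest expected value
    policy = np.zeros(nS, dtype=int)
    for s in range(nS):
        policy[s] = np.argmax([eval_state_action(V, s, a) for a in range(nA)])
    return policy

For example, print(extract_policy(V).reshape((4,4))) shows the chosen action for each cell of the grid, which can be compared directly with the policy returned by policy iteration.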
