Policy iteration applied to FrozenLake

To consolidate the ideas behind policy iteration, we'll apply it to a game called FrozenLake. Here, the environment consists of a 4 x 4 grid. Using four actions that correspond to the directions (0 is left, 1 is down, 2 is right, and 3 is up), the agent has to move to the opposite side of the grid without falling into the holes. Moreover, movement is uncertain: the ice is slippery, so the agent may end up moving in a direction other than the one it intended. Because of this, it can sometimes be safer not to move directly toward the goal. A reward of +1 is assigned when the goal is reached. The map of the game is shown in figure 3.4. S is the start position, the star is the end position, and the spirals are the holes:

Figure 3.4 Map of the FrozenLake game
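
If you want to check the layout directly in text form, the environment can print the same map (a quick sketch, assuming the FrozenLake-v0 Gym environment that we create properly in the next snippet):

import gym

env = gym.make('FrozenLake-v0')
env.render()   # prints the 4x4 text map: S = start, F = frozen, H = hole, G = goal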

With all the tools needed, let's see how to solve it.

All the code explained in this chapter is available in the GitHub repository of this book, at the following link: https://github.com/PacktPublishing/Reinforcement-Learning-Algorithms-with-Python

First, we have to create the environment, initializing the value function and the policy:

import gym
import numpy as np

env = gym.make('FrozenLake-v0')
env = env.unwrapped
nA = env.action_space.n
nS = env.observation_space.n

# value function and policy, both initialized to zero for every state
V = np.zeros(nS)
policy = np.zeros(nS)

Then, we have to create the main cycle that does one step of policy evaluation and one step of policy improvement. This cycle finishes whenever the policy is stable. To do this, use the following code:

policy_stable = False
it = 0
while not policy_stable:
    policy_evaluation(V, policy)
    policy_stable = policy_improvement(V, policy)
    it += 1

In the end, we can print the number of iterations completed, the value function, the policy, and the score obtained by running some test games:

print('Converged after %i policy iterations'%(it))
run_episodes(env, V, policy)
print(V.reshape((4,4)))
print(policy.reshape((4,4)))

Now, before defining policy_evaluation, we can create a function to evaluate the expected action-value that will also be used in policy_improvement:

def eval_state_action(V, s, a, gamma=0.99):
    return np.sum([p * (rew + gamma*V[next_s]) for p, next_s, rew, _ in env.P[s][a]])

Here, env.P is a dictionary that contains all the information about the dynamics of the environment.
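
For example, you can peek at the transition model for a single state-action pair. Each entry of env.P[s][a] is a (probability, next_state, reward, done) tuple; this quick inspection is not part of the algorithm, just a way to see what eval_state_action iterates over:

print(env.P[0][0])
# each element is a (probability, next_state, reward, done) tuple;
# with the default slippery dynamics, you should see three possible
# outcomes, each with probability 1/3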

gamma is the discount factor, with 0.99 being a standard value for simple and medium-difficulty problems. The higher it is, the harder it becomes for the agent to predict the value of a state, because it has to look further into the future.
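
To get a feel for what the discount factor controls, the following quick calculation (a small illustration, not part of the book's code) shows how much a +1 reward received k steps in the future is worth today for two values of gamma:

# present value of a +1 reward arriving k steps in the future
for gamma in (0.9, 0.99):
    print(gamma, [round(gamma**k, 3) for k in (1, 10, 50)])
# with gamma=0.9, a reward 50 steps away is worth almost nothing;
# with gamma=0.99, it still carries significant weight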

Next, we can define the policy_evaluation function. policy_evaluation has to calculate formula (8) under the current policy for every state, sweeping repeatedly until the values stabilize. Because the policy is deterministic, we only evaluate one action per state:

def policy_evaluation(V, policy, eps=0.0001):
    while True:
        delta = 0
        for s in range(nS):
            old_v = V[s]
            V[s] = eval_state_action(V, s, policy[s])
            delta = max(delta, np.abs(old_v - V[s]))
        if delta < eps:
            break

We consider the value function stable whenever delta is lower than the threshold, eps. When this condition is met, the while loop stops.

policy_improvement takes the value function and the policy, iterates across all the states, and updates the policy greedily with respect to the new value function:

def policy_improvement(V, policy):
    policy_stable = True
    for s in range(nS):
        old_a = policy[s]
        policy[s] = np.argmax([eval_state_action(V, s, a) for a in range(nA)])
        if old_a != policy[s]:
            policy_stable = False
    return policy_stable

policy_improvement(V, policy) returns False as long as the policy keeps changing, because that means the policy isn't stable yet. Once a full sweep leaves the policy unchanged, it returns True and the main loop terminates.

The final snippet of code runs some games to test the new policy and prints the number of games won:

def run_episodes(env, V, policy, num_games=100):
    tot_rew = 0
    state = env.reset()
    for _ in range(num_games):
        done = False
        while not done:
            # take the greedy action prescribed by the policy
            next_state, reward, done, _ = env.step(policy[state])
            state = next_state
            tot_rew += reward
            if done:
                state = env.reset()
    print('Won %i of %i games!'%(tot_rew, num_games))

That's it. 

It converges in about 7 iterations and wins approximately 85% of games: 

Figure 3.5 Results of the FrozenLake game. The optimal policy is on the left and the optimal state values are on the right

The policy resulting from the code is shown on the left of figure 3.5. You can see that it sometimes points in seemingly strange directions, but only because it accounts for the slippery dynamics of the environment. On the right of figure 3.5, the final state values are shown.
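
If you would like a quick textual view of the learned policy in addition to the figure, you can map the action indices to arrows. This small helper is not part of the book's code, just a convenience sketch that assumes the policy array defined earlier:

# print the policy as a grid of arrows: 0=left, 1=down, 2=right, 3=up
arrows = {0: '<', 1: 'v', 2: '>', 3: '^'}
print('\n'.join(' '.join(arrows[int(a)] for a in row) for row in policy.reshape((4, 4))))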
