Applying SARSA to Taxi-v2

After a more theoretical view of TD learning, and of SARSA in particular, we are finally able to implement SARSA to solve problems of interest. As we saw previously, SARSA can be applied to environments with unknown models and dynamics, but as it is a tabular algorithm, its scalability constraints restrict it to environments with small, discrete action and state spaces. So we will apply SARSA to a gym environment called Taxi-v2, which satisfies all these requirements and is a good test bed for this kind of algorithm.

Taxi-v2 is a game that was introduced to study hierarchical reinforcement learning (a type of RL algorithm that creates a hierarchy of policies, each with the goal of solving a subtask). The aim is to pick up a passenger and drop them off at a precise location. A reward of +20 is earned when the taxi performs a successful drop-off, while a penalty of -10 is incurred for an illegal pickup or drop-off. Moreover, a point is lost for every timestep. The render of the game is given in figure 4.4. There are six legal actions, corresponding to the four directions plus the pickup and drop-off actions. In figure 4.4, the : symbol represents an empty location; the | symbol represents a wall that the taxi can't travel through; and R, G, Y, and B are the four locations. The taxi, the yellow rectangle in the diagram, has to pick up a passenger at the location identified by the light blue color and drop them off at the location identified by the violet color.

Figure 4.4 Start state of the Taxi-v2 environment
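If you want to take a quick look at the environment yourself, a minimal sketch along these lines should work (assuming you have gym installed in a version that still ships Taxi-v2):

import gym

# Create the Taxi-v2 environment and print its start state
env = gym.make('Taxi-v2')
env.reset()
env.render()                    # prints the board shown in figure 4.4

# The state and action spaces are small and discrete,
# which is what makes a tabular algorithm such as SARSA viable
print(env.observation_space.n)  # 500 states
print(env.action_space.n)       # 6 actions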

The implementation is fairly straightforward and follows the pseudocode given in the previous section. Though we explain and show all the code here, it is also available on the GitHub repository of the book.

Let's first implement the main function, SARSA(..), of the SARSA algorithm, which does most of the work. After this, we'll implement a couple of auxiliary functions that perform simple but essential tasks.

 SARSA needs an environment and a few other hyperparameters as arguments to work:

  • A learning rate, lr, previously called α, that controls the amount of learning at each update.
  • num_episodes, which is the number of episodes that SARSA will execute before terminating.
  • eps, the initial value of ε, the randomness of the ε-greedy policy.
  • gamma, the discount factor, used to give less importance to rewards that are further in the future.
  • eps_decay, the linear decrement applied to eps on each episode (a small numeric sketch of this schedule follows the list).
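To make the schedule concrete, here is a tiny illustration (not part of the book's code) of how eps evolves with the default values; the 0.01 floor that appears here is introduced in the main loop later on:

# Illustration only: linear eps decay with the default hyperparameters
eps, eps_decay = 0.3, 0.00005
for ep in range(10000):
    if eps > 0.01:
        eps -= eps_decay
print(eps)  # roughly 0.01: the decrement stops once eps reaches the floor, after about 5,800 episodes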

The first lines of code are as follows:

import gym
import numpy as np

def SARSA(env, lr=0.01, num_episodes=10000, eps=0.3, gamma=0.95, eps_decay=0.00005):
    nA = env.action_space.n
    nS = env.observation_space.n
    test_rewards = []
    Q = np.zeros((nS, nA))
    games_reward = []

Here, some variables are initialized. nA and nS are the number of actions and the number of states of the environment respectively, Q is the matrix that will contain the Q-values of each state-action pair, and test_rewards and games_reward are lists used later to hold information about the scores of the games.

Next, we can implement the main loop that learns the Q-values:

    for ep in range(num_episodes):
        state = env.reset()
        done = False
        tot_rew = 0

        if eps > 0.01:
            eps -= eps_decay

        action = eps_greedy(Q, state, eps)

Line 2 in the preceding code block resets the environment on each new episode and stores the current state of the environment. Line 3 initializes a Boolean variable that will be set to True when the environment reaches a terminal state. The following two lines decrease the eps variable as long as its value is higher than 0.01. We set this threshold to keep, in the long run, a minimum rate of exploration of the environment. The last line chooses an ε-greedy action based on the current state and the Q-matrix. We'll define this function later.

Now that we have taken care of the initialization needed at the start of each episode and have chosen the first action, we can loop until the episode (the game) ends. The following piece of code samples from the environment and updates the Q-function as per formula (4.5), that is, Q(s,a) ← Q(s,a) + α[r + γQ(s',a') − Q(s,a)]:

        while not done:
            next_state, rew, done, _ = env.step(action)  # Take one step in the environment

            next_action = eps_greedy(Q, next_state, eps)
            Q[state][action] = Q[state][action] + lr*(rew + gamma*Q[next_state][next_action] - Q[state][action])  # (4.5)
            state = next_state
            action = next_action
            tot_rew += rew

            if done:
                games_reward.append(tot_rew)

done holds a Boolean value that indicates whether the episode has terminated; it is returned by env.step in line 2. Therefore, looping over a complete episode is the same as iterating as long as done is False (the first line of the code). As usual, env.step returns the next state, the reward, the done flag, and an info dictionary. In the next line, eps_greedy chooses the next action based on next_state and the Q-values. The heart of the SARSA algorithm is contained in the subsequent line, which performs the update as per formula (4.5). Besides the learning rate and the gamma coefficient, it uses the reward obtained in the last step and the values held in the Q array.

The final lines set state and action to their next values (ready for the following iteration) and add the reward to the total reward of the game. If the environment is in a terminal state, the sum of the rewards is appended to games_reward and the inner loop terminates.

In the last lines of the SARSA function, every 300 episodes, we run 1,000 test games and print information such as the episode number, the eps value, and the mean of the test rewards. Moreover, we return the Q array:

        if (ep % 300) == 0:
            test_rew = run_episodes(env, Q, 1000)
            print("Episode:{:5d} Eps:{:2.4f} Rew:{:2.4f}".format(ep, eps, test_rew))
            test_rewards.append(test_rew)

    return Q

We can now implement the eps_greedy function, which chooses a random action among the allowed ones with probability eps. To do this, it samples a uniform number between 0 and 1, and if this is smaller than eps, it selects a random action. Otherwise, it selects the greedy action:

def eps_greedy(Q, s, eps=0.1):
    if np.random.uniform(0, 1) < eps:
        # Choose a random action
        return np.random.randint(Q.shape[1])
    else:
        # Choose the greedy action
        return greedy(Q, s)

The greedy policy is implemented by returning the index that corresponds to the maximum Q value in state s:

def greedy(Q, s):
    return np.argmax(Q[s])
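One detail worth noting: np.argmax always returns the first index when several actions share the maximum Q-value, which is the case at the very beginning, when Q is all zeros. If you prefer, you can break ties at random with a small variation like the following (our own sketch, not part of the original code):

def greedy_random_ties(Q, s):
    # Pick uniformly among the actions with the highest Q-value
    best = np.flatnonzero(Q[s] == Q[s].max())
    return np.random.choice(best)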

The last function to implement is run_episodes, which runs a few episodes to test the policy. The policy used to select the actions is the greedy policy. That's because we don't want to explore while testing. Overall, the function is almost identical to the one implemented in the previous chapter for the dynamic programming algorithms:

def run_episodes(env, Q, num_episodes=100, to_print=False):
    tot_rew = []
    state = env.reset()

    for _ in range(num_episodes):
        done = False
        game_rew = 0

        while not done:
            next_state, rew, done, _ = env.step(greedy(Q, state))
            state = next_state
            game_rew += rew

            if done:
                state = env.reset()
                tot_rew.append(game_rew)

    if to_print:
        print('Mean score: %.3f of %i games!' % (np.mean(tot_rew), num_episodes))

    return np.mean(tot_rew)

Great! Now we're almost done. The last part only involves creating and resetting the environment, and calling the SARSA function with the environment and all the hyperparameters:

if __name__ == '__main__':
    env = gym.make('Taxi-v2')
    env.reset()
    Q = SARSA(env, lr=.1, num_episodes=5000, eps=0.4, gamma=0.95, eps_decay=0.001)

As you can see, we start with an eps of 0.4. This means that the first actions are random with a probability of 0.4 and, because of the decay, eps decreases until it reaches the minimum value of 0.01 (that is, the threshold we set in the code):

Figure 4.5 Results of the SARSA algorithm on Taxi-v2

The performance plot of the test games' cumulative rewards is shown in figure 4.5. Moreover, figure 4.6 shows a complete episode run with the final policy. It has to be read from left to right and from top to bottom. We can see that the taxi (highlighted in yellow first, and green later) has driven along an optimal path in both directions.

Figure 4.6 Render of the Taxi game. The policy derives from the Q-values trained with SARSA
For all the color references mentioned in the chapter, please refer to the color images bundle at http://www.packtpub.com/sites/default/files/downloads/9781789131116_ColorImages.pdf.
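If you want to reproduce a plot similar to figure 4.5 yourself, one option is to have SARSA also return the test_rewards list it already collects, and then plot it with matplotlib inside the same script. A minimal sketch, assuming SARSA has been modified to return Q, test_rewards:

import matplotlib.pyplot as plt

Q, test_rewards = SARSA(env, lr=.1, num_episodes=5000, eps=0.4, gamma=0.95, eps_decay=0.001)

# One test point was collected every 300 episodes
plt.plot(np.arange(len(test_rewards)) * 300, test_rewards)
plt.xlabel('Episode')
plt.ylabel('Mean test reward')
plt.show()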

To have a better view of the algorithm and all the hyperparameters, we suggest you play with them, change them, and observe the results. You can also try to use an exponential ε-decay rate instead of a linear one, along the lines of the sketch below. You learn by doing, just as RL algorithms do, by trial and error.
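For example, the two lines that decrease eps in the main loop could be replaced with something like this (here eps_decay would be a multiplicative factor slightly below 1, for example 0.999, rather than a small decrement):

# Possible variation: exponential eps decay instead of the linear one used above
if eps > 0.01:
    eps *= eps_decay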
