Applying Q-learning to Taxi-v2

In general, Q-learning can be used to solve the same kinds of problems that can be tackled with SARSA, and because they both come from the same family (TD learning), they generally achieve similar performance. Nevertheless, in some specific problems, one approach may be preferable to the other, so it's useful to also know how Q-learning is implemented.

For this reason, here we'll implement Q-learning to solve Taxi-v2, the same environment that was used for SARSA. Be aware, though, that with just a few adaptations, it can be applied to any other environment with the right characteristics. Having results from both Q-learning and SARSA on the same environment gives us the opportunity to compare their performance.

To be as consistent as possible, we kept some functions unchanged from the SARSA implementation. These are as follows:

  • eps_greedy(Q,s,eps) is the ε-greedy policy that takes a Q matrix, a state s, and the eps value. It returns an action.
  • greedy(Q,s) is the greedy policy that takes a Q matrix and a state s. It returns the action associated with the maximum Q-value in the state s.
  • run_episodes(env,Q,num_episodes,to_print) is a function that runs num_episodes games to test the greedy policy associated with the Q matrix. If to_print is True, it prints the results. Otherwise, it returns the mean of the rewards.

To see the implementation of those functions, you can refer to the SARSA applied to Taxi-v2 section or the GitHub repository of the book, which can be found at https://github.com/PacktPublishing/Reinforcement-Learning-Algorithms-with-Python.
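If you don't have the SARSA code at hand, the following is a minimal sketch of what those helpers could look like, reconstructed from the descriptions above (the exact implementations are in the book's repository):

import numpy as np

def greedy(Q, s):
    # return the action with the highest Q-value in state s
    return np.argmax(Q[s])

def eps_greedy(Q, s, eps=0.1):
    # with probability eps pick a random action, otherwise act greedily
    if np.random.uniform(0, 1) < eps:
        return np.random.randint(Q.shape[1])
    return greedy(Q, s)

def run_episodes(env, Q, num_episodes=100, to_print=False):
    # play num_episodes games with the greedy policy and return the mean reward
    tot_rew = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        game_rew = 0
        while not done:
            next_state, rew, done, _ = env.step(greedy(Q, state))
            state = next_state
            game_rew += rew
        tot_rew.append(game_rew)
    if to_print:
        print('Mean score: %.3f of %i games!' % (np.mean(tot_rew), num_episodes))
    return np.mean(tot_rew)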

The main function that executes the Q-learning algorithm takes as arguments an environment, env; a learning rate, lr (the α variable used in formula (4.6)); the number of episodes to train the algorithm, num_episodes; the initial ε value, eps, used by the ε-greedy policy; the decay rate, eps_decay; and the discount factor, gamma:

def Q_learning(env, lr=0.01, num_episodes=10000, eps=0.3, gamma=0.95, eps_decay=0.00005):
    nA = env.action_space.n
    nS = env.observation_space.n

    # Q(s,a) -> each row is a different state and each column represents a different action
    Q = np.zeros((nS, nA))

    games_reward = []
    test_rewards = []

The first lines of the function store the dimensions of the action and observation spaces, initialize the Q array that contains the Q-value of each state-action pair, and create the empty lists used to keep track of the algorithm's progress.

Then, we can implement the cycle that iterates num_episodes times:

    for ep in range(num_episodes):
        state = env.reset()
        done = False
        tot_rew = 0

        if eps > 0.01:
            eps -= eps_decay

Each iteration (that is, each episode) starts by resetting the environment, initializing the done and tot_rew variables, and decreasing eps linearly. 

Then, we have to iterate across all of the timesteps of the episode, because that is where the Q-learning update takes place:

        while not done:
            action = eps_greedy(Q, state, eps)
            next_state, rew, done, _ = env.step(action)  # take one step in the environment

            # Q-learning update: use the maximum Q-value of the next state (formula 4.6)
            Q[state][action] = Q[state][action] + lr*(rew + gamma*np.max(Q[next_state]) - Q[state][action])

            state = next_state
            tot_rew += rew

            if done:
                games_reward.append(tot_rew)

This is the main body of the algorithm. The flow is fairly standard:

  1. The action is chosen following the ε-greedy policy (the behavior policy).
  2. The action is executed in the environment, which returns the next state, a reward, and the done flag.
  3. The state-action value is updated based on formula (4.6), which is restated just after this list.
  4. next_state is assigned to the state variable.
  5. The reward of the last step is added to the cumulative reward of the episode.
  6. If it was the final step, the reward is stored in games_reward and the cycle terminates.
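For reference, the update applied in step 3 is the standard Q-learning rule, which is exactly what the Q[state][action] line in the code computes:

    Q(s, a) ← Q(s, a) + α * [ r + γ * max_a' Q(s', a') − Q(s, a) ]

Here, s' is next_state and the maximum is taken over all of the actions available in s'. Because the target uses the greedy (maximum) value of the next state rather than the value of the action actually taken, the update does not depend on the behavior policy.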

Lastly, every 300 episodes of the outer cycle, we run 1,000 games to test the agent, print some useful information, and append the result to test_rewards; once all of the episodes are done, we return the Q array:

        if (ep % 300) == 0:
            test_rew = run_episodes(env, Q, 1000)
            print("Episode:{:5d}  Eps:{:2.4f}  Rew:{:2.4f}".format(ep, eps, test_rew))
            test_rewards.append(test_rew)

    return Q

That's everything. As a final step, in the main function, we can create the environment and run the algorithm:

if __name__ == '__main__':
    env = gym.make('Taxi-v2')
    Q = Q_learning(env, lr=.1, num_episodes=5000, eps=0.4, gamma=0.95, eps_decay=0.001)

The algorithm reaches steady results after about 3,000 episodes, as can be deduced from figure 4.8. This plot can be created by plotting test_rewards:

Figure 4.8 The results of Q-learning on Taxi-v2
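A minimal sketch of how a plot like figure 4.8 could be produced with matplotlib, assuming Q_learning is modified to also return the test_rewards list (for example, by ending the function with return Q, test_rewards), follows:

import matplotlib.pyplot as plt

Q, test_rewards = Q_learning(env, lr=.1, num_episodes=5000, eps=0.4, gamma=0.95, eps_decay=0.001)

# one test value is collected every 300 episodes
plt.plot([ep * 300 for ep in range(len(test_rewards))], test_rewards)
plt.xlabel('Episode')
plt.ylabel('Mean test reward')
plt.show()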

As usual, we suggest that you tune the hyperparameters and play with the implementation to gain better insight into the algorithm. 
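As a starting point, a simple sweep over a few learning rates might look like the following sketch (the values below are arbitrary examples):

# try a few learning rates and compare the resulting greedy policies
for lr in [0.01, 0.1, 0.5]:
    Q = Q_learning(env, lr=lr, num_episodes=5000, eps=0.4, gamma=0.95, eps_decay=0.001)
    print('lr={:.2f} -> mean test reward: {:.2f}'.format(lr, run_episodes(env, Q, 1000)))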

Overall, the algorithm has found a policy similar to the one found by the SARSA algorithm. To check this for yourself, you can render some episodes or print the greedy action resulting from the Q array.
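As a sketch, rendering one episode of the greedy policy could look as follows (env.render() prints the taxi grid to the terminal, and greedy is the helper described earlier):

state = env.reset()
done = False
while not done:
    env.render()                   # show the current taxi grid
    action = greedy(Q, state)      # pick the greedy action from the learned Q array
    state, rew, done, _ = env.step(action)
env.render()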
