In the previous chapter, we focused on RNNs for sequential learning. This last chapter of the book is about reinforcement learning, which is the third type of machine learning task mentioned at the beginning of the book. You will see how learning from experience, by interacting with the environment, differs from the supervised and unsupervised learning covered previously.
We will cover the following topics in this chapter:

- Setting up the working environment with PyTorch and OpenAI Gym
- The fundamentals of reinforcement learning with examples
- Solving the FrozenLake environment with dynamic programming, including value iteration and policy iteration
- Performing Monte Carlo prediction and control in the Blackjack environment
- Solving the Taxi problem with the Q-learning algorithm
Let's get started with setting up the working environment, including PyTorch as the main framework, and OpenAI Gym, the toolkit that gives you a variety of environments to develop your learning algorithms on.
PyTorch (https://pytorch.org/) is a trendy machine learning library developed on top of Torch (http://torch.ch/) by Facebook's AI lab. It provides dynamic computational graphs and strong GPU support, along with a simple and friendly interface. PyTorch is expanding quickly in academia, and it has seen heavy adoption by more and more companies. The following chart (taken from http://horace.io/pytorch-vs-tensorflow/) shows the growth of PyTorch at top machine learning conferences:
Figure 14.1: Number of PyTorch papers in top machine learning conferences
In the past year, there have been more mentions of PyTorch than TensorFlow at those conferences. Hopefully, you are motivated enough to work with PyTorch. Now let's see how to properly install it.
Firstly, in the following table on the page https://pytorch.org/get-started/locally/, you can pick the right configurations for your environment:
Figure 14.2: Installing PyTorch with system configurations
Here, I use Mac, Conda, and Python 3.7 running locally (no CUDA) as an example, and run the suggested command line:
conda install pytorch torchvision -c pytorch
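If you are not using Conda, pip generally works as well; the following is the typical pip equivalent at the time of writing (check the table on the PyTorch page for the exact command matching your configuration):
pip install torch torchvision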
Next, you can run the following lines of code in Python to confirm correct installation:
>>> import torch
>>> x = torch.empty(3, 4)
>>> print(x)
tensor([[7.8534e+34, 4.7418e+30, 5.9663e-02, 7.0374e+22],
[3.5788e+01, 4.5825e-41, 4.0272e+01, 4.5825e-41],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]])
Here, a tensor in PyTorch is similar to an ndarray in NumPy, or a tensor in TensorFlow. We have just created a tensor of size 3 * 4. It is an uninitialized matrix: the floats you see are meaningless placeholders, whatever happened to be in the allocated memory. Again, this is very similar to NumPy's empty array.
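As a quick sanity check of the NumPy analogy, we can convert between the two. This is a minimal sketch, assuming NumPy is installed alongside PyTorch:
>>> import numpy as np
>>> np_array = np.zeros((3, 4))
>>> t = torch.from_numpy(np_array)  # shares the underlying memory
>>> print(t.shape)
torch.Size([3, 4])
>>> print(t.numpy().shape)
(3, 4)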
If you want to get more familiar with PyTorch, you can go through the Getting Started sections in the official tutorials at https://pytorch.org/tutorials/#getting-started. I recommend that you at least finish the introductory tutorials there.
By now, we have successfully set up PyTorch. Let's look at installing OpenAI Gym in the next section.
You are not limited to PyTorch for reinforcement learning. TensorFlow is always a good option. It's just beneficial to learn the trending framework PyTorch in the last chapter of this book.
OpenAI Gym (https://gym.openai.com/) is a powerful open source toolkit for developing and comparing reinforcement learning algorithms. It provides a variety of environments to develop your reinforcement learning algorithms on. It is developed by OpenAI (https://openai.com/), a non-profit research company focused on building safe and beneficial Artificial General Intelligence (AGI).
There are two ways to install Gym. The first one is via pip, as follows:
pip install gym
Another approach is to build from source by cloning the package from its Git repository and installing it from there:
git clone https://github.com/openai/gym
cd gym
pip install -e .
After the installation, you can check the available Gym environments by running the following code:
>>> from gym import envs
>>> print(envs.registry.all())
dict_values([EnvSpec(Copy-v0), EnvSpec(RepeatCopy-v0), EnvSpec(ReversedAddition-v0), EnvSpec(ReversedAddition3-v0), EnvSpec(DuplicatedInput-v0), EnvSpec(Reverse-v0), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v0), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v2), EnvSpec(BipedalWalkerHardcore-v2), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v0)
……
……
You can see the full list of environments at https://gym.openai.com/envs/, including walking, moon landing, car racing, and Atari games. Feel free to play around with Gym.
When benchmarking different reinforcement learning algorithms, we need to apply them in standardized environments. Gym provides exactly that: a collection of versatile environments. This is similar to using datasets such as MNIST, ImageNet, and Thomson Reuters News as benchmarks in supervised and unsupervised learning.
Gym has an easy-to-use interface for the reinforcement learning environments, which we can write agents to interact with. So what's reinforcement learning? What's an agent? Let's see in the next section.
In this chapter, I will first introduce the elements of reinforcement learning along with an interesting example, then will move on to how we measure feedback from the environment, and follow with the fundamental approaches to solve reinforcement learning problems.
You may have played Super Mario (or Sonic) when you were young. During the video game, you control Mario to collect coins and avoid obstacles at the same time. The game ends if Mario hits an obstacle or falls in a gap. And you try to get as many coins as possible before the game ends.
Reinforcement learning is very similar to the Super Mario game. Reinforcement learning is about learning what to do: it observes the situations in the environment and determines the right actions in order to maximize a numerical reward. Here is the list of elements in a reinforcement learning task (I also link each element to Super Mario and other examples so it's easier to understand):

- Environment: the world the task takes place in; in Super Mario, it is the game world itself.
- Agent: the learner and decision maker, such as Mario, or a robot learning to walk.
- State: the situation the agent observes from the environment, such as Mario's position and the obstacles and coins around him.
- Action: a move the agent can take, such as jumping or moving right.
- Reward: the numerical feedback the agent receives after taking an action, such as a positive reward for collecting a coin and a large negative one for hitting an obstacle.
The following diagram summarizes the process of reinforcement learning:
Figure 14.3: Reinforcement learning process
The reinforcement learning process is an iterative loop. At the beginning, the agent observes the initial state, s0, from the environment. Then the agent takes an action, a0, according to its model. After the agent moves, the environment is now in a new state, s1, and it gives a feedback reward, R1. The agent then takes an action, a1, as computed by its model with inputs s1 and R1. This process continues until termination or, for a continuing task, forever.
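In code, this loop maps directly onto the Gym interface we installed earlier. The following is a minimal sketch of the interaction loop, with a random agent standing in for the model (the FrozenLake environment here is just a placeholder example):
>>> import gym
>>> env = gym.make('FrozenLake-v0')
>>> state = env.reset()
>>> is_done = False
>>> while not is_done:
...     action = env.action_space.sample()  # a trained model would pick the action here
...     state, reward, is_done, info = env.step(action)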
The goal of the reinforcement learning model is to maximize the total reward. So how can we calculate the total reward? Is it simply by summing up rewards at all the time steps? Let's see in the next section.
At time step t, the cumulative reward (also called the return), $G_t$, can be written as:

$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T = \sum_{k=0}^{T-t-1} R_{t+k+1}$$

Here, T is the termination time step, or infinity for a continuing task. $G_t$ means the total future reward after taking an action, $a_t$, at time t. At each time step t, the reinforcement learning model attempts to learn the best possible action in order to maximize $G_t$.
However, in many real-world cases, things don't work this way where we simply sum up all future rewards. Take a look at the following example:
Stock A rises 6 dollars at the end of day 1 and falls 5 dollars at the end of day 2. Stock B falls 5 dollars on day 1 and rises 6 dollars on day 2. After two days, both stocks rise 1 dollar. So which one will you buy at the beginning of day 1? Obviously stock A because you won't lose money and can even profit 6 dollars if you sell it at the beginning of day 2.
Both stocks have the same total reward, but we favor stock A as we care more about immediate returns than distant ones. Similarly, in reinforcement learning, we discount rewards in the distant future, and the discount is tied to the time horizon: rewards at longer time horizons should have less impact on the cumulative reward. This is because longer time horizons include more irrelevant information and, consequently, are of higher variance.
We define a discount factor, $\gamma$, with a value between 0 and 1, and rewrite the cumulative reward incorporating the discount factor:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

As you can see, the larger $\gamma$ is, the smaller the discount, and vice versa. If $\gamma = 1$, there is literally no discount and the model evaluates an action based on the sum total of all future rewards. If $\gamma = 0$, the model only focuses on the immediate reward, $R_{t+1}$.
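To make the formula concrete, here is a small sketch that computes the discounted return from a list of rewards (the reward values are made up purely for illustration):
>>> def discounted_return(rewards, gamma):
...     g = 0
...     for reward in reversed(rewards):
...         g = gamma * g + reward  # G_t = R_{t+1} + gamma * G_{t+1}
...     return g
>>> print(round(discounted_return([1, 0, 0, 1], gamma=0.9), 4))
1.729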
Now that we know how to calculate the cumulative reward, the next thing to talk about is how to maximize it.
There are mainly two approaches to solving reinforcement learning problems, both of which are about finding the optimal actions to maximize the cumulative reward: one is policy-based, and the other is value-based.
A policy, $\pi$, is a function that maps each input state to an action. It can be either deterministic or stochastic:

- Deterministic: $a = \pi(s)$; there is a one-to-one mapping from state to action.
- Stochastic: $\pi(a|s) = P[A_t = a \mid S_t = s]$; the policy gives a probability distribution over actions for a given state.
In the policy-based approach, the model learns the optimal policy that maps each input state to the best action.
The value, V, of a state is defined as the expected future cumulative reward to collect from the state:

$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$$
In the value-based approach, the model learns the optimal value function that maximizes the value of the input state. In other words, the agent takes an action to reach the state that achieves the largest value.
In a policy-based algorithm, the model starts with a random policy. It then computes the value function of that policy. This step is called the policy evaluation step. After this, it finds a new and better policy based on the value function. This is the policy improvement step. These two steps repeat until the optimal policy is found. Whereas in a value-based algorithm, the model starts with a random value function. It then finds a new and improved value function in an iterative manner, until it reaches the optimal value function.
We've learned there are two main approaches to solve reinforcement learning problems. In the next section, let's see how to solve a concrete reinforcement learning example (FrozenLake) using a concrete algorithm, the dynamic programming method, in a policy-based and value-based way respectively.
We will focus on the policy-based and value-based dynamic programming algorithms in this section. But let's start with simulating the FrozenLake environment.
FrozenLake is a typical OpenAI Gym environment with discrete states. It is about moving the agent from the starting tile to the destination tile in a grid, while avoiding traps. The grid is either 4 * 4 (https://gym.openai.com/envs/FrozenLake-v0/) or 8 * 8 (https://gym.openai.com/envs/FrozenLake8x8-v0/). There are four types of tiles in the grid:

- S: the starting tile, where an episode begins
- F: the frozen tile, which is walkable
- H: the hole tile, which ends the episode with zero reward when stepped on
- G: the goal tile, which ends the episode with a +1 reward when reached

Here, an episode means one simulation of a reinforcement learning environment: it contains the list of states from the initial state to the terminal state, the list of actions taken, and the list of rewards received. In the 4 * 4 FrozenLake environment, there are 16 possible states, as the agent can be on any of the 16 tiles, and four possible actions: moving left (0), down (1), right (2), and up (3).
The tricky part of this environment is that the ice surface is slippery, so the agent won't always move in the direction it intends: with certain probabilities, it moves in another walkable direction or stays put. For example, it may move to the right even though it intends to move up.
Now let's simulate the 4 * 4 FrozenLake environment by following these steps:

First, we import the Gym library and create an instance of the environment, whose name is "FrozenLake-v0" in our case, and then check its state and action dimensions:
>>> import gym
>>> env = gym.make("FrozenLake-v0")
>>> n_state = env.observation_space.n
>>> print(n_state)
16
>>> n_action = env.action_space.n
>>> print(n_action)
4
Hence, we have just obtained the dimensions of the environment: 16 possible states and 4 possible actions. Next, we reset the environment:
>>> env.reset()
0
This means that the agent starts at state 0. Again, there are 16 possible states: 0, 1, …, 15. We can also render the environment:
>>> env.render()
You will see a 4 * 4 matrix representing the FrozenLake grid and the tile (state 0) where the agent is located:
Figure 14.4: Initial state of FrozenLake
Now let's take an action; we move right (action 2), for example:
>>> new_state, reward, is_done, info = env.step(2)
>>> print(new_state)
1
>>> print(reward)
0.0
>>> print(is_done)
False
>>> print(info)
{'prob': 0.3333333333333333}
The agent moves right to state 1 with a probability of 33.33% and gets 0 reward, since the episode is not done yet. We can also see the render result:
>>> env.render()
Figure 14.5: Result of the agent moving right
You may get a completely different result as the agent can move down to state 4 at a probability of 33.33%, or stay at state 0 at a probability of 33.33%.
Next, let's define a function that simulates a FrozenLake episode under a given policy and returns the total reward (we will use it to evaluate policies later):
>>> def run_episode(env, policy):
...     state = env.reset()
...     total_reward = 0
...     is_done = False
...     while not is_done:
...         action = policy[state].item()
...         state, reward, is_done, info = env.step(action)
...         total_reward += reward
...         if is_done:
...             break
...     return total_reward
Here, policy is a PyTorch tensor, and .item() extracts the value of an element of the tensor.
Now we simulate 1,000 episodes under randomly generated policies:
>>> n_episode = 1000
>>> total_rewards = []
>>> for episode in range(n_episode):
...     random_policy = torch.randint(high=n_action, size=(n_state,))
...     total_reward = run_episode(env, random_policy)
...     total_rewards.append(total_reward)
...
>>> print(f'Average total reward under random policy: {sum(total_rewards)/n_episode}')
Average total reward under random policy: 0.014
On average, there is a 1.4% chance that the agent can reach the goal if we take random actions. This tells us it is not as easy to solve the FrozenLake environment as you might think.
As a bonus, we can inspect the transition matrix of the environment for any state; taking state 6 as an example:
>>> print(env.env.P[6])
{0: [(0.3333333333333333, 2, 0.0, False), (0.3333333333333333, 5, 0.0, True), (0.3333333333333333, 10, 0.0, False)], 1: [(0.3333333333333333, 5, 0.0, True), (0.3333333333333333, 10, 0.0, False), (0.3333333333333333, 7, 0.0, True)], 2: [(0.3333333333333333, 10, 0.0, False), (0.3333333333333333, 7, 0.0, True), (0.3333333333333333, 2, 0.0, False)], 3: [(0.3333333333333333, 7, 0.0, True), (0.3333333333333333, 2, 0.0, False), (0.3333333333333333, 5, 0.0, True)]}
The keys of the returned dictionary, 0, 1, 2, and 3, represent the four possible actions. The value for each key is a list of tuples associated with the action, in the format of (transition probability, new state, reward, is terminal state or not). For example, if the agent intends to take action 1 (down) from state 6, it will move to state 5 (an H tile) with 33.33% probability, receiving 0 reward and ending the episode; it will move to state 10 with 33.33% probability, receiving 0 reward; and it will move to state 7 (an H tile) with 33.33% probability, receiving 0 reward and terminating the episode.
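As a quick consistency check on this structure, we can verify that the transition probabilities for each action sum to (approximately) 1:
>>> for action, transitions in env.env.P[6].items():
...     print(action, sum(trans_prob for trans_prob, _, _, _ in transitions))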
We've experimented with the random policy in this section, and we only succeeded 1.4% of the time. But this gets you ready for the next section where we will find the optimal policy using the value-based dynamic programming algorithm, called the value iteration algorithm.
Value iteration is an iterative algorithm. It starts with random values, V, and then iteratively updates the values based on the Bellman optimality equation (https://en.wikipedia.org/wiki/Bellman_equation) until they converge.
It is usually difficult for the values to completely converge. Hence, there are two criteria of convergence. One is passing a fixed number of iterations, such as 1,000 or 10,000. Another one is specifying a threshold (such as 0.0001, or 0.00001) and we terminate the process if the changes of all values are less than the threshold.
Importantly, in each iteration, instead of taking the expectation (average) of values across all actions, it picks the action that maximizes the policy values. The iteration process can be expressed as follows:

$$V^{*}(s) := \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{*}(s') \right]$$

Here, $V^{*}(s)$ is the optimal value function; $T(s, a, s')$ denotes the transition probability of moving to state $s'$ from state $s$ by taking action $a$; and $R(s, a, s')$ is the reward received in state $s'$ by taking action $a$. Once we obtain the optimal values, we can easily compute the optimal policy accordingly:

$$\pi^{*}(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{*}(s') \right]$$
Let's solve the FrozenLake environment using the value iteration algorithm as follows:
>>> gamma = 0.99
>>> threshold = 0.0001
>>> def value_iteration(env, gamma, threshold):
...     """
...     Solve a given environment with value iteration algorithm
...     @param env: OpenAI Gym environment
...     @param gamma: discount factor
...     @param threshold: the evaluation will stop once changes in values for all states are less than the threshold
...     @return: values of the optimal policy for the given environment
...     """
...     n_state = env.observation_space.n
...     n_action = env.action_space.n
...     V = torch.zeros(n_state)
...     while True:
...         V_temp = torch.empty(n_state)
...         for state in range(n_state):
...             v_actions = torch.zeros(n_action)
...             for action in range(n_action):
...                 for trans_prob, new_state, reward, _ in env.env.P[state][action]:
...                     v_actions[action] += trans_prob * (reward + gamma * V[new_state])
...             V_temp[state] = torch.max(v_actions)
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp.clone()
...         if max_delta <= threshold:
...             break
...     return V
The value_iteration function does the following tasks:

- Initializes the policy values as all zeros
- Updates the values based on the Bellman optimality equation
- Computes the maximal change of the values across all states
- If the maximal change is greater than the threshold, it keeps updating the values; otherwise, it terminates the loop and returns the latest values

We plug in the environment, discount factor, and convergence threshold:
>>> V_optimal = value_iteration(env, gamma, threshold)
Take a look at the resulting optimal values:
>>> print('Optimal values:\n', V_optimal)
Optimal values:
tensor([0.5404, 0.4966, 0.4681, 0.4541, 0.5569, 0.0000, 0.3572, 0.0000, 0.5905, 0.6421, 0.6144, 0.0000, 0.0000, 0.7410, 0.8625, 0.0000])
Now that we have the optimal values, we can extract the optimal policy from them with the following function:
>>> def extract_optimal_policy(env, V_optimal, gamma):
...     """
...     Obtain the optimal policy based on the optimal values
...     @param env: OpenAI Gym environment
...     @param V_optimal: optimal values
...     @param gamma: discount factor
...     @return: optimal policy
...     """
...     n_state = env.observation_space.n
...     n_action = env.action_space.n
...     optimal_policy = torch.zeros(n_state)
...     for state in range(n_state):
...         v_actions = torch.zeros(n_action)
...         for action in range(n_action):
...             for trans_prob, new_state, reward, _ in env.env.P[state][action]:
...                 v_actions[action] += trans_prob * (reward + gamma * V_optimal[new_state])
...         optimal_policy[state] = torch.argmax(v_actions)
...     return optimal_policy
>>> optimal_policy = extract_optimal_policy(env, V_optimal, gamma)
Take a look at the resulting optimal policy:
>>> print('Optimal policy:\n', optimal_policy)
Optimal policy:
tensor([0., 3., 3., 3., 0., 3., 2., 3., 3., 1., 0., 3., 3., 2., 1., 3.])
This means the optimal action in state 0 is 0 (left), in state 1 it is 3 (up), and so on. This doesn't look very intuitive if you look at the grid, but remember that the surface is slippery and the agent can move in a direction other than the desired one.
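It can help to lay the policy out on the 4 * 4 grid; one quick way is to reshape the tensor:
>>> print(optimal_policy.reshape(4, 4))
tensor([[0., 3., 3., 3.],
        [0., 3., 2., 3.],
        [3., 1., 0., 3.],
        [3., 2., 1., 3.]])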
Finally, let's gauge how good the optimal policy is by running 1,000 episodes with it:
>>> n_episode = 1000
>>> total_rewards = []
>>> for episode in range(n_episode):
...     total_reward = run_episode(env, optimal_policy)
...     total_rewards.append(total_reward)
Here, we reuse the run_episode function we defined in the previous section. Then we print out the average reward:
>>> print('Average total reward under the optimal policy:', sum(total_rewards) / n_episode)
Average total reward under the optimal policy: 0.75
Under the optimal policy computed by the value iteration algorithm, the agent reaches the goal tile 75% of the time. Can we do something similar with the policy-based approach? Let's see in the next section.
The policy iteration algorithm has two components, policy evaluation and policy improvement. Similar to value iteration, it starts with an arbitrary policy and improves it over a number of iterations.
In the policy evaluation step of each iteration, we first compute the values of the latest policy, based on the Bellman expectation equation:

$$V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]$$

In the policy improvement step, we derive an improved policy based on the latest policy values, again based on the Bellman optimality equation:

$$\pi'(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]$$
These two steps repeat until the policy converges. At convergence, the latest policy and its value are the optimal policy and the optimal value.
Let's develop the policy iteration algorithm and use it to solve the FrozenLake environment as follows:
We start with the policy_evaluation function that computes the values of a given policy:
>>> def policy_evaluation(env, policy, gamma, threshold):
...     """
...     Perform policy evaluation
...     @param env: OpenAI Gym environment
...     @param policy: policy tensor containing the action to take in each state
...     @param gamma: discount factor
...     @param threshold: the evaluation will stop once changes in values for all states are less than the threshold
...     @return: values of the given policy
...     """
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     while True:
...         V_temp = torch.zeros(n_state)
...         for state in range(n_state):
...             action = policy[state].item()
...             for trans_prob, new_state, reward, _ in env.env.P[state][action]:
...                 V_temp[state] += trans_prob * (reward + gamma * V[new_state])
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp.clone()
...         if max_delta <= threshold:
...             break
...     return V
The policy_evaluation function does the following tasks:

- Initializes the policy values as all zeros
- Updates the values based on the Bellman expectation equation
- Computes the maximal change of the values across all states
- If the maximal change is greater than the threshold, it keeps updating the values; otherwise, it terminates the evaluation and returns the latest values

Next, we develop the second component, the policy_improvement function:
>>> def policy_improvement(env, V, gamma):
...     """
...     Obtain an improved policy based on the values
...     @param env: OpenAI Gym environment
...     @param V: policy values
...     @param gamma: discount factor
...     @return: the improved policy
...     """
...     n_state = env.observation_space.n
...     n_action = env.action_space.n
...     policy = torch.zeros(n_state)
...     for state in range(n_state):
...         v_actions = torch.zeros(n_action)
...         for action in range(n_action):
...             for trans_prob, new_state, reward, _ in env.env.P[state][action]:
...                 v_actions[action] += trans_prob * (reward + gamma * V[new_state])
...         policy[state] = torch.argmax(v_actions)
...     return policy
It derives a new and better policy from the input policy values, based on the Bellman optimality equation.
With both components ready, we can now develop the whole policy iteration algorithm:
>>> def policy_iteration(env, gamma, threshold):
...     """
...     Solve a given environment with policy iteration algorithm
...     @param env: OpenAI Gym environment
...     @param gamma: discount factor
...     @param threshold: the evaluation will stop once changes in values for all states are less than the threshold
...     @return: optimal values and the optimal policy for the given environment
...     """
...     n_state = env.observation_space.n
...     n_action = env.action_space.n
...     policy = torch.randint(high=n_action, size=(n_state,)).float()
...     while True:
...         V = policy_evaluation(env, policy, gamma, threshold)
...         policy_improved = policy_improvement(env, V, gamma)
...         if torch.equal(policy_improved, policy):
...             return V, policy_improved
...         policy = policy_improved
This function does the following tasks:

- Initializes a random policy
- Computes the values of the policy with the policy evaluation step
- Obtains an improved policy based on the values with the policy improvement step
- If the new policy is the same as the old one, returns the latest values and the latest policy; otherwise, it repeats from the evaluation step with the new policy

We plug in the environment, discount factor, and convergence threshold:
>>> V_optimal, optimal_policy = policy_iteration(env, gamma, threshold)
>>> print('Optimal values:\n', V_optimal)
Optimal values:
tensor([0.5404, 0.4966, 0.4681, 0.4541, 0.5569, 0.0000, 0.3572, 0.0000, 0.5905, 0.6421, 0.6144, 0.0000, 0.0000, 0.7410, 0.8625, 0.0000])
>>> print('Optimal policy:\n', optimal_policy)
Optimal policy:
tensor([0., 3., 3., 3., 0., 3., 2., 3., 3., 1., 0., 3., 3., 2., 1., 3.])
We got the same results as the value iteration algorithm.
We have just solved the FrozenLake environment with the policy iteration algorithm. You may wonder how to choose between the value iteration and policy iteration algorithms. Take a look at the following table:
Table 14.1: Choosing between the policy iteration and value iteration algorithms
We solved a reinforcement learning problem using dynamic programming methods. They require a fully known transition matrix and reward matrix of an environment. And they have limited scalability for environments with many states. In the next section, we will continue our learning journey with the Monte Carlo method, which has no requirement of prior knowledge of the environment and is much more scalable.
Monte Carlo (MC)-based reinforcement learning is a model-free approach, which means it doesn't need a known transition matrix and reward matrix. In this section, you will learn about MC policy evaluation on the Blackjack environment, and solve the environment with MC Control algorithms. Blackjack is a typical environment with an unknown transition matrix. Let's first simulate the Blackjack environment.
Blackjack is a popular card game. The game has the following rules:

- The player competes against a dealer, and wins if the sum of their cards is higher than the dealer's without exceeding 21.
- Cards from 2 to 10 have face value; the face cards J, Q, and K are valued at 10; an ace can count as either 1 or 11 (a so-called usable ace when counted as 11).
- The player can keep requesting additional cards (hit) or stop receiving cards (stick).
- A player whose card sum exceeds 21 goes bust and loses.
- After the player sticks, the dealer keeps drawing cards until their sum reaches 17 or greater; if the dealer goes bust, the player wins.
The Blackjack environment (https://github.com/openai/gym/blob/master/gym/envs/toy_text/blackjack.py) in Gym is formulated as follows:

- An episode of the environment is one game of Blackjack.
- There are two possible actions: hit (1) and stick (0).
- A state is a 3-dimensional vector: the player's current points, the value of the dealer's revealed card, and whether or not the player has a usable ace.
- The reward is +1 if the player wins, -1 if the player loses, and 0 if there is a draw.
Now let's simulate the Blackjack environment and explore its states and actions:
First, we create a Blackjack instance:
>>> env = gym.make('Blackjack-v0')
>>> env.reset()
(7, 10, False)
It returns the initial state (a 3-dimensional vector):

- The player's current points (7 in this example)
- The value of the dealer's revealed card (10 in this example)
- Whether the player has a usable ace (False in this example)

The usable ace variable is True only if the player has an ace that can be counted as 11 without causing a bust. If the player doesn't have an ace, or has an ace but it busts, this state variable will be False.

For another state example, (18, 6, True), it means that the player has an ace counted as 11 and a 7, and that the dealer's revealed card has a value of 6.
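Accordingly, the observation space of the environment is a tuple of three discrete spaces, which we can inspect directly (the exact representation below is from the Gym version used here):
>>> print(env.observation_space)
Tuple(Discrete(32), Discrete(11), Discrete(2))
>>> print(env.action_space)
Discrete(2)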
Now let's take a hit action (1):
>>> env.step(1)
((13, 10, False), 0.0, False, {})
It returns a state (13, 10, False), a 0 reward, and the episode not being done (as in False).
We take another hit, and then stick:
>>> env.step(1)
((19, 10, False), 0.0, False, {})
>>> env.step(0)
((19, 10, False), 1.0, True, {})
The dealer draws its cards and goes bust, so the player wins and gets a +1 reward. The episode ends.
Feel free to play around with the Blackjack environment. Once you feel comfortable with the environment, you can move on to the next section, MC policy evaluation on a simple policy.
In the previous section, we applied dynamic programming to perform policy evaluation, which computes the value function of a policy. However, this won't work in most real-life situations, where the transition matrix is not known beforehand. In such cases, we can evaluate the value function using the MC method.
To estimate the value function, the MC method uses empirical mean return instead of expected return (as in dynamic programming). There are two approaches to compute the empirical mean return. One is first-visit, which averages returns only for the first occurrence of a state s among all episodes. Another one is every-visit, which averages returns for every occurrence of a state s among all episodes. Obviously, the first-visit approach has a lot less computation and is therefore more commonly used. And I will only cover the first-visit approach in this chapter.
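To illustrate the difference, consider a toy sketch where a single episode visits state s twice, with returns of 2 from the first occurrence and 0 from the second (made-up numbers):
>>> returns_of_s = [2, 0]  # returns from the 1st and 2nd occurrences of s
>>> first_visit_value = returns_of_s[0]  # only the first occurrence counts
>>> every_visit_value = sum(returns_of_s) / len(returns_of_s)
>>> print(first_visit_value, every_visit_value)
2 1.0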
In this section, we experiment with a simple policy, where we keep adding new cards until the total value reaches 18 (or 19, or 20 if you like). We perform first-visit MC evaluation on the simple policy as follows:

First, we define a function that simulates a Blackjack episode under the simple policy and returns the resulting states and rewards:
>>> def run_episode(env, hold_score):
...     state = env.reset()
...     rewards = []
...     states = [state]
...     while True:
...         action = 1 if state[0] < hold_score else 0
...         state, reward, is_done, info = env.step(action)
...         states.append(state)
...         rewards.append(reward)
...         if is_done:
...             break
...     return states, rewards
In each round of an episode, the agent takes a hit if the current score is less than hold_score, or sticks otherwise.
Next, we develop the function that performs first-visit MC prediction:
>>> from collections import defaultdict
>>> def mc_prediction_first_visit(env, hold_score, gamma, n_episode):
...     V = defaultdict(float)
...     N = defaultdict(int)
...     for episode in range(n_episode):
...         states_t, rewards_t = run_episode(env, hold_score)
...         return_t = 0
...         G = {}
...         for state_t, reward_t in zip(states_t[-2::-1], rewards_t[::-1]):
...             return_t = gamma * return_t + reward_t
...             G[state_t] = return_t
...         for state, return_t in G.items():
...             if state[0] <= 21:
...                 V[state] += return_t
...                 N[state] += 1
...     for state in V:
...         V[state] = V[state] / N[state]
...     return V
The function performs the following tasks:

- Runs n_episode episodes under the simple Blackjack policy with the run_episode function
- For each episode, computes the return G for the first visit of each state
- For each state, obtains the value by averaging its first-visit returns from all episodes

Note that here we ignore states where the player busts, since we know their values are -1.
We specify hold_score as 18 and the discount rate as 1, since a Blackjack episode is short enough, and will simulate 500,000 episodes:
>>> hold_score = 18
>>> gamma = 1
>>> n_episode = 500000
>>> value = mc_prediction_first_visit(env, hold_score, gamma, n_episode)
We then print the resulting values:
>>> print(value)
defaultdict(<class 'float'>, {(20, 6, False): 0.6923485653560042, (17, 5, False): -0.24390243902439024, (16, 5, False): -0.19118165784832453, (20, 10, False): 0.4326379146490474, (20, 7, False): 0.7686220540168588, (16, 6, False): -0.19249478804725503,
……
……
(5, 9, False): -0.20612244897959184, (12, 7, True): 0.058823529411764705, (6, 4, False): -0.26582278481012656, (4, 8, False): -0.14937759336099585, (4, 3, False): -0.1680327868852459, (4, 9, False): -0.20276497695852536, (4, 4, False): -0.3201754385964912, (12, 8, True): 0.11057692307692307})
We have just computed the values for all possible 280 states:
>>> print('Number of states:', len(value))
Number of states: 280
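We can also query the estimated value of any single state, for example, a player sum of 18 against a dealer's 10 without a usable ace (the state key here is just an example; the exact value will vary from run to run):
>>> print(value[(18, 10, False)])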
We have just experienced computing the values of 280 states under a simple policy in the Blackjack environment using the MC method. The transition matrix of the Blackjack environment is not known beforehand. Moreover, obtaining the transition matrix (size 280 * 280) will be extremely costly if we go with the dynamic programming approach. In the MC-based solution, we just need to simulate a bunch of episodes and compute the empirical average values. In a similar manner, we will search for the optimal policy in the next section.
MC control is used to find the optimal policy for environments with unknown transition matrices. There are two types of MC control, on-policy and off-policy. In the on-policy approach, we execute the policy and evaluate and improve it iteratively; whereas in the off-policy approach, we train the optimal policy using data generated by another policy.
In this section, we focus on the on-policy approach. The way it works is very similar to the policy iteration method. It iterates between the following two phases, evaluation and improvement, until convergence:

- In the evaluation phase, instead of evaluating the state values, it evaluates the action-values, commonly called Q-values. The Q-value, Q(s, a), is the value of a state-action pair: taking action a in state s and following the given policy afterwards. It is estimated with empirical first-visit returns.
- In the improvement phase, the policy is updated by assigning to each state the action with the highest Q-value: $\pi(s) = \arg\max_a Q(s, a)$.
Let's now search for the optimal Blackjack policy with on-policy MC control by following the steps below:
First, we develop a function that runs an episode by taking the best actions under the given Q-values:
>>> def run_episode(env, Q, n_action):
...     """
...     Run an episode given Q-values
...     @param env: OpenAI Gym environment
...     @param Q: Q-values
...     @param n_action: action space size
...     @return: resulting states, actions and rewards for the entire episode
...     """
...     state = env.reset()
...     rewards = []
...     actions = []
...     states = []
...     action = torch.randint(0, n_action, [1]).item()
...     while True:
...         actions.append(action)
...         states.append(state)
...         state, reward, is_done, info = env.step(action)
...         rewards.append(reward)
...         if is_done:
...             break
...         action = torch.argmax(Q[state]).item()
...     return states, actions, rewards
This serves as the improvement phase. Specifically, it does the following tasks:

- Initializes an episode and takes a random action as an exploring start
- After the first action, takes the action with the highest Q-value, given the current state
- Stores the states, actions, and rewards of all steps, to be used in the evaluation phase

Then we develop the on-policy MC control function:
>>> def mc_control_on_policy(env, gamma, n_episode):
...     """
...     Obtain the optimal policy with on-policy MC control method
...     @param env: OpenAI Gym environment
...     @param gamma: discount factor
...     @param n_episode: number of episodes
...     @return: the optimal Q-function, and the optimal policy
...     """
...     G_sum = defaultdict(float)
...     N = defaultdict(int)
...     Q = defaultdict(lambda: torch.empty(env.action_space.n))
...     for episode in range(n_episode):
...         states_t, actions_t, rewards_t = run_episode(env, Q, env.action_space.n)
...         return_t = 0
...         G = {}
...         for state_t, action_t, reward_t in zip(states_t[::-1], actions_t[::-1], rewards_t[::-1]):
...             return_t = gamma * return_t + reward_t
...             G[(state_t, action_t)] = return_t
...         for state_action, return_t in G.items():
...             state, action = state_action
...             if state[0] <= 21:
...                 G_sum[state_action] += return_t
...                 N[state_action] += 1
...                 Q[state][action] = G_sum[state_action] / N[state_action]
...     policy = {}
...     for state, actions in Q.items():
...         policy[state] = torch.argmax(actions).item()
...     return Q, policy
This function does the following tasks:

- Runs n_episode episodes with the run_episode function
- For each episode, computes the first-visit returns for each state-action pair
- Updates the Q-values by averaging the first-visit returns over all episodes
- Finally, derives the optimal policy by taking the best action in each state under the resulting Q-values

We specify the discount rate as 1 and will simulate 500,000 episodes:
>>> gamma = 1
>>> n_episode = 500000
>>> optimal_Q, optimal_policy = mc_control_on_policy(env, gamma, n_episode)
Take a look at the optimal policy:
>>> print(optimal_policy)
{(16, 8, True): 1, (11, 2, False): 1, (15, 5, True): 1, (14, 9, False): 1, (11, 6, False): 1, (20, 3, False): 0, (9, 6, False):
0, (12, 9, False): 0, (21, 2, True): 0, (16, 10, False): 1, (17, 5, False): 0, (13, 10, False): 1, (12, 10, False): 1, (14, 10, False): 0, (10, 2, False): 1, (20, 4, False): 0, (11, 4, False): 1, (16, 9, False): 0, (10, 8,
……
……
1, (18, 6, True): 0, (12, 2, True): 1, (8, 3, False): 1, (13, 3, True): 0, (4, 7, False): 1, (18, 8, True): 0, (6, 5, False): 1, (17, 6, True): 0, (19, 9, True): 0, (4, 4, False): 0, (14, 5, True): 1, (12, 6, True): 0, (4, 9, False): 1, (13, 4, True): 1, (4, 8, False): 1, (14, 3, True): 1, (12, 4, True): 1, (4, 6, False): 0, (12, 5, True): 0, (4, 2, False): 1, (4, 3, False): 1, (5, 4, False): 1, (4, 1, False): 0}
You may wonder if this optimal policy is really optimal and better than the previous simple policy (hold at 18 points). Let's simulate 100,000 Blackjack episodes under the optimal policy and the simple policy respectively:
>>> def simulate_hold_episode(env, hold_score):
...     state = env.reset()
...     while True:
...         action = 1 if state[0] < hold_score else 0
...         state, reward, is_done, _ = env.step(action)
...         if is_done:
...             return reward
>>> def simulate_episode(env, policy):
...     state = env.reset()
...     while True:
...         action = policy[state]
...         state, reward, is_done, _ = env.step(action)
...         if is_done:
...             return reward
>>> n_episode = 100000
>>> hold_score = 18
>>> n_win_opt = 0
>>> n_win_hold = 0
>>> for _ in range(n_episode):
...     reward = simulate_episode(env, optimal_policy)
...     if reward == 1:
...         n_win_opt += 1
...     reward = simulate_hold_episode(env, hold_score)
...     if reward == 1:
...         n_win_hold += 1
We print out the results as follows:
>>> print(f'Winning probability:\nUnder the simple policy: {n_win_hold/n_episode}\nUnder the optimal policy: {n_win_opt/n_episode}')
Winning probability:
Under the simple policy: 0.39955
Under the optimal policy: 0.42779
Playing under the optimal policy has a 43% chance of winning, while playing under the simple policy has only a 40% chance.
In this section, we solved the Blackjack environment with a model-free algorithm, MC learning. In MC learning, the Q-values are updated until the end of an episode. This could be problematic for long processes. In the next section, we will talk about Q-learning, which updates the Q-values for every step in an episode. You will see how it increases learning efficiency.
Q-learning is also a model-free learning algorithm. It updates the Q-function for every step in an episode. We will demonstrate how Q-learning is used to solve the Taxi environment. It is a typical environment with relatively long episodes. So let's first simulate the Taxi environment.
In the Taxi environment (https://gym.openai.com/envs/Taxi-v3/), the agent acts as a taxi driver, picking up a passenger at one location and dropping the passenger off at the destination.
All subjects are on a 5 * 5 grid. Take a look at the following example:
Figure 14.6: Example of the Taxi environment
Tiles in certain colors have the following meanings:

- Yellow: the empty taxi (without a passenger)
- Green: the taxi with a passenger inside
- Blue: the passenger's location
- Purple: the destination

The starting position of the empty taxi, the passenger's location, and the destination are randomly assigned in each episode. The four letters R, Y, B, and G mark the only four tiles that allow passenger pick-up and drop-off.
The taxi can take any of the following six deterministic actions:

- 0: moving south
- 1: moving north
- 2: moving east
- 3: moving west
- 4: picking up the passenger
- 5: dropping off the passenger
A pillar "|" between two tiles prevents the taxi from moving between them.
In each step, the reward follows these rules:

- -1 for every step taken, to encourage the shortest route
- +20 for a successful passenger drop-off
- -10 for attempting an illegal pick-up or drop-off
Last but not least, there are 500 possible states: the taxi can be on any of the 25 tiles; the passenger can be at any of R, Y, B, and G, or inside the taxi (five possibilities); and the destination can be any of R, Y, B, and G (four possibilities). Hence, we have 25 * 5 * 4 = 500 possible states.
Now let's play around with the environment as follows:
>>> env = gym.make('Taxi-v3')
>>> n_state = env.observation_space.n
>>> print(n_state)
500
>>> n_action = env.action_space.n
>>> print(n_action)
6
We also know that the state is represented by an integer ranging from 0 to 499, and there are 6 possible actions.
>>> env.reset()
262
>>> env.render()
You will see a 5 * 5 grid similar to the following one:
Figure 14.7: Example starting step of the Taxi environment
The passenger is on the blue R tile, and the destination is on the purple Y.
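As a side note, the underlying TaxiEnv exposes encode and decode helpers for this 500-state representation (available in the Gym version used here); decoding our initial state 262 recovers the taxi's row and column, the passenger's location, and the destination:
>>> print(list(env.env.decode(262)))
[2, 3, 0, 2]
That is, the taxi is at row 2, column 3; the passenger is at location 0 (R); and the destination is location 2 (Y).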
Now let's drive the taxi to the passenger's location: we go west three tiles (action 3) and north two tiles (action 1), then pick the passenger up (action 4). Your moves may differ, depending on the initial state:
>>> print(env.step(3))
(242, -1, False, {'prob': 1.0})
>>> print(env.step(3))
(222, -1, False, {'prob': 1.0})
>>> print(env.step(3))
(202, -1, False, {'prob': 1.0})
>>> print(env.step(1))
(102, -1, False, {'prob': 1.0})
>>> print(env.step(1))
(2, -1, False, {'prob': 1.0})
>>> print(env.step(4))
(18, -1, False, {'prob': 1.0})
Render the environment:
>>> env.render()
Figure 14.8: Example of a state where the passenger is inside the taxi
The taxi turns green, meaning the passenger is inside the taxi.
Now we drive to the destination: we go south four tiles (action 0), then drop the passenger off (action 5):
>>> print(env.step(0))
(118, -1, False, {'prob': 1.0})
>>> print(env.step(0))
(218, -1, False, {'prob': 1.0})
>>> print(env.step(0))
(318, -1, False, {'prob': 1.0})
>>> print(env.step(0))
(418, -1, False, {'prob': 1.0})
>>> print(env.step(5))
(410, 20, True, {'prob': 1.0})
A +20 reward is granted in the end for a successful drop-off.
>>> env.render()
Figure 14.9: Example of a state where the passenger arrives at the destination
You can take some random actions and see how difficult it is for a model to solve the environment. We will discuss the Q-learning algorithm in the next section.
Q-learning is an off-policy learning algorithm: it optimizes the Q-values based on data generated by a behavior policy, while the target policy (the policy we attempt to optimize) is a greedy policy that, for a given state, takes the action achieving the highest Q-value. The behavior policy generates the learning data, and the target policy updates the Q-values based on the following equation:

$$Q(s, a) := Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here, $s'$ is the resulting state after taking action $a$ in state $s$, and $r$ is the associated reward. The term $\max_{a'} Q(s', a')$ means that the greedy target policy picks the highest Q-value in state $s'$. Finally, the hyperparameters $\alpha$ and $\gamma$ are the learning rate and the discount factor, respectively.
Learning from experience generated by another policy enables Q-learning to optimize its Q-values in every single step of an episode: it takes the value of the greedy action in the next state and uses it to update the Q-values right away.
One more thing to note is that the behavior policy is epsilon-greedy, meaning it takes a random action with a probability of $\epsilon$ (a value from 0 to 1), and takes the greedy action (the one with the highest Q-value) with a probability of $1 - \epsilon$. The epsilon-greedy policy combines exploitation and exploration: it exploits the best action while also exploring different actions.
Now it is time to develop the Q-learning algorithm to solve the Taxi environment:
We start by defining the epsilon-greedy behavior policy:
>>> def gen_epsilon_greedy_policy(n_action, epsilon):
...     def policy_function(state, Q):
...         probs = torch.ones(n_action) * epsilon / n_action
...         best_action = torch.argmax(Q[state]).item()
...         probs[best_action] += 1.0 - epsilon
...         action = torch.multinomial(probs, 1).item()
...         return action
...     return policy_function
Given $|A|$ possible actions, each action is taken with a probability of $\epsilon / |A|$, and the action with the highest state-action value is chosen with an additional probability of $1 - \epsilon$.
>>> epsilon = 0.1
>>> epsilon_greedy_policy = gen_epsilon_greedy_policy(env.action_space.n, epsilon)
Here, $\epsilon = 0.1$, which is the exploration ratio.
Next, we develop the Q-learning algorithm:
>>> def q_learning(env, gamma, n_episode, alpha):
...     """
...     Obtain the optimal policy with off-policy Q-learning method
...     @param env: OpenAI Gym environment
...     @param gamma: discount factor
...     @param n_episode: number of episodes
...     @param alpha: learning rate
...     @return: the optimal Q-function, and the optimal policy
...     """
...     n_action = env.action_space.n
...     Q = defaultdict(lambda: torch.zeros(n_action))
...     for episode in range(n_episode):
...         state = env.reset()
...         is_done = False
...         while not is_done:
...             action = epsilon_greedy_policy(state, Q)
...             next_state, reward, is_done, info = env.step(action)
...             delta = reward + gamma * torch.max(Q[next_state]) - Q[state][action]
...             Q[state][action] += alpha * delta
...             length_episode[episode] += 1
...             total_reward_episode[episode] += reward
...             if is_done:
...                 break
...             state = next_state
...     policy = {}
...     for state, actions in Q.items():
...         policy[state] = torch.argmax(actions).item()
...     return Q, policy
We first initialize the Q-table. Then, in each episode, we let the agent take actions following the epsilon-greedy policy and update the Q-function for each step based on the off-policy learning equation. We also record the length and total reward of each episode in two global lists, which we define next. We run n_episode episodes and finally obtain the optimal policy and Q-values.
We specify 1,000 episodes and initiate the two lists that record the length and total reward of each episode:
>>> n_episode = 1000
>>> length_episode = [0] * n_episode
>>> total_reward_episode = [0] * n_episode
>>> gamma = 1
>>> alpha = 0.4
>>> optimal_Q, optimal_policy = q_learning(env, gamma, n_episode, alpha)
Here, we set the discount rate $\gamma = 1$ and the learning rate $\alpha = 0.4$.
Now let's visualize how the total reward changes over the episodes:
>>> import matplotlib.pyplot as plt
>>> plt.plot(total_reward_episode)
>>> plt.title('Episode reward over time')
>>> plt.xlabel('Episode')
>>> plt.ylabel('Total reward')
>>> plt.ylim([-200, 20])
>>> plt.show()
Refer to the following screenshot for the end result:
Figure 14.10: Total rewards over episodes
The total rewards keep improving during learning. And they stay around +5 after 600 episodes.
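We can verify this by averaging the total rewards of, say, the last 100 episodes (the exact number will vary from run to run):
>>> print(sum(total_reward_episode[-100:]) / 100)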
We also visualize how the episode lengths change over time:
>>> plt.plot(length_episode)
>>> plt.title('Episode length over time')
>>> plt.xlabel('Episode')
>>> plt.ylabel('Length')
>>> plt.show()
Refer to the following screenshot for the end result:
Figure 14.11: Episode lengths over episodes
As you can see, the episode lengths decrease from the maximum of 200 steps to around 10, and the model converges after around 600 episodes. This means that, after training, the model is able to solve the problem in around 10 steps.
In this section, we solved the Taxi problem with off-policy Q-learning. The algorithm optimizes the Q-values in every single step by learning from the experience generated by a greedy policy.
We started the chapter by setting up the working environment. After that, we studied the fundamentals of reinforcement learning along with a few examples. After exploring the FrozenLake environment, we solved it with two dynamic programming algorithms, value iteration and policy iteration. We talked about Monte Carlo learning and used it for value approximation and control in the Blackjack environment. Lastly, we developed the Q-learning algorithm and solved the Taxi problem.