In the previous chapter, we focused on RNNs for sequential learning. This last chapter of the book is about reinforcement learning, which is the third type of machine learning task mentioned at the beginning of the book. You will see how learning from experience, by interacting with the environment, differs from the supervised and unsupervised learning covered previously.
We will cover the following topics in this chapter:

- Setting up the working environment with PyTorch and OpenAI Gym
- The fundamentals of reinforcement learning with examples
- Solving the FrozenLake environment with dynamic programming, including value iteration and policy iteration
- Performing Monte Carlo prediction and control in the Blackjack environment
- Solving the Taxi problem with the Q-learning algorithm
Let's get started with setting up the working environment, including PyTorch as the main framework, and OpenAI Gym, the toolkit that gives you a variety of environments to develop your learning algorithms on.
PyTorch (https://pytorch.org/) is a trendy machine learning library developed on top of Torch (http://torch.ch/) by Facebook's AI lab. It provides dynamic computational graphs and strong GPU support, along with a simple and friendly interface. PyTorch is expanding quickly in academia, and it has seen heavy adoption by more and more companies. The following chart (taken from http://horace.io/pytorch-vs-tensorflow/) shows the growth of PyTorch at top machine learning conferences:
Figure 14.1: Number of PyTorch papers in top machine learning conferences
In the past year, there have been more mentions of PyTorch than TensorFlow at those conferences. Hopefully, you are motivated enough to work with PyTorch. Now let's see how to properly install it.
Firstly, in the following table on the page https://pytorch.org/get-started/locally/, you can pick the right configurations for your environment:
Figure 14.2: Installing PyTorch with system configurations
Here, I use Mac, Conda, and Python 3.7 running locally (no CUDA) as an example, and run the suggested command line:
conda install pytorch torchvision -c pytorch
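If you are not using Conda, pip generally works as well; the following is the typical pip equivalent at the time of writing (check the table on the PyTorch page for the exact command matching your configuration):
pip install torch torchvision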
Next, you can run the following lines of code in Python to confirm correct installation:
>>> import torch
>>> x = torch.empty(3, 4)
>>> print(x)
tensor([[7.8534e+34, 4.7418e+30, 5.9663e-02, 7.0374e+22],
[3.5788e+01, 4.5825e-41, 4.0272e+01, 4.5825e-41],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]])
Here, a tensor in PyTorch is similar to an ndarray in NumPy, or a tensor in TensorFlow. We have just created a tensor of size 3 * 4. It is an uninitialized matrix: the floats you see are meaningless placeholders, whatever happened to be in the allocated memory. Again, this is very similar to NumPy's empty array.
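As a quick sanity check of the NumPy analogy, we can convert between the two. This is a minimal sketch, assuming NumPy is installed alongside PyTorch:
>>> import numpy as np
>>> np_array = np.zeros((3, 4))
>>> t = torch.from_numpy(np_array)  # shares the underlying memory
>>> print(t.shape)
torch.Size([3, 4])
>>> print(t.numpy().shape)
(3, 4)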
If you want to get more familiar with PyTorch, you can go through the Getting Started sections in the official tutorials at https://pytorch.org/tutorials/#getting-started. I recommend that you at least finish the introductory tutorials there.
By now, we have successfully set up PyTorch. Let's look at installing OpenAI Gym in the next section.
You are not limited to PyTorch for reinforcement learning. TensorFlow is always a good option. It's just beneficial to learn the trending framework PyTorch in the last chapter of this book.
OpenAI Gym (https://gym.openai.com/) is a powerful open source toolkit for developing and comparing reinforcement learning algorithms. It provides a variety of environments to develop your reinforcement learning algorithms on. It is developed by OpenAI (https://openai.com/), a non-profit research company focused on building safe and beneficial Artificial General Intelligence (AGI).
There are two ways to install Gym. The first one is via pip, as follows:
pip install gym
Another approach is to build from source by cloning the package from its Git repository and installing it from there:
git clone https://github.com/openai/gym
cd gym
pip install -e .
After the installation, you can check the available Gym environments by running the following code:
>>> from gym import envs
>>> print(envs.registry.all())
dict_values([EnvSpec(Copy-v0), EnvSpec(RepeatCopy-v0), EnvSpec(ReversedAddition-v0), EnvSpec(ReversedAddition3-v0), EnvSpec(DuplicatedInput-v0), EnvSpec(Reverse-v0), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v0), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v2), EnvSpec(BipedalWalkerHardcore-v2), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v0)
……
……
You can see the full list of environments at https://gym.openai.com/envs/, including walking, moon landing, car racing, and Atari games. Feel free to play around with Gym.
When benchmarking different reinforcement learning algorithms, we need to apply them in standardized environments. Gym provides exactly that: a collection of versatile environments. This is similar to using datasets such as MNIST, ImageNet, and Thomson Reuters News as benchmarks in supervised and unsupervised learning.
Gym has an easy-to-use interface for the reinforcement learning environments, which we can write agents to interact with. So what's reinforcement learning? What's an agent? Let's see in the next section.
In this chapter, I will first introduce the elements of reinforcement learning along with an interesting example, then will move on to how we measure feedback from the environment, and follow with the fundamental approaches to solve reinforcement learning problems.
You may have played Super Mario (or Sonic) when you were young. During the video game, you control Mario to collect coins and avoid obstacles at the same time. The game ends if Mario hits an obstacle or falls in a gap. And you try to get as many coins as possible before the game ends.
Reinforcement learning is very similar to the Super Mario game. Reinforcement learning is about learning what to do: it observes the situations in the environment and determines the right actions in order to maximize a numerical reward. Here is the list of elements in a reinforcement learning task (I also link each element to Super Mario and other examples so it's easier to understand):

- Environment: the world the task takes place in; in Super Mario, it is the game world itself.
- Agent: the learner and decision maker, such as Mario, or a robot learning to walk.
- State: the situation the agent observes from the environment, such as Mario's position and the obstacles and coins around him.
- Action: a move the agent can take, such as jumping or moving right.
- Reward: the numerical feedback the agent receives after taking an action, such as a positive reward for collecting a coin and a large negative one for hitting an obstacle.
The following diagram summarizes the process of reinforcement learning:
Figure 14.3: Reinforcement learning process
The reinforcement learning process is an iterative loop. At the beginning, the agent observes the initial state, s0, from the environment. Then the agent takes an action, a0, according to its model. After the agent moves, the environment is now in a new state, s1, and it gives a feedback reward, R1. The agent then takes an action, a1, as computed by its model with inputs s1 and R1. This process continues until termination or, for a continuing task, forever.
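In code, this loop maps directly onto the Gym interface we installed earlier. The following is a minimal sketch of the interaction loop, with a random agent standing in for the model (the FrozenLake environment here is just a placeholder example):
>>> import gym
>>> env = gym.make('FrozenLake-v0')
>>> state = env.reset()
>>> is_done = False
>>> while not is_done:
...     action = env.action_space.sample()  # a trained model would pick the action here
...     state, reward, is_done, info = env.step(action)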
The goal of the reinforcement learning model is to maximize the total reward. So how can we calculate the total reward? Is it simply by summing up rewards at all the time steps? Let's see in the next section.
At time step t, the cumulative reward (also called the return), $G_t$, can be written as:

$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T = \sum_{k=0}^{T-t-1} R_{t+k+1}$$

Here, T is the termination time step, or infinity for a continuing task. $G_t$ means the total future reward after taking an action, $a_t$, at time t. At each time step t, the reinforcement learning model attempts to learn the best possible action in order to maximize $G_t$.
However, in many real-world cases, things don't work this way where we simply sum up all future rewards. Take a look at the following example:
Stock A rises 6 dollars at the end of day 1 and falls 5 dollars at the end of day 2. Stock B falls 5 dollars on day 1 and rises 6 dollars on day 2. After two days, both stocks rise 1 dollar. So which one will you buy at the beginning of day 1? Obviously stock A because you won't lose money and can even profit 6 dollars if you sell it at the beginning of day 2.
Both stocks have the same total reward, but we favor stock A as we care more about immediate returns than distant ones. Similarly, in reinforcement learning, we discount rewards in the distant future, and the discount is tied to the time horizon: rewards at longer time horizons should have less impact on the cumulative reward. This is because longer time horizons include more irrelevant information and, consequently, are of higher variance.
We define a discount factor, $\gamma$, with a value between 0 and 1, and rewrite the cumulative reward incorporating the discount factor:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

As you can see, the larger $\gamma$ is, the smaller the discount, and vice versa. If $\gamma = 1$, there is literally no discount and the model evaluates an action based on the sum total of all future rewards. If $\gamma = 0$, the model only focuses on the immediate reward, $R_{t+1}$.
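To make the formula concrete, here is a small sketch that computes the discounted return from a list of rewards (the reward values are made up purely for illustration):
>>> def discounted_return(rewards, gamma):
...     g = 0
...     for reward in reversed(rewards):
...         g = gamma * g + reward  # G_t = R_{t+1} + gamma * G_{t+1}
...     return g
>>> print(round(discounted_return([1, 0, 0, 1], gamma=0.9), 4))
1.729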
Now that we know how to calculate the cumulative reward, the next thing to talk about is how to maximize it.
There are mainly two approaches to solving reinforcement learning problems, both of which are about finding the optimal actions to maximize the cumulative reward: one is policy-based, and the other is value-based.
A policy, $\pi$, is a function that maps each input state to an action. It can be either deterministic or stochastic:

- Deterministic: $a = \pi(s)$; there is a one-to-one mapping from state to action.
- Stochastic: $\pi(a|s) = P[A_t = a \mid S_t = s]$; the policy gives a probability distribution over actions for a given state.
In the policy-based approach, the model learns the optimal policy that maps each input state to the best action.
The value, V, of a state is defined as the expected future cumulative reward to collect from the state:

$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$$
In the value-based approach, the model learns the optimal value function that maximizes the value of the input state. In other words, the agent takes an action to reach the state that achieves the largest value.
In a policy-based algorithm, the model starts with a random policy. It then computes the value function of that policy. This step is called the policy evaluation step. After this, it finds a new and better policy based on the value function. This is the policy improvement step. These two steps repeat until the optimal policy is found. Whereas in a value-based algorithm, the model starts with a random value function. It then finds a new and improved value function in an iterative manner, until it reaches the optimal value function.
We've learned there are two main approaches to solve reinforcement learning problems. In the next section, let's see how to solve a concrete reinforcement learning example (FrozenLake) using a concrete algorithm, the dynamic programming method, in a policy-based and value-based way respectively.
We will focus on the policy-based and value-based dynamic programming algorithms in this section. But let's start with simulating the FrozenLake environment.
FrozenLake is a typical OpenAI Gym environment with discrete states. It is about moving the agent from the starting tile to the destination tile in a grid, while avoiding traps. The grid is either 4 * 4 (https://gym.openai.com/envs/FrozenLake-v0/) or 8 * 8 (https://gym.openai.com/envs/FrozenLake8x8-v0/). There are four types of tiles in the grid:

- S: the starting tile, where an episode begins
- F: the frozen tile, which is walkable
- H: the hole tile, which ends the episode with zero reward when stepped on
- G: the goal tile, which ends the episode with a +1 reward when reached

Here, an episode means one simulation of a reinforcement learning environment: it contains the list of states from the initial state to the terminal state, the list of actions taken, and the list of rewards received. In the 4 * 4 FrozenLake environment, there are 16 possible states, as the agent can be on any of the 16 tiles, and four possible actions: moving left (0), down (1), right (2), and up (3).
The tricky part of this environment is that the ice surface is slippery, so the agent won't always move in the direction it intends: with certain probabilities, it moves in another walkable direction or stays put. For example, it may move to the right even though it intends to move up.
Now let's simulate the 4 * 4 FrozenLake environment by following these steps:

First, we import the Gym library and create an instance of the environment, whose name is "FrozenLake-v0" in our case, and then check its state and action dimensions:
>>> import gym
>>> env = gym.make("FrozenLake-v0")
>>> n_state = env.observation_space.n
>>> print(n_state)
16
>>> n_action = env.action_space.n
>>> print(n_action)
4
Hence, we have just obtained the dimensions of the environment: 16 possible states and 4 possible actions. Next, we reset the environment:
>>> env.reset()
0
This means that the agent starts at state 0. Again, there are 16 possible states: 0, 1, …, 15. We can also render the environment:
>>> env.render()
You will see a 4 * 4 matrix representing the FrozenLake grid and the tile (state 0) where the agent is located:
Figure 14.4: Initial state of FrozenLake
Now let's take an action; we move right (action 2), for example:
>>> new_state, reward, is_done, info = env.step(2)
>>> print(new_state)
1
>>> print(reward)
0.0
>>> print(is_done)
False
>>> print(info)
{'prob': 0.3333333333333333}
The agent moves right to state 1 with a probability of 33.33% and gets 0 reward, since the episode is not done yet. We can also see the render result:
>>> env.render()
Figure 14.5: Result of the agent moving right
You may get a completely different result as the agent can move down to state 4 at a probability of 33.33%, or stay at state 0 at a probability of 33.33%.
Next, let's define a function that simulates a FrozenLake episode under a given policy and returns the total reward (we will use it to evaluate policies later):
>>> def run_episode(env, policy):
...     state = env.reset()
...     total_reward = 0
...     is_done = False
...     while not is_done:
...         action = policy[state].item()
...         state, reward, is_done, info = env.step(action)
...         total_reward += reward
...         if is_done:
...             break
...     return total_reward
Here, policy is a PyTorch tensor, and .item() extracts the value of an element of the tensor.
Now we simulate 1,000 episodes under randomly generated policies:
>>> n_episode = 1000
>>> total_rewards = []
>>> for episode in range(n_episode):
...     random_policy = torch.randint(high=n_action, size=(n_state,))
...     total_reward = run_episode(env, random_policy)
...     total_rewards.append(total_reward)
...
>>> print(f'Average total reward under random policy: {sum(total_rewards)/n_episode}')
Average total reward under random policy: 0.014
On average, there is a 1.4% chance that the agent can reach the goal if we take random actions. This tells us it is not as easy to solve the FrozenLake environment as you might think.
As a bonus, we can inspect the transition matrix of the environment for any state; taking state 6 as an example:
>>> print(env.env.P[6])
{0: [(0.3333333333333333, 2, 0.0, False), (0.3333333333333333, 5, 0.0, True), (0.3333333333333333, 10, 0.0, False)], 1: [(0.3333333333333333, 5, 0.0, True), (0.3333333333333333, 10, 0.0, False), (0.3333333333333333, 7, 0.0, True)], 2: [(0.3333333333333333, 10, 0.0, False), (0.3333333333333333, 7, 0.0, True), (0.3333333333333333, 2, 0.0, False)], 3: [(0.3333333333333333, 7, 0.0, True), (0.3333333333333333, 2, 0.0, False), (0.3333333333333333, 5, 0.0, True)]}
The keys of the returned dictionary, 0, 1, 2, and 3, represent the four possible actions. The value for each key is a list of tuples associated with the action, in the format of (transition probability, new state, reward, is terminal state or not). For example, if the agent intends to take action 1 (down) from state 6, it will move to state 5 (an H tile) with 33.33% probability, receiving 0 reward and ending the episode; it will move to state 10 with 33.33% probability, receiving 0 reward; and it will move to state 7 (an H tile) with 33.33% probability, receiving 0 reward and terminating the episode.
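As a quick consistency check on this structure, we can verify that the transition probabilities for each action sum to (approximately) 1:
>>> for action, transitions in env.env.P[6].items():
...     print(action, sum(trans_prob for trans_prob, _, _, _ in transitions))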
We've experimented with the random policy in this section, and we only succeeded 1.4% of the time. But this gets you ready for the next section where we will find the optimal policy using the value-based dynamic programming algorithm, called the value iteration algorithm.
Value iteration is an iterative algorithm. It starts with random values, V, and then iteratively updates the values based on the Bellman optimality equation (https://en.wikipedia.org/wiki/Bellman_equation) until they converge.
It is usually difficult for the values to completely converge. Hence, there are two criteria of convergence. One is passing a fixed number of iterations, such as 1,000 or 10,000. Another one is specifying a threshold (such as 0.0001, or 0.00001) and we terminate the process if the changes of all values are less than the threshold.
Importantly, in each iteration, instead of taking the expectation (average) of values across all actions, it picks the action that maximizes the policy values. The iteration process can be expressed as follows:

$$V^{*}(s) := \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{*}(s') \right]$$

Here, $V^{*}(s)$ is the optimal value function; $T(s, a, s')$ denotes the transition probability of moving to state $s'$ from state $s$ by taking action $a$; and $R(s, a, s')$ is the reward received in state $s'$ by taking action $a$. Once we obtain the optimal values, we can easily compute the optimal policy accordingly:

$$\pi^{*}(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{*}(s') \right]$$
Let's solve the FrozenLake environment using the value iteration algorithm as follows:
>>> gamma = 0.99
>>> threshold = 0.0001
>>> def value_iteration(env, gamma, threshold):
...     """
...     Solve a given environment with value iteration algorithm
...     @param env: OpenAI Gym environment
...     @param gamma: discount factor
...     @param threshold: the evaluation will stop once changes in values for all states are less than the threshold
...     @return: values of the optimal policy for the given environment
...     """
...     n_state = env.observation_space.n
...     n_action = env.action_space.n
...     V = torch.zeros(n_state)
...     while True:
...         V_temp = torch.empty(n_state)
...         for state in range(n_state):
...             v_actions = torch.zeros(n_action)
...             for action in range(n_action):
...                 for trans_prob, new_state, reward, _ in env.env.P[state][action]:
...                     v_actions[action] += trans_prob * (reward + gamma * V[new_state])
...             V_temp[state] = torch.max(v_actions)
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp.clone()
...         if max_delta <= threshold:
...             break
...     return V
The value_iteration function does the following tasks:

- Initializes the policy values as all zeros
- Updates the values based on the Bellman optimality equation
- Computes the maximal change of the values across all states
- If the maximal change is greater than the threshold, it keeps updating the values; otherwise, it terminates the loop and returns the latest values

We plug in the environment, discount factor, and convergence threshold:
>>> V_optimal = value_iteration(env, gamma, threshold)
Take a look at the resulting optimal values:
>>> print('Optimal values:\n', V_optimal)
Optimal values:
tensor([0.5404, 0.4966, 0.4681, 0.4541, 0.5569, 0.0000, 0.3572, 0.0000, 0.5905, 0.6421, 0.6144, 0.0000, 0.0000, 0.7410, 0.8625, 0.0000])
Now that we have the optimal values, we can extract the optimal policy from them with the following function:
>>> def extract_optimal_policy(env, V_optimal, gamma):
...     """
...     Obtain the optimal policy based on the optimal values
...     @param env: OpenAI Gym environment
...     @param V_optimal: optimal values
...     @param gamma: discount factor
...     @return: optimal policy
...     """
...     n_state = env.observation_space.n
...     n_action = env.action_space.n
...     optimal_policy = torch.zeros(n_state)
...     for state in range(n_state):
...         v_actions = torch.zeros(n_action)
...         for action in range(n_action):
...             for trans_prob, new_state, reward, _ in env.env.P[state][action]:
...                 v_actions[action] += trans_prob * (reward + gamma * V_optimal[new_state])
...         optimal_policy[state] = torch.argmax(v_actions)
...     return optimal_policy
>>> optimal_policy = extract_optimal_policy(env, V_optimal, gamma)
Take a look at the resulting optimal policy:
>>> print('Optimal policy:\n', optimal_policy)
Optimal policy:
tensor([0., 3., 3., 3., 0., 3., 2., 3., 3., 1., 0., 3., 3., 2., 1., 3.])
This means the optimal action in state 0 is 0 (left), in state 1 it is 3 (up), and so on. This doesn't look very intuitive if you look at the grid, but remember that the surface is slippery and the agent can move in a direction other than the desired one.
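It can help to lay the policy out on the 4 * 4 grid; one quick way is to reshape the tensor:
>>> print(optimal_policy.reshape(4, 4))
tensor([[0., 3., 3., 3.],
        [0., 3., 2., 3.],
        [3., 1., 0., 3.],
        [3., 2., 1., 3.]])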
Finally, let's gauge how good the optimal policy is by running 1,000 episodes with it:
>>> n_episode = 1000
>>> total_rewards = []
>>> for episode in range(n_episode):
...     total_reward = run_episode(env, optimal_policy)
...     total_rewards.append(total_reward)
Here, we reuse the run_episode function we defined in the previous section. Then we print out the average reward:
>>> print('Average total reward under the optimal policy:', sum(total_rewards) / n_episode)
Average total reward under the optimal policy: 0.75
Under the optimal policy computed by the value iteration algorithm, the agent reaches the goal tile 75% of the time. Can we do something similar with the policy-based approach? Let's see in the next section.
The policy iteration algorithm has two components, policy evaluation and policy improvement. Similar to value iteration, it starts with an arbitrary policy and improves it over a number of iterations.
In the policy evaluation step of each iteration, we first compute the values of the latest policy, based on the Bellman expectation equation:

$$V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]$$

In the policy improvement step, we derive an improved policy based on the latest policy values, again based on the Bellman optimality equation:

$$\pi'(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]$$
These two steps repeat until the policy converges. At convergence, the latest policy and its value are the optimal policy and the optimal value.
Let's develop the policy iteration algorithm and use it to solve the FrozenLake environment as follows:
We start with the policy_evaluation function that computes the values of a given policy:
>>> def policy_evaluation(env, policy, gamma, threshold):
...     """
...     Perform policy evaluation
...     @param env: OpenAI Gym environment
...     @param policy: policy tensor containing the action to take in each state
...     @param gamma: discount factor
...     @param threshold: the evaluation will stop once changes in values for all states are less than the threshold
...     @return: values of the given policy
...     """
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     while True:
...         V_temp = torch.zeros(n_state)
...         for state in range(n_state):
...             action = policy[state].item()
...             for trans_prob, new_state, reward, _ in env.env.P[state][action]:
...                 V_temp[state] += trans_prob * (reward + gamma * V[new_state])
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp.clone()
...         if max_delta <= threshold:
...             break
...     return V
The policy_evaluation function does the following tasks:

- Initializes the policy values as all zeros
- Updates the values based on the Bellman expectation equation
- Computes the maximal change of the values across all states
- If the maximal change is greater than the threshold, it keeps updating the values; otherwise, it terminates the evaluation and returns the latest values

Next, we develop the second component, the policy_improvement function:
>>> def policy_improvement(env, V, gamma):
...     """
...     Obtain an improved policy based on the values
...     @param env: OpenAI Gym environment
...     @param V: policy values
...     @param gamma: discount factor
...     @return: the improved policy
...     """
...     n_state = env.observation_space.n
...     n_action = env.action_space.n
...     policy = torch.zeros(n_state)
...     for state in range(n_state):
...         v_actions = torch.zeros(n_action)
...         for action in range(n_action):
...             for trans_prob, new_state, reward, _ in env.env.P[state][action]:
...                 v_actions[action] += trans_prob * (reward + gamma * V[new_state])
...         policy[state] = torch.argmax(v_actions)
...     return policy
It derives a new and better policy from the input policy values, based on the Bellman optimality equation.
With both components ready, we can now develop the whole policy iteration algorithm:
>>> def policy_iteration(env, gamma, threshold):
...     """
...     Solve a given environment with policy iteration algorithm
...     @param env: OpenAI Gym environment
...     @param gamma: discount factor
...     @param threshold: the evaluation will stop once changes in values for all states are less than the threshold
...     @return: optimal values and the optimal policy for the given environment
...     """
...     n_state = env.observation_space.n
...     n_action = env.action_space.n
...     policy = torch.randint(high=n_action, size=(n_state,)).float()
...     while True:
...         V = policy_evaluation(env, policy, gamma, threshold)
...         policy_improved = policy_improvement(env, V, gamma)
...         if torch.equal(policy_improved, policy):
...             return V, policy_improved
...         policy = policy_improved
This function does the following tasks:

- Initializes a random policy
- Computes the values of the policy with the policy evaluation step
- Obtains an improved policy based on the values with the policy improvement step
- If the new policy is the same as the old one, returns the latest values and the latest policy; otherwise, it repeats from the evaluation step with the new policy

We plug in the environment, discount factor, and convergence threshold:
>>> V_optimal, optimal_policy = policy_iteration(env, gamma, threshold)
>>> print('Optimal values:\n', V_optimal)
Optimal values:
tensor([0.5404, 0.4966, 0.4681, 0.4541, 0.5569, 0.0000, 0.3572, 0.0000, 0.5905, 0.6421, 0.6144, 0.0000, 0.0000, 0.7410, 0.8625, 0.0000])
>>> print('Optimal policy:\n', optimal_policy)
Optimal policy:
tensor([0., 3., 3., 3., 0., 3., 2., 3., 3., 1., 0., 3., 3., 2., 1., 3.])
We got the same results as the value iteration algorithm.
We have just solved the FrozenLake environment with the policy iteration algorithm. You may wonder how to choose between the value iteration and policy iteration algorithms. Take a look at the following table:
Table 14.1: Choosing between the policy iteration and value iteration algorithms
We solved a reinforcement learning problem using dynamic programming methods. They require a fully known transition matrix and reward matrix of an environment. And they have limited scalability for environments with many states. In the next section, we will continue our learning journey with the Monte Carlo method, which has no requirement of prior knowledge of the environment and is much more scalable.
Monte Carlo (MC)-based reinforcement learning is a model-free approach, which means it doesn't need a known transition matrix and reward matrix. In this section, you will learn about MC policy evaluation on the Blackjack environment, and solve the environment with MC Control algorithms. Blackjack is a typical environment with an unknown transition matrix. Let's first simulate the Blackjack environment.
Blackjack is a popular card game. The game has the following rules:

- The player competes against a dealer, and wins if the sum of their cards is higher than the dealer's without exceeding 21.
- Cards from 2 to 10 have face value; the face cards J, Q, and K are valued at 10; an ace can count as either 1 or 11 (a so-called usable ace when counted as 11).
- The player can keep requesting additional cards (hit) or stop receiving cards (stick).
- A player whose card sum exceeds 21 goes bust and loses.
- After the player sticks, the dealer keeps drawing cards until their sum reaches 17 or greater; if the dealer goes bust, the player wins.
The Blackjack environment (https://github.com/openai/gym/blob/master/gym/envs/toy_text/blackjack.py) in Gym is formulated as follows:

- An episode of the environment is one game of Blackjack.
- There are two possible actions: hit (1) and stick (0).
- A state is a 3-dimensional vector: the player's current points, the value of the dealer's revealed card, and whether or not the player has a usable ace.
- The reward is +1 if the player wins, -1 if the player loses, and 0 if there is a draw.
Now let's simulate the Blackjack environment and explore its states and actions:
First, we create a Blackjack instance:
>>> env = gym.make('Blackjack-v0')
>>> env.reset()
(7, 10, False)
It returns the initial state (a 3-dimensional vector):

- The player's current points (7 in this example)
- The value of the dealer's revealed card (10 in this example)
- Whether the player has a usable ace (False in this example)

The usable ace variable is True only if the player has an ace that can be counted as 11 without causing a bust. If the player doesn't have an ace, or has an ace but it busts, this state variable will be False.

For another state example, (18, 6, True), it means that the player has an ace counted as 11 and a 7, and that the dealer's revealed card has a value of 6.
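Accordingly, the observation space of the environment is a tuple of three discrete spaces, which we can inspect directly (the exact representation below is from the Gym version used here):
>>> print(env.observation_space)
Tuple(Discrete(32), Discrete(11), Discrete(2))
>>> print(env.action_space)
Discrete(2)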
Now let's take a hit action (1):
>>> env.step(1)
((13, 10, False), 0.0, False, {})
It returns a state (13, 10, False), a 0 reward, and the episode not being done (as in False).
We take another hit, and then stick:
>>> env.step(1)
((19, 10, False), 0.0, False, {})
>>> env.step(0)
((19, 10, False), 1.0, True, {})
The dealer draws its cards and goes bust, so the player wins and gets a +1 reward. The episode ends.
Feel free to play around with the Blackjack environment. Once you feel comfortable with the environment, you can move on to the next section, MC policy evaluation on a simple policy.
In the previous section, we applied dynamic programming to perform policy evaluation, which computes the value function of a policy. However, this won't work in most real-life situations, where the transition matrix is not known beforehand. In such cases, we can evaluate the value function using the MC method.
To estimate the value function, the MC method uses empirical mean return instead of expected return (as in dynamic programming). There are two approaches to compute the empirical mean return. One is first-visit, which averages returns only for the first occurrence of a state s among all episodes. Another one is every-visit, which averages returns for every occurrence of a state s among all episodes. Obviously, the first-visit approach has a lot less computation and is therefore more commonly used. And I will only cover the first-visit approach in this chapter.
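To illustrate the difference, consider a toy sketch where a single episode visits state s twice, with returns of 2 from the first occurrence and 0 from the second (made-up numbers):
>>> returns_of_s = [2, 0]  # returns from the 1st and 2nd occurrences of s
>>> first_visit_value = returns_of_s[0]  # only the first occurrence counts
>>> every_visit_value = sum(returns_of_s) / len(returns_of_s)
>>> print(first_visit_value, every_visit_value)
2 1.0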
In this section, we experiment with a simple policy, where we keep adding new cards until the total value reaches 18 (or 19, or 20 if you like). We perform first-visit MC evaluation on the simple policy as follows:

First, we define a function that simulates a Blackjack episode under the simple policy and returns the resulting states and rewards:
>>> def run_episode(env, hold_score):
...     state = env.reset()
...     rewards = []
...     states = [state]
...     while True:
...         action = 1 if state[0] < hold_score else 0
...         state, reward, is_done, info = env.step(action)
...         states.append(state)
...         rewards.append(reward)
...         if is_done:
...             break
...     return states, rewards
In each round of an episode, the agent takes a hit if the current score is less than hold_score, or sticks otherwise.
Next, we develop the function that performs first-visit MC prediction:
>>> from collections import defaultdict
>>> def mc_prediction_first_visit(env, hold_score, gamma, n_episode):
...     V = defaultdict(float)
...     N = defaultdict(int)
...     for episode in range(n_episode):
...         states_t, rewards_t = run_episode(env, hold_score)
...         return_t = 0
...         G = {}
...         for state_t, reward_t in zip(states_t[-2::-1], rewards_t[::-1]):
...             return_t = gamma * return_t + reward_t
...             G[state_t] = return_t
...         for state, return_t in G.items():
...             if state[0] <= 21:
...                 V[state] += return_t
...                 N[state] += 1
...     for state in V:
...         V[state] = V[state] / N[state]
...     return V
The function performs the following tasks:

- Runs n_episode episodes under the simple Blackjack policy with the run_episode function
- For each episode, computes the return G for the first visit of each state
- For each state, obtains the value by averaging its first-visit returns from all episodes

Note that here we ignore states where the player busts, since we know their values are -1.
We specify hold_score as 18 and the discount rate as 1, since a Blackjack episode is short enough, and will simulate 500,000 episodes:
>>> hold_score = 18
>>> gamma = 1
>>> n_episode = 500000
>>> value = mc_prediction_first_visit(env, hold_score, gamma, n_episode)
We then print the resulting values:
>>> print(value)
defaultdict(<class 'float'>, {(20, 6, False): 0.6923485653560042, (17, 5, False): -0.24390243902439024, (16, 5, False): -0.19118165784832453, (20, 10, False): 0.4326379146490474, (20, 7, False): 0.7686220540168588, (16, 6, False): -0.19249478804725503,
……
……
(5, 9, False): -0.20612244897959184, (12, 7, True): 0.058823529411764705, (6, 4, False): -0.26582278481012656, (4, 8, False): -0.14937759336099585, (4, 3, False): -0.1680327868852459, (4, 9, False): -0.20276497695852536, (4, 4, False): -0.3201754385964912, (12, 8, True): 0.11057692307692307})
We have just computed the values for all possible 280 states:
>>> print('Number of states:', len(value))
Number of states: 280
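We can also query the estimated value of any single state, for example, a player sum of 18 against a dealer's 10 without a usable ace (the state key here is just an example; the exact value will vary from run to run):
>>> print(value[(18, 10, False)])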
We have just experienced computing the values of 280 states under a simple policy in the Blackjack environment using the MC method. The transition matrix of the Blackjack environment is not known beforehand. Moreover, obtaining the transition matrix (size 280 * 280) will be extremely costly if we go with the dynamic programming approach. In the MC-based solution, we just need to simulate a bunch of episodes and compute the empirical average values. In a similar manner, we will search for the optimal policy in the next section.
MC control is used to find the optimal policy for environments with unknown transition matrices. There are two types of MC control, on-policy and off-policy. In the on-policy approach, we execute the policy and evaluate and improve it iteratively; whereas in the off-policy approach, we train the optimal policy using data generated by another policy.
In this section, we focus on the on-policy approach. The way it works is very similar to the policy iteration method. It iterates between the following two phases, evaluation and improvement, until convergence:

- In the evaluation phase, instead of evaluating the state values, it evaluates the action-values, commonly called Q-values. The Q-value, Q(s, a), is the value of a state-action pair: taking action a in state s and following the given policy afterwards. It is estimated with empirical first-visit returns.
- In the improvement phase, the policy is updated by assigning to each state the action with the highest Q-value: $\pi(s) = \arg\max_a Q(s, a)$.
Let's now search for the optimal Blackjack policy with on-policy MC control by following the steps below:
First, we develop a function that runs an episode by taking the best actions under the given Q-values:
>>> def run_episode(env, Q, n_action):
...     """
...     Run an episode given Q-values
...     @param env: OpenAI Gym environment
...     @param Q: Q-values
...     @param n_action: action space size
...     @return: resulting states, actions and rewards for the entire episode
...     """
...     state = env.reset()
...     rewards = []
...     actions = []
...     states = []
...     action = torch.randint(0, n_action, [1]).item()
...     while True:
...         actions.append(action)
...         states.append(state)
...         state, reward, is_done, info = env.step(action)
...         rewards.append(reward)
...         if is_done:
...             break
...         action = torch.argmax(Q[state]).item()
...     return states, actions, rewards
This serves as the improvement phase. Specifically, it does the following tasks:

- Initializes an episode and takes a random action as an exploring start
- After the first action, takes the action with the highest Q-value, given the current state
- Stores the states, actions, and rewards of all steps, to be used in the evaluation phase

Then we develop the on-policy MC control function:
>>> def mc_control_on_policy(env, gamma, n_episode):
...     """
...     Obtain the optimal policy with on-policy MC control method
...     @param env: OpenAI Gym environment
...     @param gamma: discount factor
...     @param n_episode: number of episodes
...     @return: the optimal Q-function, and the optimal policy
...     """
...     G_sum = defaultdict(float)
...     N = defaultdict(int)
...     Q = defaultdict(lambda: torch.empty(env.action_space.n))
...     for episode in range(n_episode):
...         states_t, actions_t, rewards_t = run_episode(env, Q, env.action_space.n)
...         return_t = 0
...         G = {}
...         for state_t, action_t, reward_t in zip(states_t[::-1], actions_t[::-1], rewards_t[::-1]):
...             return_t = gamma * return_t + reward_t
...             G[(state_t, action_t)] = return_t
...         for state_action, return_t in G.items():
...             state, action = state_action
...             if state[0] <= 21:
...                 G_sum[state_action] += return_t
...                 N[state_action] += 1
...                 Q[state][action] = G_sum[state_action] / N[state_action]
...     policy = {}
...     for state, actions in Q.items():
...         policy[state] = torch.argmax(actions).item()
...     return Q, policy
This function does the following tasks:

- Runs n_episode episodes with the run_episode function
- For each episode, computes the first-visit returns for each state-action pair
- Updates the Q-values by averaging the first-visit returns over all episodes
- Finally, derives the optimal policy by taking the best action in each state under the resulting Q-values

We specify the discount rate as 1 and will simulate 500,000 episodes:
>>> gamma = 1
>>> n_episode = 500000
>>> optimal_Q, optimal_policy = mc_control_on_policy(env, gamma, n_episode)
Take a look at the optimal policy:
>>> print(optimal_policy)
{(16, 8, True): 1, (11, 2, False): 1, (15, 5, True): 1, (14, 9, False): 1, (11, 6, False): 1, (20, 3, False): 0, (9, 6, False):
0, (12, 9, False): 0, (21, 2, True): 0, (16, 10, False): 1, (17, 5, False): 0, (13, 10, False): 1, (12, 10, False): 1, (14, 10, False): 0, (10, 2, False): 1, (20, 4, False): 0, (11, 4, False): 1, (16, 9, False): 0, (10, 8,
……
……
1, (18, 6, True): 0, (12, 2, True): 1, (8, 3, False): 1, (13, 3, True): 0, (4, 7, False): 1, (18, 8, True): 0, (6, 5, False): 1, (17, 6, True): 0, (19, 9, True): 0, (4, 4, False): 0, (14, 5, True): 1, (12, 6, True): 0, (4, 9, False): 1, (13, 4, True): 1, (4, 8, False): 1, (14, 3, True): 1, (12, 4, True): 1, (4, 6, False): 0, (12, 5, True): 0, (4, 2, False): 1, (4, 3, False): 1, (5, 4, False): 1, (4, 1, False): 0}
You may wonder if this optimal policy is really optimal and better than the previous simple policy (hold at 18 points). Let's simulate 100,000 Blackjack episodes under the optimal policy and the simple policy respectively:
>>> def simulate_hold_episode(env, hold_score):
...     state = env.reset()
...     while True:
...         action = 1 if state[0] < hold_score else 0
...         state, reward, is_done, _ = env.step(action)
...         if is_done:
...             return reward
>>> def simulate_episode(env, policy):
...     state = env.reset()
...     while True:
...         action = policy[state]
...         state, reward, is_done, _ = env.step(action)
...         if is_done:
...             return reward
>>> n_episode = 100000
>>> hold_score = 18
>>> n_win_opt = 0
>>> n_win_hold = 0
>>> for _ in range(n_episode):
...     reward = simulate_episode(env, optimal_policy)
...     if reward == 1:
...         n_win_opt += 1
...     reward = simulate_hold_episode(env, hold_score)
...     if reward == 1:
...         n_win_hold += 1
We print out the results as follows:
>>> print(f'Winning probability:\nUnder the simple policy: {n_win_hold/n_episode}\nUnder the optimal policy: {n_win_opt/n_episode}')
Winning probability:
Under the simple policy: 0.39955
Under the optimal policy: 0.42779
Playing under the optimal policy has a 43% chance of winning, while playing under the simple policy has only a 40% chance.
In this section, we solved the Blackjack environment with a model-free algorithm, MC learning. In MC learning, the Q-values are updated until the end of an episode. This could be problematic for long processes. In the next section, we will talk about Q-learning, which updates the Q-values for every step in an episode. You will see how it increases learning efficiency.
Q-learning is also a model-free learning algorithm. It updates the Q-function for every step in an episode. We will demonstrate how Q-learning is used to solve the Taxi environment. It is a typical environment with relatively long episodes. So let's first simulate the Taxi environment.
In the Taxi environment (https://gym.openai.com/envs/Taxi-v3/), the agent acts as a taxi driver, picking up a passenger at one location and dropping the passenger off at the destination.
All subjects are on a 5 * 5 grid. Take a look at the following example:
Figure 14.6: Example of the Taxi environment
Tiles in certain colors have the following meanings:

- Yellow: the empty taxi (without a passenger)
- Green: the taxi with a passenger inside
- Blue: the passenger's location
- Purple: the destination

The starting position of the empty taxi, the passenger's location, and the destination are randomly assigned in each episode. The four letters R, Y, B, and G mark the only four tiles that allow passenger pick-up and drop-off.
The taxi can take any of the following six deterministic actions:

- 0: moving south
- 1: moving north
- 2: moving east
- 3: moving west
- 4: picking up the passenger
- 5: dropping off the passenger
A pillar "|" between two tiles prevents the taxi from moving between them.
In each step, the reward follows these rules:

- -1 for every step taken, to encourage the shortest route
- +20 for a successful passenger drop-off
- -10 for attempting an illegal pick-up or drop-off
Last but not least, there are 500 possible states: the taxi can be on any of the 25 tiles; the passenger can be at any of R, Y, B, and G, or inside the taxi (five possibilities); and the destination can be any of R, Y, B, and G (four possibilities). Hence, we have 25 * 5 * 4 = 500 possible states.
Now let's play around with the environment as follows:
>>> env = gym.make('Taxi-v3')
>>> n_state = env.observation_space.n
>>> print(n_state)
500
>>> n_action = env.action_space.n
>>> print(n_action)
6
We also know that the state is represented by an integer ranging from 0 to 499, and there are 6 possible actions.
>>> env.reset()
262
>>> env.render()
You will see a 5 * 5 grid similar to the following one:
Figure 14.7: Example starting step of the Taxi environment
The passenger is on the blue R tile, and the destination is on the purple Y.
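As a side note, the underlying TaxiEnv exposes encode and decode helpers for this 500-state representation (available in the Gym version used here); decoding our initial state 262 recovers the taxi's row and column, the passenger's location, and the destination:
>>> print(list(env.env.decode(262)))
[2, 3, 0, 2]
That is, the taxi is at row 2, column 3; the passenger is at location 0 (R); and the destination is location 2 (Y).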
Now let's drive the taxi to the passenger's location: we go west three tiles (action 3) and north two tiles (action 1), then pick the passenger up (action 4). Your moves may differ, depending on the initial state:
>>> print(env.step(3))
(242, -1, False, {'prob': 1.0})
>>> print(env.step(3))
(222, -1, False, {'prob': 1.0})
>>> print(env.step(3))
(202, -1, False, {'prob': 1.0})
>>> print(env.step(1))
(102, -1, False, {'prob': 1.0})
>>> print(env.step(1))
(2, -1, False, {'prob': 1.0})
>>> print(env.step(4))
(18, -1, False, {'prob': 1.0})
Render the environment:
>>> env.render()
Figure 14.8: Example of a state where the passenger is inside the taxi
The taxi turns green, meaning the passenger is inside the taxi.
Now we drive to the destination: we go south four tiles (action 0), then drop the passenger off (action 5):
>>> print(env.step(0))
(118, -1, False, {'prob': 1.0})
>>> print(env.step(0))
(218, -1, False, {'prob': 1.0})
>>> print(env.step(0))
(318, -1, False, {'prob': 1.0})
>>> print(env.step(0))
(418, -1, False, {'prob': 1.0})
>>> print(env.step(5))
(410, 20, True, {'prob': 1.0})
A +20 reward is granted in the end for a successful drop-off.
>>> env.render()
Figure 14.9: Example of a state where the passenger arrives at the destination
You can take some random actions and see how difficult it is for a model to solve the environment. We will discuss the Q-learning algorithm in the next section.
Q-learning is an off-policy learning algorithm: it optimizes the Q-values based on data generated by a behavior policy, while the target policy (the policy we attempt to optimize) is a greedy policy that, for a given state, takes the action achieving the highest Q-value. The behavior policy generates the learning data, and the target policy updates the Q-values based on the following equation:

$$Q(s, a) := Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here, $s'$ is the resulting state after taking action $a$ in state $s$, and $r$ is the associated reward. The term $\max_{a'} Q(s', a')$ means that the greedy target policy picks the highest Q-value in state $s'$. Finally, the hyperparameters $\alpha$ and $\gamma$ are the learning rate and the discount factor, respectively.
Learning from experience generated by another policy enables Q-learning to optimize its Q-values in every single step of an episode: it takes the value of the greedy action in the next state and uses it to update the Q-values right away.
One more thing to note is that the behavior policy is epsilon-greedy, meaning it takes a random action with a probability of $\epsilon$ (a value from 0 to 1), and takes the greedy action (the one with the highest Q-value) with a probability of $1 - \epsilon$. The epsilon-greedy policy combines exploitation and exploration: it exploits the best action while also exploring different actions.
Now it is time to develop the Q-learning algorithm to solve the Taxi environment:
We start by defining the epsilon-greedy behavior policy:
>>> def gen_epsilon_greedy_policy(n_action, epsilon):
...     def policy_function(state, Q):
...         probs = torch.ones(n_action) * epsilon / n_action
...         best_action = torch.argmax(Q[state]).item()
...         probs[best_action] += 1.0 - epsilon
...         action = torch.multinomial(probs, 1).item()
...         return action
...     return policy_function
Given $|A|$ possible actions, each action is taken with a probability of $\epsilon / |A|$, and the action with the highest state-action value is chosen with an additional probability of $1 - \epsilon$.
>>> epsilon = 0.1
>>> epsilon_greedy_policy = gen_epsilon_greedy_policy(env.action_space.n, epsilon)
Here, $\epsilon = 0.1$, which is the exploration ratio.
Next, we develop the Q-learning algorithm:
>>> def q_learning(env, gamma, n_episode, alpha):
...     """
...     Obtain the optimal policy with off-policy Q-learning method
...     @param env: OpenAI Gym environment
...     @param gamma: discount factor
...     @param n_episode: number of episodes
...     @param alpha: learning rate
...     @return: the optimal Q-function, and the optimal policy
...     """
...     n_action = env.action_space.n
...     Q = defaultdict(lambda: torch.zeros(n_action))
...     for episode in range(n_episode):
...         state = env.reset()
...         is_done = False
...         while not is_done:
...             action = epsilon_greedy_policy(state, Q)
...             next_state, reward, is_done, info = env.step(action)
...             delta = reward + gamma * torch.max(Q[next_state]) - Q[state][action]
...             Q[state][action] += alpha * delta
...             length_episode[episode] += 1
...             total_reward_episode[episode] += reward
...             if is_done:
...                 break
...             state = next_state
...     policy = {}
...     for state, actions in Q.items():
...         policy[state] = torch.argmax(actions).item()
...     return Q, policy
We first initialize the Q-table. Then, in each episode, we let the agent take actions following the epsilon-greedy policy and update the Q-function for each step based on the off-policy learning equation. We also record the length and total reward of each episode in two global lists, which we define next. We run n_episode episodes and finally obtain the optimal policy and Q-values.
We specify 1,000 episodes and initiate the two lists that record the length and total reward of each episode:
>>> n_episode = 1000
>>> length_episode = [0] * n_episode
>>> total_reward_episode = [0] * n_episode
>>> gamma = 1
>>> alpha = 0.4
>>> optimal_Q, optimal_policy = q_learning(env, gamma, n_episode, alpha)
Here, we set the discount rate $\gamma = 1$ and the learning rate $\alpha = 0.4$.
Now let's visualize how the total reward changes over the episodes:
>>> import matplotlib.pyplot as plt
>>> plt.plot(total_reward_episode)
>>> plt.title('Episode reward over time')
>>> plt.xlabel('Episode')
>>> plt.ylabel('Total reward')
>>> plt.ylim([-200, 20])
>>> plt.show()
Refer to the following screenshot for the end result:
Figure 14.10: Total rewards over episodes
The total rewards keep improving during learning. And they stay around +5 after 600 episodes.
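We can verify this by averaging the total rewards of, say, the last 100 episodes (the exact number will vary from run to run):
>>> print(sum(total_reward_episode[-100:]) / 100)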
We also visualize how the episode lengths change over time:
>>> plt.plot(length_episode)
>>> plt.title('Episode length over time')
>>> plt.xlabel('Episode')
>>> plt.ylabel('Length')
>>> plt.show()
Refer to the following screenshot for the end result:
Figure 14.11: Episode lengths over episodes
As you can see, the episode lengths decrease from the maximum of 200 steps to around 10, and the model converges after around 600 episodes. This means that, after training, the model is able to solve the problem in around 10 steps.
In this section, we solved the Taxi problem with off-policy Q-learning. The algorithm optimizes the Q-values in every single step by learning from the experience generated by a greedy policy.
We started the chapter by setting up the working environment. After that, we studied the fundamentals of reinforcement learning along with a few examples. After exploring the FrozenLake environment, we solved it with two dynamic programming algorithms, value iteration and policy iteration. We talked about Monte Carlo learning and used it for value approximation and control in the Blackjack environment. Lastly, we developed the Q-learning algorithm and solved the Taxi problem.