Policy gradients for learning policy functions

The problem policy gradients aims to solve is a more general version of the central problem of reinforcement learning: how can you use backpropagation on a task that has no gradient from the reward back to our parameters? To give a more concrete example, we have a neural network that produces the probability of taking an action a, given a state s and some parameters θ, which are the weights of our neural network:

$$\pi(a \mid s, \theta)$$

We also have our reward signal R. The actions we take affect the reward signal, but there is no gradient between them and the parameters. There is no equation into which we can plug R; it is just a value we obtain from our environment in response to a.

However, given that we know there is a link between the a we choose and R, there are a few things we could try. We could sample a population of values for our θ from a Gaussian distribution and run them in the environment. We could then select the most successful percentage of that population and compute their mean and variance. We then create a new population of θ using the new mean and variance in our Gaussian distribution. We can keep doing this iteratively until we stop seeing improvements in R and then use our final mean as the best choice for our parameters. This method is known as the Cross Entropy Method.
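
To make the idea concrete, here is a minimal sketch of the Cross Entropy Method for a generic parameter vector. The evaluate function, the population size, and the elite fraction are illustrative assumptions, not part of the original code:

import numpy as np

def cross_entropy_method(evaluate, n_params, iterations=50,
                         population_size=100, elite_frac=0.2):
    """Iteratively refit a Gaussian over parameters around the best candidates.

    evaluate: a user-supplied function mapping a parameter vector to the
    total reward R obtained by running it in the environment (assumed).
    """
    mean = np.zeros(n_params)
    std = np.ones(n_params)
    n_elite = int(population_size * elite_frac)

    for _ in range(iterations):
        # Sample a population of candidate parameter vectors.
        population = np.random.randn(population_size, n_params) * std + mean
        rewards = np.array([evaluate(theta) for theta in population])

        # Keep the best-performing candidates and refit the Gaussian to them.
        elite = population[np.argsort(rewards)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6

    return mean  # the final mean is our best guess for the parameters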

Though it can be quite successful, it is a hill-climbing method, which does not do a good job of exploring the space of possibilities. It is very likely to get stuck in local optima, which is very common in reinforcement learning. Also, it still does not take advantage of gradient information.

To use gradients, we can take advantage of the fact that although there is no mathematical relationship between a and R, there is a probabilistic one: certain actions a taken in certain states s will tend to receive more R than others. We can write the problem of getting the gradient of the expected reward with respect to θ as follows:

$$\nabla_\theta \mathbb{E}[R] = \nabla_\theta \sum_t \sum_a \pi(a \mid s_t, \theta)\, r_t$$

Here, $r_t$ is the reward at time step t. This can be rearranged into:

$$\nabla_\theta \mathbb{E}[R] = \sum_t \sum_a r_t\, \nabla_\theta\, \pi(a \mid s_t, \theta)$$

If we multiply and divide it by $\pi(a \mid s_t, \theta)$, we have the following:

$$\nabla_\theta \mathbb{E}[R] = \sum_t \sum_a r_t\, \pi(a \mid s_t, \theta)\, \frac{\nabla_\theta\, \pi(a \mid s_t, \theta)}{\pi(a \mid s_t, \theta)}$$

Use the fact that $\nabla_\theta \log \pi(a \mid s_t, \theta) = \frac{\nabla_\theta\, \pi(a \mid s_t, \theta)}{\pi(a \mid s_t, \theta)}$ and simplify it to the following:

$$\nabla_\theta \mathbb{E}[R] = \sum_t \sum_a \pi(a \mid s_t, \theta)\, r_t\, \nabla_\theta \log \pi(a \mid s_t, \theta) = \mathbb{E}\!\left[\sum_t r_t\, \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right]$$

What this amounts to is that if we nudge our parameters in the direction of the gradient of the log probability of the action taken at each time step, scaled by the reward, we tend to move in the direction of greater expected reward across all time steps. To implement this in Python, we will need to take the following steps:

  1. Create a neural network whose output is the probability of taking different actions, given an input state. In terms of the preceding equations, it will represent $\pi(a \mid s, \theta)$.
  2. Run a batch of episodes with our agent running in its environment. Select its actions randomly according to the probability distribution output of the network. At every time step, record the input state, reward received, and the action you actually took.
  3. At the end of each episode of training, assign rewards to each step using the sum of rewards in the episode from that point on. In the case of a game such as Go, this will just be 1, 0, or -1 representing the final result applied to each step. This will represent $r_t$ in the equations. For more dynamic games, discounted rewards can be used; discounted rewards will be explained in detail in the next chapter.
  4. Once we have stored a set number of states from running our episodes, we train the network by updating its parameters based on the log of the network output for the move that was actually taken, multiplied by the reward. The negative of this quantity is used as the loss function of our neural network. We apply this across all stored time steps as a single batch update.
  5. This is then repeated from step 2 until we hit a stopping point, either at some number of iterations or some score within the environment.

The effect of this loop is that if an action is associated with positive rewards, we increase the parameters that lead to this action in that state. If the reward is negative, we decrease the parameters leading to the action. Note that for this to work, it requires us to have some negative valued rewards; otherwise, over time, all actions are simply pulled up. The best option if this does not occur naturally is to normalize our rewards in each batch.
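
As a minimal illustration of why this helps, normalizing rewards within a batch turns a set of all-positive returns into a mix of positive and negative values, so below-average outcomes get pushed down rather than merely pulled up more slowly. The numbers here are made up purely for illustration:

import numpy as np

# Suppose every game in a batch was won, but some wins took longer
# (and so were scaled down, as in the training loop later on).
rewards = np.array([1.0, 0.2, 0.14, 1.0, 0.11])

# Normalize to zero mean and unit variance across the batch.
normalized = rewards - np.mean(rewards)
normalized /= np.std(normalized)

print(normalized)  # quick wins stay positive, slow wins become negative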

The policy gradient approach has been shown to be successful at learning a range of complex tasks, although it can take a very long time to train well and is very sensitive to the learning rate. If the learning rate is too high, the behavior will oscillate wildly, never becoming stable enough to learn anything of note; if it is too low, it will never converge. This is why in the following example, we use RMSProp as the optimizer. Standard gradient descent with a fixed learning rate is often a lot less successful. Also, although the example shown here is for board games, policy gradients also work very well for learning more dynamic games, such as Pong.

Now let's create player_func for our tic-tac-toe's play_game method; it uses policy gradients to learn the optimal play. We will set up the neural network that will take the nine squares of the board as input. The number 1 will be a mark for the player, -1 for the opponent, and 0 an unmarked square. Here the network will be set up with three hidden layers, each with 100 hidden nodes and relu activation functions. The output layer will also contain nine nodes, one for each board square. Because we want our final output to be the probability of a move being the best one, we want the output of all the nodes in the final layer to sum to 1. This means using a softmax activation function is a natural choice. The softmax activation function looks as follows:

$$y_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

Here, x and y are vectors with equal dimensions.
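
As a quick sanity check, here is the same function written in NumPy. This standalone helper is just for illustration; the network below uses TensorFlow's built-in tf.nn.softmax:

import numpy as np

def softmax(x):
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # ~[0.09, 0.245, 0.665], sums to 1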

Here is the code for creating the network in TensorFlow. The full code can also be found in the GitHub repo in policy_gradients.py:

import numpy as np
import tensorflow as tf

HIDDEN_NODES = (100, 100, 100) 
INPUT_NODES = 3 * 3 
LEARN_RATE = 1e-4
OUTPUT_NODES = INPUT_NODES

First, we import NumPy and TensorFlow, which will be used for the network, and create a few constants, which will be used later. The 3 * 3 input nodes correspond to the size of the board:

input_placeholder = tf.placeholder("float", shape=(None, INPUT_NODES))

The input_placeholder variable is the placeholder that holds the input to the neural network. In TensorFlow, placeholder objects are used for all values provided to the network. When running the network, it will be set to the board_state of the game. Also, the first dimension of input_placeholder is None. This is because, as mentioned a few times in this book, training on mini-batches is much faster. The None will adjust to the size of our mini-batch of samples at training time:

hidden_weights_1 = tf.Variable(tf.truncated_normal((INPUT_NODES, HIDDEN_NODES[0]), stddev=1. / np.sqrt(INPUT_NODES)))
hidden_weights_2 = tf.Variable(
    tf.truncated_normal((HIDDEN_NODES[0], HIDDEN_NODES[1]), stddev=1. / np.sqrt(HIDDEN_NODES[0])))
hidden_weights_3 = tf.Variable(
    tf.truncated_normal((HIDDEN_NODES[1], HIDDEN_NODES[2]), stddev=1. / np.sqrt(HIDDEN_NODES[1])))
# Xavier-style initialization scales by the square root of the fan-in,
# which for the output layer is HIDDEN_NODES[-1].
output_weights = tf.Variable(tf.truncated_normal((HIDDEN_NODES[-1], OUTPUT_NODES), stddev=1. / np.sqrt(HIDDEN_NODES[-1])))

Here we create the weights we will need for the three hidden layers and the output layer of our network. They are all given random Xavier initialization; weight initialization is covered in more detail elsewhere in this book:

hidden_layer_1 = tf.nn.relu(
    tf.matmul(input_placeholder, hidden_weights_1) +
    tf.Variable(tf.constant(0.01, shape=(HIDDEN_NODES[0],))))

To create the first hidden layer, we take our hidden_weights_1 2D tensor and matrix multiply it by input_placeholder. Then we add the bias variable, tf.Variable(tf.constant(0.01, shape=(HIDDEN_NODES[0],))), which gives the network a bit more flexibility in learning patterns. The output is then put through a relu activation function: tf.nn.relu. This is how we write the basic equation for a layer of a neural network in TensorFlow. The other thing to note is the 0.01. When using the relu function, it is good practice to add a small amount of positive bias. This is because the relu function is the maximum of its input and 0, which means that values below 0 will have no gradient and so will not be adjusted during learning. If a node's activation is always below zero, because of bad luck with weight initialization, it is considered a dead node: it will never have an impact on the network and will simply take up GPU/CPU cycles. A small amount of positive bias greatly reduces the chance of having completely dead nodes in the network:

hidden_layer_2 = tf.nn.relu(
    tf.matmul(hidden_layer_1, hidden_weights_2) +
    tf.Variable(tf.truncated_normal((HIDDEN_NODES[1],), stddev=0.001)))
hidden_layer_3 = tf.nn.relu(
    tf.matmul(hidden_layer_2, hidden_weights_3) +
    tf.Variable(tf.truncated_normal((HIDDEN_NODES[2],), stddev=0.001)))
output_layer = tf.nn.softmax(
    tf.matmul(hidden_layer_3, output_weights) +
    tf.Variable(tf.truncated_normal((OUTPUT_NODES,), stddev=0.001)))

The next few layers are created in the same way:

reward_placeholder = tf.placeholder("float", shape=(None,))
actual_move_placeholder = tf.placeholder("float", shape=(None, OUTPUT_NODES))

For the loss function, we need two additional placeholders. One of them is for the reward we receive from the environment, in this case, the result of our game of tic-tac-toe. The other is meant for the actual action we will take at each time step. Remember, we will choose our moves according to a stochastic policy based on the output of our network. When we adjust our parameters, we need to know the actual move we took so we can move the parameters towards it if we have a positive reward and away from it if the reward is negative:

# reward times the log-probability of the move actually taken, summed over the batch
policy_gradient = tf.reduce_sum(
    tf.reshape(reward_placeholder, (-1, 1)) *
    actual_move_placeholder * tf.log(output_layer))

train_step = tf.train.RMSPropOptimizer(LEARN_RATE).minimize(-policy_gradient)

The actual_move_placeholder will be fed a one-hot vector, for example, [0, 0, 0, 0, 1, 0, 0, 0, 0], with the 1 marking the square in which the actual move was played. This acts as a mask across the network output, so that only the gradients for that move are adjusted; success or failure in moving to the first square tells us nothing about the success or failure of moving to the second square. Multiplying by reward_placeholder tells us whether we want to increase the weights leading to this move or reduce them. We then give policy_gradient to our optimizer; we want to maximize our reward, which means minimizing its negative.
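
Here is a small NumPy illustration of what that expression computes for a single time step; the numbers are made up purely for illustration:

import numpy as np

output = np.array([0.05, 0.05, 0.6, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])  # softmax output
actual_move = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0])  # one-hot mask of the move taken
reward = 1.0  # the (normalized) reward for this episode

# Only the log-probability of the move actually played contributes;
# a positive reward makes the optimizer push that probability up.
contribution = reward * np.sum(actual_move * np.log(output))
print(contribution)  # log(0.6) ~= -0.51, scaled by the reward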

One final point is that here we are using RMSPropOptimizer. As mentioned before, policy gradients are very sensitive to the learning rate and to the type of optimizer used. RMSProp has been shown to work well.

Within TensorFlow, variables also need to be initialized within a session; this session will then be used to run our calculations:

sess = tf.Session()
sess.run(tf.initialize_all_variables())

Now we need a method for running our network to choose the actions that will be passed in to the play_game method that we created previously:

board_states, actual_moves, rewards = [], [], []

def make_move(board_state):
    board_state_flat = np.ravel(board_state)
    board_states.append(board_state_flat)
    probability_of_actions = sess.run(output_layer, feed_dict={input_placeholder: [board_state_flat]})[0]

In the make_move method, we do a few different things. First, we flatten board_state, which starts as a 2D array, into the one-dimensional array we need as input for the network. We then append that state to our board_states list so we can use it later for training, once we have the reward for the episode. We then run the network using our TensorFlow session to get probability_of_actions: an array of nine numbers that sum to one, which the network will learn to make the probability of each move being the current most favorable one:

    try:
        move = np.random.multinomial(1, probability_of_actions)
    except ValueError:
        move = np.random.multinomial(1, probability_of_actions / (sum(probability_of_actions) + 1e-7))

We now use probability_of_actions as the input to a multinomial distribution. np.random.multinomial draws samples from the distribution you pass; because we gave 1 as the first argument, only a single draw is made, and the result is a one-hot array marking the move we will make. The try…except around the multinomial call exists because, owing to small rounding errors, probability_of_actions sometimes sums to slightly more than 1, which np.random.multinomial rejects with a ValueError. This only happens roughly once every 10,000 calls, so we will be Pythonic: if it fails, simply rescale by some small epsilon and try again:
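
For example, a quick look at what np.random.multinomial returns here (the probabilities are illustrative):

import numpy as np

probability_of_actions = [0.1, 0.0, 0.7, 0.0, 0.1, 0.0, 0.1, 0.0, 0.0]

# One draw from the distribution: a one-hot array with a single 1
# in the index of the sampled move, most often index 2 here.
print(np.random.multinomial(1, probability_of_actions))
# e.g. array([0, 0, 1, 0, 0, 0, 0, 0, 0])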

    actual_moves.append(move)

    # integer division and modulo convert the flat index into board coordinates
    move_index = move.argmax()
    return move_index // 3, move_index % 3

The last part of the make_move method stores the move we actually took so it can be used later in training, then returns the move in the format that our tic-tac-toe game expects: a tuple of two integers, one for the x position and one for the y position.

The final step before training is that once we have a complete batch to train on, we need to normalize the rewards from the batch. There are a few advantages to this. First, during early training, when the agent is losing or winning almost all of its games, we want to encourage the network to move towards the better examples; normalizing gives extra weight to those rare, more significant examples. Also, normalizing rewards across the batch tends to speed up training because it reduces the variance of the targets:

BATCH_SIZE = 100
episode_number = 1

We define a constant, BATCH_SIZE, for how many examples go into each mini-batch for training. Many different values work well; 100 is one of them. episode_number will keep track of how many game loops we have done, which tells us when we need to kick off a mini-batch training run:

while True:
    reward = play_game(make_move, random_player)

while True puts us into the main loop. The first step here is to run a game using our old friend, the play_game method from earlier in the chapter. For simplicity's sake, we will always have the policy gradient player, using the make_move method, as the first player and random_player as the second player. It would not be too difficult to change it to alternate the order of moves; a brief aside below sketches one way this might be done:
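
As an aside, not part of the main loop, a hypothetical helper for alternating who goes first might look like this. It assumes play_game returns the result from the first player's perspective, so the reward is negated when our policy gradient player moves second:

def play_alternating(episode_number):
    # Policy gradient player goes first on even episodes, second on odd ones.
    if episode_number % 2 == 0:
        return play_game(make_move, random_player)
    else:
        # Flip the sign so the reward is always from our player's perspective
        # (assuming play_game scores from the first player's point of view).
        return -play_game(random_player, make_move)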

    
    last_game_length = len(board_states) - len(rewards)

    # we scale here
    reward /= float(last_game_length)

    rewards += ([reward] * last_game_length)

Get the length of the game we just played and append the reward we received for it to the rewards array, so that each board state from that game gets the same final reward. In reality, some moves may have had more or less impact on the final result than others, but we cannot know that here; we hope that, through training, with similarly good states showing up with positive rewards more often, the network will learn this over time. We also scale the reward by last_game_length, so winning quickly is better than winning slowly, and losing slowly is better than losing quickly. Another point to note is that if we were running a game with a more unevenly distributed reward, such as Pong, where most frames would have 0 reward with just the occasional non-zero one, this is where we might apply future discounting across the time steps of the episode:

    
    episode_number += 1

    if episode_number % BATCH_SIZE == 0:
        normalized_rewards = rewards - np.mean(rewards)
        normalized_rewards /= np.std(normalized_rewards)

        sess.run(train_step, feed_dict={input_placeholder: board_states, reward_placeholder: normalized_rewards, actual_move_placeholder: actual_moves})

Increment episode_number, and if we have a BATCH_SIZE worth of samples, jump into the training code. We start by normalizing our rewards across the batch. This is not always necessary, but it is almost always advisable because it has many benefits: it tends to reduce training time by reducing variance across training, and if we have issues with all rewards being positive or negative, it solves them without us having to give it a second thought. Finally, we kick off the training by running the train_step operation through the TensorFlow session object:

        del board_states[:]
        del actual_moves[:]
        del rewards[:]

Finally, clear the current mini-batch to make way for the next one. Now let's see how policy gradients perform:

[Figure: the policy gradient player's winning percentage against a random opponent over the course of training]

As you can see, it eventually achieves a respectable 85 percent winning rate. With more time and tuning of hyperparameters, it could do even better. Also, note that even a random player who only chooses valid moves achieves a greater than 50 percent winning rate here: the observed player always goes first, which is an advantage.
