Dynamic games

Now that we have learned the world's simplest game, let's try learning something a bit more dynamic. The cart pole task is a classic reinforcement learning problem. The agent must control a cart on which a pole is balanced, attached to the cart via a joint. At every step, the agent can choose to move the cart left or right, and it receives a reward of 1 for every time step that the pole stays balanced. If the pole ever deviates by more than 15 degrees from upright, the game ends:


Figure 5: The cart pole task

To run the cart pole task, we will use OpenAI Gym, an open source project set up in 2015, which provides a consistent way to run reinforcement learning agents against a range of environments. At the time of writing, OpenAI Gym supports a whole range of Atari games, and even some more complex games, such as Doom, with minimal setup. It can be installed using pip by running this:

pip install gym[all]

Running cart pole in Python can be done as follows:

import gym

env = gym.make('CartPole-v0')
current_state = env.reset()

The gym.make method creates the environment that our agent will run in. Passing in the "CartPole-v0" string tells OpenAI Gym that we want the cart pole task. The returned env object is used to interact with the cart pole game. The env.reset() method puts the environment into its initial state, returning an array that describes it. Calling env.render() displays the current state visually, and subsequent calls to env.step(action) allow us to interact with the environment, returning the new state that results from each action.
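To make that interaction pattern concrete, here is a minimal sketch that simply drives the environment with random actions; it only illustrates the env API described above, not the learning agent we are about to build:

import gym

env = gym.make('CartPole-v0')
state = env.reset()
for _ in range(1000):
    env.render()
    action = env.action_space.sample()  # pick left (0) or right (1) at random
    state, reward, terminal, info = env.step(action)
    if terminal:
        state = env.reset()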

In what ways will we need to modify our simple 1-D game code in order to learn the cart-pole challenge? We no longer have access to a well-defined position; instead, the cart pole environment gives us as input an array of four floating point values that describe the position and angle of the cart and pole. These will be the input into our neural network, which will consist of one hidden layer with 20 nodes and a tanh activation function, leading to an output layer with two nodes. One output node will learn the expected reward for a move left in the current state, the other the expected reward for a move right. Here is what that code looks like (the full code sample is in deep_q_cart_pole.py in the git repo):

import tensorflow as tf

# four state inputs -> 20 hidden tanh units -> two Q-value outputs (left, right)
feed_forward_weights_1 = tf.Variable(tf.truncated_normal([4, 20], stddev=0.01))
feed_forward_bias_1 = tf.Variable(tf.constant(0.0, shape=[20]))

feed_forward_weights_2 = tf.Variable(tf.truncated_normal([20, 2], stddev=0.01))
feed_forward_bias_2 = tf.Variable(tf.constant(0.0, shape=[2]))

input_placeholder = tf.placeholder("float", [None, 4])
hidden_layer = tf.nn.tanh(tf.matmul(input_placeholder, feed_forward_weights_1) + feed_forward_bias_1)
output_layer = tf.matmul(hidden_layer, feed_forward_weights_2) + feed_forward_bias_2

Why one hidden layer with 20 nodes? Why use a tanh activation function? Picking hyperparameters is a dark art; the best answer I can give is that, when tried, these values worked well. But knowing that they worked well in practice, and knowing something about the level of complexity needed to solve the cart pole problem, we can make a guess about why, which may guide us in picking hyperparameters for other networks and tasks.

One rule of thumb for the number of hidden nodes in supervised learning is that it should be somewhere between the number of input nodes and the number of output nodes; often, two-thirds of the number of inputs is a good region to look at. Here, however, we have chosen 20, five times larger than the number of input nodes. In general, there are two reasons for favoring fewer hidden nodes. The first is computation time: fewer units mean our network is quicker to run and train. The second is to reduce overfitting and improve generalization. You will have learned from the previous chapters about overfitting and the risk that too complex a model learns the training data exactly but loses the ability to generalize to new data points.

In reinforcement learning, neither of these issues is as important. Though we care about computation time, often a lot of the bottleneck is the time spent running the game, so a few extra nodes are of less concern. As for generalization, we don't have a division between test set and training set; we just have an environment in which an agent gets a reward. So overfitting is not something we have to worry about (until we start to train agents that can operate across multiple environments). This is also why you often won't see reinforcement learning agents use regularizers. The caveat is that the distribution of our training data may change significantly as our agent changes over the course of training; there is always the risk that the agent overfits to the early samples from the environment, making learning more difficult later.

Given these issues, it makes sense to choose an arbitrarily large number of nodes in the hidden layer in order to give the best chance of learning complex interactions between inputs. But the only true way to know is by testing. Figure 6 shows the results of running a neural network with three hidden nodes against the cart pole task. As you can see, though it is able to learn eventually, it performs a lot worse than the 20 hidden node network shown in Figure 7:


Figure 6: Cart pole with three hidden nodes, y = average reward of last 10 games, x = games played

Why only one hidden layer? The complexity of the task can help us estimate this. If we think about the cart pole task, we know that we care about the inter-relationship of the input parameters: the position of the pole may be good or bad depending on the position of the cart. This level of interaction means that a purely linear combination of the inputs may not be enough. This guess can be confirmed by a quick run, which shows that though a network with no hidden layers can learn this task better than random, it performs far worse than a network with a single hidden layer.

Would a deeper network be better? Maybe, but for tasks with only this slight degree of complexity, more layers tend not to improve things. Running the network confirms that extra hidden layers appear to make little difference. One hidden layer gives us the capacity we need to learn what we want for this task.

As for the choice of tanh, there are a few factors to think about. The reason relu activation functions have become popular for deep networks is saturation. When running a many-layered network with activation functions bounded to a narrow range, for example, the 0 to 1 of a logistic function, lots of nodes will learn to activate close to the maximum of 1: they saturate. But we often want a node to signal more strongly when it has a more extreme input. This is why relu has been so popular: it gives non-linearity to a layer while not bounding its maximum activation. This is especially important in many-layered networks because early layers may get extreme activations that are useful to signal forward to later layers.

With only one hidden layer, this is not a concern, so a sigmoid function makes sense. The output layer will be able to learn to scale the values from our hidden layer to what they need to be. Is there any reason to favor tanh over the logistic function? We know that our target will sometimes be negative, and that some combinations of parameters can be either good or bad depending on their relative values. That suggests that the -1 to 1 range of the tanh function might be preferable to the logistic function, where the bias would first have to be learned before negative associations could be judged. This is a lot of conjecture and reasoning after the fact; ultimately, the best answer is that this combination works very well on this task, but hopefully it gives some feeling for where to start when guessing at hyperparameters for other, similar problems.
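A quick way to see the difference in output ranges is the following throwaway illustration (not part of the cart pole script):

import numpy as np

x = np.array([-2.0, 0.0, 2.0])
print(np.tanh(x))                # roughly [-0.96, 0.0, 0.96]: symmetric around zero
print(1.0 / (1.0 + np.exp(-x)))  # roughly [0.12, 0.5, 0.88]: squashed into 0 to 1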

To get back to the code, here is what our loss and train functions will look like for our cart pole task:

action_placeholder = tf.placeholder("float", [None, 2])
target_placeholder = tf.placeholder("float", [None])

q_value_for_state_action = tf.reduce_sum(tf.multiply(output_layer, action_placeholder), axis=1)

The q_value_for_state_action variable will be the Q-value that the network predicts for a given state and action. Multiplying output_layer by the action_placeholder vector, which is 0 everywhere except for a 1 at the action we took, and then summing across it, means that our output is our neural network's approximation of the expected value for just that action:

cost = tf.reduce_mean(tf.square(target_placeholder -
                                q_value_for_state_action))
train_operation = tf.train.RMSPropOptimizer(0.001).minimize(cost)

Our cost is the difference between what we think is the expected return of the state and action and what it should be as defined by the target_placeholder.
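If the masking step looks opaque, here is a toy NumPy illustration of the same idea, with invented numbers (not part of the repo script):

import numpy as np

# hypothetical Q-value estimates for one state: [move left, move right]
network_output = np.array([0.7, 1.3])
# one-hot vector for the action actually taken (here, move right)
action_taken = np.array([0.0, 1.0])
# element-wise multiply then sum picks out the Q-value of the chosen action: 1.3
q_value_for_action = np.sum(network_output * action_taken)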

One of the downsides to the policy gradient approach described in Chapter 7, Deep Learning for Board Games, is that all training must be done against the environment. A set of policy parameters can only be evaluated by seeing its effect on the environment's reward. With Q-learning, we are instead trying to learn how to value a state and action. As our ability to value specific states improves, we can use that new information to better value the previous states we have experienced. So, rather than always training on the currently experienced state, we can have our network store a history of states and train against those. This is known as experience replay.

Experience replay

Every time we take an action and get into a new state, we store a tuple of previous_state, action_taken, next_reward, next_state, and next_terminal. These five pieces of information are all we need to run a q-learning training step. As we play the game, we will store this as a list of observations.

Another difficulty that experience replay helps solve is that in reinforcement learning, it can be very hard for training to converge. Part of the reason for this is that the data we train on is heavily correlated. A series of states experienced by a learning agent will be closely related; a time series of states and actions leading to a reward, if trained on together, will have a large impact on the weights of the network and can undo a lot of the previous training. One of the assumptions of neural networks is that the training samples are independent samples from a distribution. Experience replay helps with this problem because our training mini-batches can be randomly sampled from our memory, making it unlikely that the samples are correlated.

A learning algorithm that learns from memories is called an off-line learning algorithm. The other approach is on-line learning, in which we are only able to adjust the parameters based on direct play of the game. Policy gradients, genetic algorithms, and cross-entropy methods are all examples of this.

Here is what the code for running cart pole with experience replay looks like:

import numpy as np
from collections import deque

observations = deque(maxlen=20000)
last_action = np.zeros(2)
last_action[0] = 1
last_state = env.reset()

We start with our observations collection. A deque in Python is a queue that, once it hits capacity, starts removing items from the beginning. Creating the deque with a maxlen of 20,000 means we only store the most recent 20,000 observations. We also create last_action, a NumPy array that will store the action we decided on in the previous pass of the main loop, as a one-hot vector.
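As a quick aside, here is how a deque discards its oldest items once it is full (a toy sketch, not part of the script):

from collections import deque

recent = deque(maxlen=3)
for i in range(5):
    recent.append(i)
print(recent)  # deque([2, 3, 4], maxlen=3) - the two oldest items were dropped

With the observations store and initial action in place, here is the main loop: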

while True:
    env.render()
    last_action = choose_next_action(last_state)
    current_state, reward, terminal, _ = env.step(np.argmax(last_action))

This is the main loop. We will first render the environment, then decide on an action to take based on the last_state we were in, then take that action so as to get the next state:

    if terminal:
        reward = -1

The cart pole task in OpenAI Gym always gives a reward of 1 for every time step. We force a reward of -1 when we hit the terminal state so the agent has a signal to learn to avoid it:

    observations.append((last_state, last_action, reward, current_state, terminal))
    if len(observations) > 10000:
        train()

We store the information for this transition in our observations collection. We can also start training once we have enough observations stored. It is important to only begin training once we have a good number of samples; otherwise, a few early observations could heavily bias training:

    if terminal:
        last_state = env.reset()
    else:
        last_state = current_state

If we are in a terminal state, we need to reset our env so as to give us a fresh state of the game. Otherwise, we can just set last_state to the current_state for the next pass of the loop. We also still need to decide what action to take based on the state; we will come to that shortly. First, here is the actual train method, which follows the same steps as our earlier 1-D example, but changed to use samples from our observations:

import random

def train():
    mini_batch = random.sample(observations, 100)

Take 100 random items from our observations; these will be the mini_batch to train on:

    previous_states = [d[0] for d in mini_batch]
    actions = [d[1] for d in mini_batch]
    rewards = [d[2] for d in mini_batch]
    current_states = [d[3] for d in mini_batch]

Unpack the mini_batch tuples into separate lists for each type of data. This is the format we need to feed into our neural network:

    agents_reward_per_action = session.run(output_layer, feed_dict={input_placeholder: current_states})

Get the predicted reward for each action in each current_state from our neural network. The output here will be an array of the size of the mini_batch, where each item is an array with two elements: the estimate of the Q-value for taking the move left action, and the estimate for taking the move right action. We take the max of these to get the estimated Q-value of the state. Successive training loops will improve this estimate towards the true Q-value:

    agents_expected_reward = []
    for i in range(len(mini_batch)):
        if mini_batch[i][4]:
            # this was a terminal frame so there is no future reward...
            agents_expected_reward.append(rewards[i])
        else:
            agents_expected_reward.append(rewards[i] + FUTURE_REWARD_DISCOUNT * np.max(agents_reward_per_action[i]))

Augment the rewards we actually got with the rewards our network predicts if it's a non-terminal state:

    session.run(train_operation, feed_dict={
        input_placeholder: previous_states,
        action_placeholder: actions,
        target_placeholder: agents_expected_reward})

Finally, run the training operation on the network.
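For reference, the target built inside the loop above is the standard Q-learning target. A minimal restatement as a standalone helper (hypothetical, not part of deep_q_cart_pole.py; the default discount value shown is an assumption) would be:

import numpy as np

def q_learning_target(reward, terminal, next_q_values, discount=0.9):
    # terminal transitions have no future: the target is just the reward received
    if terminal:
        return reward
    # otherwise add the discounted best Q-value the network predicts for the next state
    return reward + discount * np.max(next_q_values)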

Epsilon greedy

Another issue Q-learning has is that, initially, the network will be very poor at estimating the rewards of actions. But these poor estimates are what determine the states we move into. Our early estimates may be so bad that we never reach a reward state from which we could learn. Imagine if, in the cart pole task, the network weights are initialized so that the agent always chooses to go left and hence fails after a few time steps. Because we only have samples of moving left, we will never start to adjust our weights for moving right, and so will never be able to find a state with better rewards.

There are a few different solutions to this, such as giving the network a reward for getting into novel situations, known as novelty search, or using some kind of modification to seek out the actions with the greatest uncertainty.

The simplest solution, and one that has been shown to work well, is to start by choosing actions randomly so as to explore the space, and then, as the network's estimates get better and better, replace those random choices with actions chosen by the network. This is known as the epsilon greedy strategy, and it can be used as an easy way to implement exploration for a range of algorithms. The epsilon here refers to the variable used to choose whether a random action is taken; greedy refers to taking the action with the maximum predicted value when not acting randomly. In the cart pole example, we will call this epsilon variable probability_of_random_action. It will start at 1, meaning every action is chosen at random, and then at each training step, we will reduce it by some small amount until it hits 0, at which point all actions are chosen by the network:

if probability_of_random_action > 0. and len(observations) > OBSERVATION_STEPS:
    probability_of_random_action -= 1. / 10000

In the final step, we need the method that changes our neural network output into the action of the agent:

def choose_next_action(last_state):
    if random.random() <= probability_of_random_action:
        action_index = random.randrange(2)

Choose an action randomly if a random value comes up less than probability_of_random_action; otherwise choose the max output of our neural network:

    else:
        readout_t = session.run(output_layer, feed_dict={input_placeholder: [last_state]})[0]
        action_index = np.argmax(readout_t)
    new_action = np.zeros([2])
    new_action[action_index] = 1
    return new_action

Here is a graph of training progress against the cart pole task:


Figure 7: Cart pole task, y = average length of game over last 10 games, x = number of games played

This looks good. Success for the cart pole task is defined as being able to last over 200 turns. After 400 games, we beat that comfortably, averaging well over 300 turns per game. Because we set this learning task up using OpenAI Gym, it is now easy to run it against other games. All we need to do is pass a different game string to gym.make and then adjust the number of inputs and outputs of our network to fit that game. There are a few other interesting control tasks in OpenAI Gym, such as the pendulum and acrobot, which Q-learning should also do well on, but as a challenge, let's look at playing some Atari games.
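For instance, pointing the same script at another classic control task could look something like the following sketch (the environment name and the sizes it reports are assumptions to check against your installed gym version):

import gym

env = gym.make('Acrobot-v1')                 # a different classic control task
num_inputs = env.observation_space.shape[0]  # size of the state array (6 for acrobot)
num_actions = env.action_space.n             # number of discrete actions (3 for acrobot)
# the network's first weight matrix becomes [num_inputs, 20] and the last [20, num_actions]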
