Atari Breakout

Breakout is a classic Atari game originally released in 1976. The player controls a paddle and must use it to bounce a ball into the colored blocks at the top of the screen. Points are scored whenever a block is hit. If the ball travels down past the paddle off the bottom of the screen, the player loses a life. The game ends either when all the blocks have been destroyed or when the player loses all three lives they start with:


Figure 8: Atari Breakout

Think about how much harder learning a game like Breakout is compared to the cart pole task we just looked at. For cart pole, if a bad move is made that leads to the pole tipping over, we will normally receive feedback within a couple of moves. In Breakout, such feedback is much rarer. If we position our paddle wrong, that may be the result of 20 or more moves that went into positioning it.

Atari Breakout random benchmark

Before we go any further, let's create an agent that will play Breakout by selecting moves randomly. That way we will have a benchmark against which to judge our new agent:

from collections import deque

import random
import gym
import numpy as np

env = gym.make("Breakout-v0")
observation = env.reset()
reward_per_game = 0
scores = deque(maxlen=1000)

while True:
    env.render()

    next_action = random.randint(1, 3)
    observation, reward, terminal, info = env.step(next_action)
    reward_per_game += reward

We select our move randomly; in Breakout, the moves are as follows:

  • 1: Move left
  • 2: Stay still
  • 3: Move right

Still inside the main loop, we then check whether the episode has finished:

    if terminal:
        scores.append(reward_per_game)
        reward_per_game = 0
        print(np.mean(scores))
        env.reset()

If we've come to the end of a game, we store the score, print the running mean over the last 1,000 games, and call env.reset() to keep playing. If we let this run for a few hundred games, we can see that random Breakout tends to score around 1.4 points per game. Let's see how much better we can do with Q-learning.

The first issue we must deal with in adapting from our cart pole task is that the state space is so much larger. Where the cart pole input was a set of four numbers, for Breakout it is the full screen of 210 by 160 pixels, with each pixel containing three values, one for each color channel. To understand the game, those pixels must be related to blocks, paddles, and balls, and then the interaction between those things must on some level be computed. To make things even more difficult, a single image of the screen is not enough to understand what is going on in the game. The ball is moving over time with velocity; to choose the best move, you cannot rely on the current screen image alone.

There are three approaches to dealing with this. One is to use a recurrent neural network that judges the current state based on the previous output. This approach can work, but it is a lot more difficult to train. Another approach is to supply the screen input as a delta between the current frame and the last. In Figure 9, you will see an example of this. Both frames have been converted to grayscale because color provides no extra information in Pong. The image of the previous frame has been subtracted from the image of the current frame. This allows you to see the path of the ball and the direction in which both paddles are moving:


Figure 9: Pong delta images

This approach can work well for games such as Pong, which are only composed of moving elements, but for a game such as Breakout, where blocks are in fixed positions, we would be losing important information about the state of the world. Indeed, we would only ever be able to see a block for a brief flash when it was hit, while blocks we had not yet hit would remain invisible.
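The delta computation itself is only a couple of lines. Here is a rough, standalone sketch of the idea; the function name and the simple channel-averaging grayscale conversion are our own choices, not code from the repo:

import numpy as np

def frame_delta(current_frame, previous_frame):
    # Convert both RGB frames to grayscale by averaging the color channels.
    current_gray = current_frame.mean(axis=2)
    previous_gray = previous_frame.mean(axis=2)
    # Moving objects show up as non-zero values; the static background cancels out.
    return current_gray - previous_gray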

The third approach, and the one we will take for Breakout, is to set the current state to be the images of the last n frames of the game, where n is 2 or more. This gives the neural network all the information it needs to make a good judgment about the state of the game. Using an n of 4 is a good default value for most games, but for Breakout, an n of 2 has been found to be sufficient. It is good to use as low a value for n as possible because this reduces the number of parameters that our network will need.

Preprocessing the screen

The full code for this is in deep_q_breakout.py in the Git repo, but we will go through a few of the important modifications from the cart pole example here. The first is the type of neural network. For the cart pole, a network with a single hidden layer sufficed, but that involved four values being mapped to just two actions. Now we are working with screen_width * screen_height * color_channels * number_of_frames_of_state = 201,600 inputs being mapped to three actions, a much higher level of complexity.

The first thing we can do to make life easier for ourselves is to resize the screen to a smaller size. From experimentation, we find that you can still play Breakout with a much smaller screen. Scaling down by a factor of 2 still allows you to see the ball, the paddle, and all the blocks. Also, a lot of the image space is not useful information for the agent. The score at the top, the gray patches along the sides and top, and the black space at the bottom can all be cropped from the image. This allows us to reduce the 210 by 160 screen to a more manageable 84 by 72, cutting the number of inputs by more than three quarters.

Also, in Breakout, the color of the pixels doesn't contain any useful information, so we can replace the three color channels with a single channel, which is only ever black or white, reducing the number of inputs again to a third. We are now down to 84 * 72 = 6,048 bits, and we need two frames of the game to be able to learn from. Let's write a method that does this processing of the Breakout screen:

def pre_process(screen_image):

The screen_image argument will be the Numpy array that we get from the env.reset or env.step operations in OpenAI Gym. It will have shape 210 by 160 by 3, with each item being an int between 0 and 255 representing the value for that color channel:

    screen_image = screen_image[32:-10, 8:-8]

This operation on the Numpy array crops the image, so we remove the scores at the top, black space at the bottom, and gray areas on either side:

    screen_image = screen_image[::2, ::2, 0]

The ::2 slice argument means we take every second item, which Numpy conveniently supports. The 0 at the end means we take just the red color channel, which is fine because we are about to turn the image into black and white anyway. screen_image will now be of size 84 by 72:

    screen_image[screen_image != 0] = 1

This sets everything that isn't completely black in the image to 1. This may not work for some games where you need precise contrast, but it works fine for Breakout:

    return screen_image.astype(np.float)

Finally, this returns screen_image from the method, making sure that the type is converted to float. This will save time later when we want to put our values into TensorFlow. Figure 10 shows how the screen looks before and after preprocessing. After processing, though it is a lot less pretty, the image still contains all the elements you would need to play the game:


Figure 10: Breakout before and after processing
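As a quick sanity check of the shapes (this snippet is ours, not part of the repo's code), we can run the method on a raw frame from the environment:

import gym

env = gym.make("Breakout-v0")
observation = env.reset()

print(observation.shape)               # (210, 160, 3) - the raw Atari screen
print(pre_process(observation).shape)  # (84, 72) - cropped, downsampled, black and white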

This leaves our state as 84 * 72 * 2 = 12,096 bits, meaning we have 2^12096 possible states that we need to map our three actions to. This sounds like a lot, but the problem is made simpler by the fact that, though that is the full range of states possible in Breakout, only a small and predictable subset of them will ever occur. The paddle moves horizontally across a fixed area, a small cluster of pixels will be active for the ball, and some number of blocks will exist across the central area. One could easily imagine that there is a small set of features that could be extracted from the image that might better relate to the actions we want to take, such as the relative position of our paddle to the ball, the velocity of the ball, and so on; these are the kind of features deep neural networks can pick up.

Creating a deep convolutional network

Let's next replace the single hidden layer network from the cart pole example with a deep convolutional network. Convolutional networks were first introduced in Chapter 4, Unsupervised Feature Learning. A convolutional network makes sense because we are dealing with image data. The network we create will have three convolutional layers leading to a single flat layer, leading to our output. Having four hidden layers makes some intuitive sense because we know we are going to need to detect very abstract, invariant representations from the pixels, and this kind of architecture has also been shown to work successfully across a range of similar tasks. Because this is a deep network, relu activation functions make sense. Figure 11 shows what the network will look like:


Figure 11: Architecture for our network that will learn to play Breakout

Here is the code for creating our deep convolutional network:

SCREEN_HEIGHT = 84
SCREEN_WIDTH = 72
STATE_FRAMES = 2
ACTIONS_COUNT = 3

CONVOLUTIONS_LAYER_1 = 32
CONVOLUTIONS_LAYER_2 = 64
CONVOLUTIONS_LAYER_3 = 64
FLAT_SIZE = 11 * 9 * CONVOLUTIONS_LAYER_3
FLAT_HIDDEN_NODES = 512

These constants will be used throughout our create_network method:

def create_network():
    input_layer = tf.placeholder("float", [None, SCREEN_HEIGHT, SCREEN_WIDTH, STATE_FRAMES])

We define our input to have dimensions of the screen height, the screen width, and the number of state frames; the None dimension will be for batches of states:

    convolution_weights_1 = tf.Variable(tf.truncated_normal([8, 8, STATE_FRAMES, CONVOLUTIONS_LAYER_1], stddev=0.01))
    convolution_bias_1 = tf.Variable(tf.constant(0.01, shape=[CONVOLUTIONS_LAYER_1]))

The first convolutional layer will be an 8 by 8 window across the width and height, taking in both the state frames. So it will get data on both the current 8 by 8 patch of the image and what that 8 by 8 patch looked like in the previous frame. Each patch will map to 32 convolutions that will be the input to the next layer. We give the bias a very slight positive value; this can be good for layers with relu activations because it reduces the number of dead neurons caused by the relu function:

    hidden_convolutional_layer_1 = tf.nn.relu(
        tf.nn.conv2d(input_layer, convolution_weights_1, 
        strides=[1, 4, 4, 1], padding="SAME") + convolution_bias_1)

We put the weight and bias variables into the convolutional layer. This is created by the tf.nn.conv2d method. Setting strides=[1, 4, 4, 1] means the 8 by 8 convolutional window will be applied every four pixels across the width and height of the image. All the convolutional layers will go through the relu activation function:

    convolution_weights_2 = tf.Variable(tf.truncated_normal([4, 4, CONVOLUTIONS_LAYER_1, CONVOLUTIONS_LAYER_2], stddev=0.01))
    convolution_bias_2 = tf.Variable(tf.constant(0.01, shape=[CONVOLUTIONS_LAYER_2]))

    hidden_convolutional_layer_2 = tf.nn.relu(
        tf.nn.conv2d(hidden_convolutional_layer_1, convolution_weights_2, strides=[1, 2, 2, 1], padding="SAME") + convolution_bias_2)

    convolution_weights_3 = tf.Variable(tf.truncated_normal([3, 3, CONVOLUTIONS_LAYER_2, CONVOLUTIONS_LAYER_3], stddev=0.01))
    convolution_bias_3 = tf.Variable(tf.constant(0.01, shape=[CONVOLUTIONS_LAYER_3]))

    hidden_convolutional_layer_3 = tf.nn.relu(
        tf.nn.conv2d(hidden_convolutional_layer_2, convolution_weights_3, strides=[1, 1, 1, 1], padding="SAME") + convolution_bias_3)

Creating the next two convolutional layers proceeds in the same way. Our final convolutional layer, hidden_convolutional_layer_3, must now be connected to a flat layer:

    hidden_convolutional_layer_3_flat = tf.reshape(hidden_convolutional_layer_3, [-1, 9 * 11 * CONVOLUTIONS_LAYER_3])

This reshapes our convolutional layer, which has dimensions None, 11, 9, 64, into a single flat layer:

    feed_forward_weights_1 = tf.Variable(tf.truncated_normal([FLAT_SIZE, FLAT_HIDDEN_NODES], stddev=0.01))
    feed_forward_bias_1 = tf.Variable(tf.constant(0.01, shape=[FLAT_HIDDEN_NODES]))

    final_hidden_activations = tf.nn.relu(
        tf.matmul(hidden_convolutional_layer_3_flat, feed_forward_weights_1) + feed_forward_bias_1)

    feed_forward_weights_2 = tf.Variable(tf.truncated_normal([FLAT_HIDDEN_NODES, ACTIONS_COUNT], stddev=0.01))
    feed_forward_bias_2 = tf.Variable(tf.constant(0.01, shape=[ACTIONS_COUNT]))

    output_layer = tf.matmul(final_hidden_activations, feed_forward_weights_2) + feed_forward_bias_2

    return input_layer, output_layer

We then create the last two fully connected layers in the standard way. Note that there is no activation function on the final layer because we are learning the value of an action in a given state here, and that has an unbounded range.
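If you want to check where the 11 and 9 in the reshape come from, the arithmetic is easy to verify. This small standalone snippet (ours, not from the repo) traces the screen dimensions through the three strides with SAME padding:

import math

height, width = 84, 72
for stride in (4, 2, 1):
    # With SAME padding, a convolution's output size is ceil(input_size / stride).
    height = math.ceil(height / stride)
    width = math.ceil(width / stride)
    print(height, width)  # prints 21 18, then 11 9, then 11 9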

Our main loop will now need the following code added so that the current state is a combination of multiple frames. For Breakout, STATE_FRAMES is set to 2, but higher numbers will also work:

screen_binary = pre_process(observation)

if last_state is None:
    last_state = np.stack(tuple(screen_binary for _ in range(STATE_FRAMES)), axis=2)

If we have no last_state, then we construct a new Numpy array that is just the current screen_binary stacked STATE_FRAMES times:

else:
    screen_binary = np.reshape(screen_binary, (SCREEN_HEIGHT, SCREEN_WIDTH, 1))
    current_state = np.append(last_state[:, :, 1:], screen_binary, axis=2)

Otherwise, we drop the oldest frame from last_state and append our new screen_binary along the final axis to create the new current_state. Then we just need to remember to reassign last_state to equal current_state at the end of the main loop:

last_state = current_state

One issue we may now run into is that each state is now an 84 * 72 * 2 array, and we want to store on the order of 1,000,000 of these as our list of past observations from which to train. Unless your computer is quite a beast, you may start to run into memory issues. Fortunately, these arrays contain only two values, 0 and 1, and compress very well, so one simple solution is to use in-memory compression. This sacrifices a bit of CPU time to save on memory, so before using it, consider which one is more important to you. Implementing it in Python takes only a few lines:

import zlib
import pickle

observations.append(zlib.compress(
    pickle.dumps((last_state, last_action, reward, current_state, terminal), 2), 2))

Here, we compress the data before adding it to our list of observations:

mini_batch_compressed = random.sample(observations, MINI_BATCH_SIZE)
mini_batch = [pickle.loads(zlib.decompress(comp_item)) for comp_item in mini_batch_compressed]

Then when sampling from the list, we decompress only our mini batch sample as we use it.

Another issue we may run into is that while the cart pole can be trained in as little as a couple of minutes, for Breakout, training will be measured in days. To guard against something going wrong, such as a power failure shutting off the computer, we will want to save our network weights as we go. In TensorFlow, this takes only a couple of lines:

import os

CHECKPOINT_PATH = "breakout"

saver = tf.train.Saver()

if not os.path.exists(CHECKPOINT_PATH):
    os.mkdir(CHECKPOINT_PATH)

checkpoint = tf.train.get_checkpoint_state(CHECKPOINT_PATH)
if checkpoint:
    saver.restore(session, checkpoint.model_checkpoint_path)

This can be put at the beginning of the file, just after the session.run(tf.initialize_all_variables()) line. Then we just need to run the following command:

saver.save(session, CHECKPOINT_PATH + '/network')
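In the full script, this call would sit inside the training loop behind a simple iteration check. Here is a minimal sketch of that pattern; the SAVE_EVERY_X_STEPS constant and the training_iterations counter are illustrative names, not necessarily those used in the repo:

SAVE_EVERY_X_STEPS = 2000

# Inside the training loop, after each training step:
if training_iterations % SAVE_EVERY_X_STEPS == 0:
    saver.save(session, CHECKPOINT_PATH + '/network')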

We run this every couple of thousand training iterations so that we always have a recent backup of our network. Now let's see what training looks like:


As we can see, after 1.7 million iterations, we are playing at a level well above random. This same Q-learning algorithm has been tried on a wide range of Atari games and, with good hyperparameter tuning, was able to achieve human-level or higher performance on, among others, Pong, Space Invaders, and Q*bert.

Convergence issues in Q-learning

But it's not all plain sailing. Let's see how the agent's training continues after the end of the preceding sequence:


As you can see, at a certain point the agent's ability took a massive and prolonged drop before returning to a similar level. The likely reason for this (as much as we can ever know the exact reason) is one of the inherent problems with Q-learning.

Q-learning trains against its own expectation of how good it thinks a state-action pair will be. This is a moving target because every time you run a training step, the targets change. We hope that they are moving towards a more accurate estimation of the reward, but as they head there, small changes in parameters can result in quite extreme oscillations.

Once we end up in a space where we are doing worse than our previous estimation of our ability, every single state-action evaluation must adjust to this new reality. If we were getting an average of 30 points a game and, with our new policy, we are down to just 20, the whole network must adjust to this.

Target network freezing (Mnih et al., 2015, Human-level control through deep reinforcement learning, Nature) can help reduce this. A second neural network, referred to as the target network, is created as a copy of the main training network. During training, we use the target network to generate the target values used to train the main neural network. In this way, the main network is learning against a more fixed point. The target network's weights are frozen, but once a set number of iterations have passed, or a convergence criterion is reached, the target network is updated with the values from the main network. This process has been shown to significantly speed up training.
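As a rough sketch of how the periodic copy might look in TensorFlow 1.x, assuming we keep the variables of the two networks in two lists (main_network_variables, target_network_variables, and the TARGET_UPDATE_EVERY_X_STEPS constant are our own illustrative names, not from the repo):

# Build one assign op per pair of variables; running them copies the main
# network's weights into the target network.
update_target_ops = [target_var.assign(main_var)
                     for main_var, target_var in zip(main_network_variables,
                                                     target_network_variables)]

# Inside the training loop, refresh the frozen target network every so often.
if training_iterations % TARGET_UPDATE_EVERY_X_STEPS == 0:
    session.run(update_target_ops)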

Another issue that a lot of reinforcement learning can struggle with pertains to games that have quite extreme rewards. Pac-Man, for example, has a very high reward for taking the power pill and then eating the ghosts. These extreme rewards can cause problems with the gradients and lead to sub-optimal learning. The very easy but unsatisfactory way to fix this is called reward clipping, which simply involves clipping the rewards received from the environment into some range (-1 to +1 is commonly used). For very little effort, this works, but it has the problem that the agent loses the information about these larger rewards.
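Reward clipping is a one-line change in the main loop, right after the environment step; a minimal sketch:

observation, reward, terminal, info = env.step(next_action)
# Clip the reward into [-1, 1] so that very large scores do not dominate the gradients.
reward = np.clip(reward, -1.0, 1.0)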

Another approach is what's called a normalized deep Q network (van Hasselt et al., Learning values across many orders of magnitude, 2016). This involves setting up the neural network to output the expected reward of the state and action in the -1 to 1 range. The actual Q-value is then recovered from this normalized output by putting it through the following equation:

Q(s, a) = σ * U(s, a) + µ

Here, U(s, a) is the output of the neural network. The parameters σ and µ can be worked out by making sure that the rescaled output stays consistent between the target network and the main network, as described in the section on target network freezing:

σ_target * U_target(s, a) + µ_target = σ_main * U_main(s, a) + µ_main

Using this approach, the neural network gradients will be directed more towards learning the relative values of states and actions as opposed to expending energy simply learning the scale of the Q-value.

Policy gradients versus Q-learning

Though we gave an example using policy gradients for learning a board game and Q-learning for a computer game, neither technique is limited to that type of task. Originally, Q-learning was considered the better technique overall, but over time, and with better hyperparameter tuning, policy gradients have often been shown to perform better. The world's best performance in backgammon was achieved in 1991 using a neural network and Q-learning, and the latest research suggests that policy gradients are best for most Atari games. So when should you use policy gradients versus Q-learning?

One constraint is that Q-learning can only work for discrete action tasks, whereas policy gradients can also learn continuous action tasks. Also, Q-learning learns a deterministic policy, and for some tasks the optimum behavior involves some degree of randomness, for example, rock-paper-scissors, where any behavior that deviates from pure randomness can be exploited by an opponent.

There is also the online versus offline aspect. For many tasks, especially robot control tasks, online learning may be very expensive, and the ability to learn from stored memories is needed, so Q-learning is the better option. Unfortunately, the success of both Q-learning and policy gradients can vary a lot depending on the task and the choice of hyperparameters, so when tackling a new task, experimentation appears to be the best way to decide.

Policy gradients also have a greater tendency to get stuck in local minima. Q-learning has a better chance of finding the global optimum, but the cost of this is that it is not proven to converge, and performance may oscillate wildly or fail completely on the way there.

But there is also another approach that takes some of the best aspects of both. These are known as actor-critic methods.
