Q-learning in action

A game may run at anywhere from 16 to 60 frames per second, and rewards are often received for actions taken many seconds earlier. Also, the state space is vast. In computer games, the state consists of all the pixels on the screen used as input to the game. If we imagine a screen downsampled to, say, 80 x 80 pixels, all of them binary, black or white, that is still 2^6400 possible states. This makes a direct map from state to reward impractical.

What we need to do instead is learn an approximation of the Q-function. This is where neural networks come in, thanks to their ability to act as universal function approximators. To train our Q-function approximation, we will store all the game states, rewards, and actions our agent took as it plays through the game. The loss for our network is the squared difference between its estimate of the value of the previous state and the target value: the actual reward received in the current state, plus the discount factor times its estimate of the value of the current state it has reached:

loss = (rewards(s, a) + g * max_a' Q(s', a') - Q(s, a))^2

Here, s is the previous state, a is the action that was taken in that state, and 0 < g < 1 is the discount factor. rewards is a function that returns the reward for taking an action in a state. actions is a function that returns the state s' you transition into after taking action a in state s, along with all the actions a' available in that state. Q is the Q-function presented earlier.
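Before moving on, here is a minimal sketch of how this target and loss could be computed in plain Python. The names rewards and actions follow the definitions above, but the q_network callable standing in for Q is hypothetical and is not part of the example code that follows:

def q_learning_loss(s, a, rewards, actions, q_network, g=0.5):
    # Target: the reward actually received, plus the discounted best value
    # the network predicts for the state we end up in
    s_prime, available_actions = actions(s, a)
    target = rewards(s, a) + g * max(q_network(s_prime, a_prime)
                                     for a_prime in available_actions)
    # Loss: squared difference between the network's estimate and the target
    return (q_network(s, a) - target) ** 2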

By training successive iterations in this manner, our Q-function approximator will slowly converge towards the true Q-function.

Let's start by training the Q-function for the world's simplest game. The environment is a one-dimensional map of states. A hypothetical agent must navigate this maze by moving either left or right so as to maximize its reward. We will set up the rewards for each state as follows:

rewards = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

If we were to visualize it, it might look something like this:


Figure 4: Simple maze game, agent can move between connected nodes and can get a reward of 1 in the top node.

If we were to put our agent into this "maze" at position 1, it would have the option of moving to either position 0 or position 2. We want to build a network that learns the value of each state and, by extension, the value of taking an action that moves into that state. The first pass of training will teach the network just the innate reward of each state. On the second pass, it will use the information gained from the first pass to improve its estimates. What we expect to see at the end of training is a pyramid shape, with the most value in the space with the reward of 1 and the value decreasing on either side as we move away from it, towards spaces from which the agent would have to travel further, and thus apply more future discounting, to get the reward. Here is how this looks in code (the full sample is in q_learning_1d.py in the Git repo):

import tensorflow as tf
import numpy as np

states = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
NUM_STATES = len(states)

We create a list of states; the value of each item in the list is the reward the agent will get for moving to that position. In this example, it gets a reward for getting to the 5th position:

NUM_ACTIONS = 2
DISCOUNT_FACTOR = 0.5

def one_hot_state(index):
    array = np.zeros(NUM_STATES)
    array[index] = 1.
    return array

This method will take a number and change it into a one-hot encoding for the space of our states, for example, 3 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]:

session = tf.Session()
state = tf.placeholder("float", [None, NUM_STATES])
targets = tf.placeholder("float", [None, NUM_ACTIONS])

We create a TensorFlow session and placeholders for our input and targets; the None in the arrays is for the mini-batch dimension:

weights = tf.Variable(tf.constant(0., shape=[NUM_STATES, NUM_ACTIONS]))
output = tf.matmul(state, weights)

For this simple example, a linear relationship between the state and the action values is enough to value everything accurately, so we only create an output layer that is a matrix multiplication of the state and the weights. There is no need for a hidden layer or any kind of non-linearity:
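If the task did need a non-linear Q-function, we could slot a hidden layer between the input and the output, roughly along these lines (a sketch only; HIDDEN_UNITS and these extra variables are hypothetical and do not appear in q_learning_1d.py):

HIDDEN_UNITS = 20  # hypothetical hidden layer size

hidden_weights = tf.Variable(tf.truncated_normal([NUM_STATES, HIDDEN_UNITS], stddev=0.1))
hidden_bias = tf.Variable(tf.constant(0.01, shape=[HIDDEN_UNITS]))
# ReLU non-linearity applied to the hidden layer
hidden_layer = tf.nn.relu(tf.matmul(state, hidden_weights) + hidden_bias)

output_weights = tf.Variable(tf.truncated_normal([HIDDEN_UNITS, NUM_ACTIONS], stddev=0.1))
output = tf.matmul(hidden_layer, output_weights)

For our 1-D game, though, the single linear layer above is all we need.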

loss = tf.reduce_mean(tf.square(output - targets))
train_operation = tf.train.GradientDescentOptimizer(0.05).minimize(loss)
session.run(tf.initialize_all_variables())

We use the MSE for the loss and standard gradient descent training. What makes this Q-learning is what we will eventually use as the value for the targets:

for _ in range(1000):
    state_batch = []
    rewards_batch = []

    for state_index in range(NUM_STATES):
        state_batch.append(one_hot_state(state_index))

We create a state_batch, each item of which is one of the states of the game in one-hot form; for example, [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], and so on. We will then train the network to approximate the value of each state:

        minus_action_index = (state_index - 1) % NUM_STATES
        plus_action_index = (state_index + 1) % NUM_STATES

For each state, we now get the position we would end up in if we took each of the possible actions from that state. Note that for this example the states wrap around, so moving -1 from position 0 puts you in position 9:
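The wrap-around is just Python's modulo arithmetic at work; for example (illustration only, not part of the sample):

NUM_STATES = 10  # as above, len(states) == 10
print((0 - 1) % NUM_STATES)  # 9: moving left from position 0 wraps to the last position
print((9 + 1) % NUM_STATES)  # 0: moving right from the last position wraps back to 0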


        minus_action_state_reward = session.run(output, feed_dict={state: [one_hot_state(minus_action_index)]})
        plus_action_state_reward = session.run(output, feed_dict={state: [one_hot_state(plus_action_index)]})

We use our network, our Q-function approximator, to estimate the reward we would get for taking each of the two actions: that is, the value the network currently assigns to the states at minus_action_index and plus_action_index, which are the states those actions would put us into:

        minus_action_q_value = states[minus_action_index] + DISCOUNT_FACTOR * np.max(minus_action_state_reward)
        plus_action_q_value = states[plus_action_index] + DISCOUNT_FACTOR * np.max(plus_action_state_reward)

Here, we have the Python version of the now familiar Q-function equation. We take the initial reward for moving into a state and add to it the DISCOUNT_FACTOR times the max reward we could receive for our actions taken in that state:

        action_rewards = [minus_action_q_value, plus_action_q_value]
        rewards_batch.append(action_rewards)

We add these to the rewards_batch, which will be used as targets for the training operation:

    session.run(train_operation, feed_dict={
        state: state_batch,
        targets: rewards_batch})

    print([states[x] + np.max(session.run(output, feed_dict={state: [one_hot_state(x)]}))
           for x in range(NUM_STATES)])

We then run the actual train step once we have the full set of rewards for each state. If we run this script and look at the output, we can get a sense of how the algorithm iteratively updates. After the first training run, we see this:

[0.0, 0.0, 0.0, 0.05, 1.0, 0.05, 0.0, 0.0, 0.0, 0.0]

Everything is 0, except for the rewarding state itself and the two states on either side of it. Those two states now get a value on the basis that you could move from them to the reward square. Go forward a few more training steps and you will see that the value has started to spread out across the states:

[0.0, 0.0, 0.013, 0.172, 1.013, 0.172, 0.013, 0.0, 0.0, 0.0]

The eventual output for this program will look something like this:

[0.053, 0.131, 0.295, 0.628, 1.295, 0.628, 0.295, 0.131, 0.053, 0.02]

As you can see, the highest reward is in the fifth spot in the array, the position we originally set up to have the reward. But the reward we gave was only 1; so why is the reward here higher than that? This is because 1.295 is the sum of the reward gained for being in the current space plus the reward we can get in the future for moving away from this space and coming back again repeatedly, with these future rewards reduced by our discount factor, 0.5.

Learning this kind of future reward out to infinity is fine, but many tasks have a fixed end. For example, the task might be stacking objects on a shelf, which ends when either the stack collapses or all the objects have been stacked. To add this concept into our simple 1-D game, we need to add terminal states. These are states where, once reached, the task ends; so in contrast to every other state, when evaluating the Q-function for them, we do not add on a future reward. To make this change, first we need an array to define which states are terminal:

terminal = [False, False, False, False, True, False, False, False, False, False]

This sets the fifth state, the one that gives us the reward, to be terminal. Then all we need to do is modify our training code to take this terminal state into account:

        if terminal[minus_action_index]:
            minus_action_q_value = DISCOUNT_FACTOR * states[minus_action_index]
        else:
            minus_action_state_reward = session.run(output, feed_dict={state: [one_hot_state(minus_action_index)]})
            minus_action_q_value = DISCOUNT_FACTOR * (states[minus_action_index] + np.max(minus_action_state_reward))

        if terminal[plus_action_index]:
            plus_action_q_value = DISCOUNT_FACTOR * states[plus_action_index]
        else:
            plus_action_state_reward = session.run(output, feed_dict={state: [one_hot_state(plus_action_index)]})
            plus_action_q_value = DISCOUNT_FACTOR * (states[plus_action_index] + np.max(plus_action_state_reward))

If we run the code again now, the output will settle to this:

[0.049, 0.111, 0.242, 0.497, 1.0, 0.497, 0.242, 0.111, 0.0469, 0.018]