Q-learning with TensorFlow

In the previous example, we saw how it is relatively simple, using a 16x4 table, to update the Q-values at each step of the learning process. It is easy to imagine that such a table can serve for simple problems, but real-world problems need a more sophisticated mechanism to represent the system state. This is the point where deep learning steps in. Neural networks are exceptionally good at coming up with good features for highly structured data.
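As a reminder, the tabular procedure adjusts a single cell of the 16x4 table at every step. The following is a minimal sketch of that update rule; the learning rate lr and the variable names are illustrative:

import numpy as np

Q = np.zeros([16, 4])    # one row per state, one column per action
lr, gamma = 0.8, 0.95    # illustrative learning rate and discount factor

def tabular_update(s, a, r, s1):
    # Move Q[s, a] towards the Bellman target r + gamma * max_a' Q[s1, a']
    Q[s, a] += lr * (r + gamma * np.max(Q[s1, :]) - Q[s, a])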

In this final section, we'll look at how to manage a Q-function with a neural network, which takes the state and action as input, and outputs the corresponding Q-value.

To do that, we'll build a one-layer network that takes the state, encoded in a [1x16] vector, and learns the best move (action) by mapping the possible actions to a vector of length four.

A recent application of deep Q-networks has been successful at playing some Atari 2600 games at expert human levels. Preliminary results were presented in 2013, and a paper was published in Nature in February 2015.

In the following, we describe our TensorFlow-based implementation of a Q-learning neural network for the FrozenLake-v0 problem.

Import all the libraries with the help of the following code:

import gym 
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt

If matplotlib is not already installed, you can check whether the package is available by executing the following command in a terminal:

$ apt-cache search python3-matplotlib

If it is available, you can install it with:

$ sudo apt-get install python3-matplotlib

Load and set the environment to test:

env = gym.make('FrozenLake-v0')
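As a quick check (assuming the standard FrozenLake-v0 environment), the sizes of the observation and action spaces are what determine the shape of the network, [1x16] inputs and four outputs:

print(env.observation_space.n)   # 16 states (a 4x4 grid)
print(env.action_space.n)        # 4 actions (left, down, right, up)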

The network input is the state, encoded in a tensor of shape [1,16]. For this reason, we define the inputs1 placeholder:

inputs1 = tf.placeholder(shape=[1,16],dtype=tf.float32)

The network weights are initially chosen randomly by the tf.random_uniform function:

W = tf.Variable(tf.random_uniform([16,4],0,0.01))

The network output is given by the product of the inputs1 placeholder and the weights:

Qout = tf.matmul(inputs1,W)

The argmax evaluated on Qout gives the index of the action with the highest Q-value, that is, the predicted action:

predict = tf.argmax(Qout,1)

The target Q-values (Qtarget) are encoded in a tensor of shape [1,4]:

Qtarget = tf.placeholder(shape=[1,4],dtype=tf.float32)

Next, we must define a loss function to optimize for the backpropagation procedure. The loss function is as follows:

loss = ∑ (Qtarget − Qout)²

Here, the difference between the target values and the current predicted Q-values is computed, and the gradients are passed through the network:

loss = tf.reduce_sum(tf.square(Qtarget- Qout))

The optimizer is the well-known GradientDescentOptimizer:

trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1) 
updateModel = trainer.minimize(loss)

Create the operation that initializes the network variables (note that tf.reset_default_graph(), if needed, must be called before the model is built, not at this point, or the operations defined above would be discarded):

init = tf.global_variables_initializer()

Following this, we set the parameters for the Q-learning training procedure:

gamma = .99          # discount factor
e = 0.1              # epsilon for the epsilon-greedy action selection
num_episodes = 6000

jList = []           # steps taken in each episode
rList = []           # total reward collected in each episode

We carry out the running session, in which the network will have to learn the best possible sequence of moves:

with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        s = env.reset()
        rAll = 0
        d = False
        j = 0

        while j < 99:
            j += 1

The input state is used here to feed the network:

            a, allQ = sess.run([predict, Qout],
                               feed_dict={inputs1: np.identity(16)[s:s+1]})
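Here, np.identity(16)[s:s+1] selects row s of the 16x16 identity matrix, that is, the [1,16] one-hot encoding of the current state. A minimal illustration, outside the training loop:

import numpy as np

s = 3                                   # illustrative state index
one_hot_state = np.identity(16)[s:s+1]
print(one_hot_state.shape)              # (1, 16)
print(one_hot_state)                    # a row vector with a 1 in position 3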

With probability e, a random action is chosen instead of the one predicted by the network (epsilon-greedy exploration):

            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()

Evaluate the action, a[0], using the function env.step(), obtaining the reward, r, and the state, s1:

            s1, r, d, _ = env.step(a[0])

The new state, s1, is used to compute the Q-values of the next step, from which the target tensor targetQ is built:

            Q1 = sess.run(Qout,
                          feed_dict={inputs1: np.identity(16)[s1:s1+1]})
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0, a[0]] = r + gamma*maxQ1

Of course, the weights must be updated for the backpropagation procedure:

            _, W1 = sess.run([updateModel, W],
                             feed_dict={inputs1: np.identity(16)[s:s+1],
                                        Qtarget: targetQ})

The rAll variable accumulates the total reward collected during the episode. Let's recall that the goal of a reinforcement learning agent is to maximize the total reward it receives in the long run:

            rAll += r

Update the state of the environment for the next step. When the episode terminates (d == True), the exploration rate e is reduced as training progresses, and the number of steps and the total reward of the episode are stored:

            s = s1
            if d == True:
                e = 1./((i/50) + 10)
                break
        jList.append(j)
        rList.append(rAll)

When the computation ends, the percentage of successful episodes will be displayed:

print("Percent of successful episodes: " +
      str(sum(rList)/num_episodes) + "%")

Running the model, you should have a result like the following, which can be improved by tuning the network parameters:

>>>
[2017-03-23 12:36:19,986] Making new env: FrozenLake-v0
Percent of successful episodes: 0.558%
>>>
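Since matplotlib has been imported, you can also plot the statistics collected during training. The following is a minimal sketch, assuming jList and rList have been filled as shown above:

# Total reward collected in each episode
plt.plot(rList)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.show()

# Number of steps taken in each episode
plt.plot(jList)
plt.xlabel('Episode')
plt.ylabel('Steps')
plt.show()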