The Q-Learning algorithm

Solving an RL problem requires an estimate, during the learning process, of an evaluation function. This function must be able to assess, through the sum of the rewards, the success of a policy.

The basic idea of Q-Learning is that the algorithm learns the optimal evaluation function for the entire space of states and actions (S × A). This so-called Q-function provides a mapping of the form Q: S × A -> R, where the value Q(s, a) is the expected sum of the future rewards obtained by executing action a in state s. Once the agent has learned the optimal function, Q, it will be able to recognize which action will lead to the highest future reward in a given state.

One of the most commonly used implementations of the Q-Learning algorithm involves the use of a table. Each cell of the table holds a value Q(s, a), and all cells are initialized to 0. The action a performed by the agent is chosen using a policy that is epsilon-greedy with respect to Q.

The core of the Q-Learning algorithm is the training rule, which updates a single table element Q(s, a) at each step.

The algorithm follows these basic steps:

  1. Initialize Q(s, a) arbitrarily.
  2. Repeat the following (for each episode):
    1. Initialize s.
    2. Repeat (for each step of the episode):
      1. Choose an action a from s using a policy derived from Q (for example, epsilon-greedy).
      2. Take the action a, observe the reward r and the new state s', and apply the training rule:

         Q(s, a) <- Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]

      3. Set s <- s'.
    3. Continue until s is terminal.

We have depicted the algorithm in the following diagram:

Figure 2: Q-Learning algorithm

Let's summarize the parameters used in the Q-value update process (a minimal code sketch that uses them follows this list):

  • α is the learning rate, which is set between 0 and 1. Setting it to 0 means that the Q-values are never updated, and hence nothing is learned. Setting a high value such as 0.9 means that learning can occur quickly.
  • γ is the discount factor, which is set between 0 and 1. This models the fact that future rewards are worth less than immediate rewards. Mathematically, the discount factor needs to be set less than 1 for the algorithm to converge.
  • max_a' Q(s', a') is the maximum reward that is attainable in the state following the current one, that is, the reward for taking the optimal action thereafter.
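To make the training rule concrete, the following is a minimal tabular sketch in Python. It assumes a Gym-style environment with discrete observation and action spaces (exposing reset() and step() as in the FrozenLake example discussed next); the function name tabular_q_learning and the values chosen for alpha, gamma, and epsilon are illustrative, not prescribed by the algorithm:

import numpy as np

def tabular_q_learning(env, num_episodes=2000, alpha=0.8, gamma=0.95, epsilon=0.1):
    # One row per state, one column per action; all Q-values start at 0
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection with respect to the current Q-table
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = np.argmax(Q[s])
            s1, r, done, _ = env.step(a)
            # Training rule: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s1]) - Q[s, a])
            s = s1
    return Q

For example, Q = tabular_q_learning(gym.make('FrozenLake-v0')) would fill the 16 × 4 table for the environment introduced in the next section.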

The FrozenLake environment

The agent controls the movement of a character in a 4×4 grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile:

Figure 3: A representation of the FrozenLake-v0 grid world

The surface shown above is described using a grid such as the following:

SFFF   (S: starting point, safe)
FHFH   (F: frozen surface, safe)
FFFH   (H: hole, fall to your doom)
HFFG   (G: goal, where the frisbee is located)

The episode ends when we reach the goal or fall in a hole. We receive a reward of 1 if we reach the goal, and 0 otherwise.
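Before training anything, it can be instructive to interact with the environment using a purely random agent. The following minimal sketch only uses standard Gym calls (make, reset, render, step); running a single episode like this will almost always end with a reward of 0, since a random walk rarely reaches the goal:

import gym

env = gym.make('FrozenLake-v0')
s = env.reset()
env.render()                        # prints the 4x4 grid and the agent's position

done, total_reward = False, 0
while not done:
    a = env.action_space.sample()   # pick one of the four moves at random
    s, r, done, _ = env.step(a)     # reward is 1 only when the goal tile is reached
    total_reward += r

print("Episode finished with reward:", total_reward)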

Q-Learning for the FrozenLake problem

Neural networks are exceptionally strong at coming up with good features for highly structured data.

To solve the FrozenLake problem, we'll build a one-layer network that takes the state, encoded in a [1 × 16] vector, and learns the best move (action) by mapping the possible actions to a vector of length four.

The following implementation is based on TensorFlow.

First, we need to import all the libraries:

import gym
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt

Then we load and set the environment to test:

env = gym.make('FrozenLake-v0')

The network input is a state, encoded in a tensor of shape [1,16]. For this reason, we define the inputs1 placeholder:

inputs1 = tf.placeholder(shape=[1,16],dtype=tf.float32)

The network weights are initially chosen randomly by the tf.random_uniform function:

W = tf.Variable(tf.random_uniform([16,4],0,0.01))

The network output is given by the product of the inputs1 placeholder and the weights:

Qout = tf.matmul(inputs1,W)

The argmax evaluated on Qout gives the index of the action with the highest predicted Q-value:

predict = tf.argmax(Qout,1)

The target Q-values, nextQ, used to train the network are encoded in a tensor of shape [1,4]:

nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)

Next, we define a loss function to implement the backpropagation procedure.

The loss function is loss = ∑(Q_target − Q)², that is, the sum of the squared differences between the target Q-values and the currently predicted Q-values; its gradients are passed back through the network:

loss = tf.reduce_sum(tf.square(nextQ - Qout))

The optimizing function is the well-known GradientDescentOptimizer:

trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)

Next, we create the operation that initializes all the network variables (if you need tf.reset_default_graph() to clear the default graph, it must be called before the graph is built):

init = tf.global_variables_initializer()

Then we set the parameters for the Q-Learning training procedure:

y = .99            # discount factor (gamma)
e = 0.1            # probability of choosing a random action (epsilon)
num_episodes = 6000

jList = []         # number of steps taken in each episode
rList = []         # total reward collected in each episode

We define the session, sess, in which the network will have to learn the best possible sequence of moves:

with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        s = env.reset()    # reset the environment and get the initial state
        rAll = 0           # total reward collected in this episode
        d = False          # done flag
        j = 0              # step counter

        while j < 99:      # limit each episode to a maximum of 99 steps
            j += 1

The input state, one-hot encoded, is fed to the network, which returns the greedy action, a, and the full vector of Q-values, allQ:

            a,allQ = sess.run([predict,Qout],
                              feed_dict=
                              {inputs1:np.identity(16)[s:s+1]})
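The expression np.identity(16)[s:s+1] is simply a compact way of building the one-hot row vector that encodes state s. The following standalone snippet, with a hypothetical state index, shows what it produces:

import numpy as np

s = 5                               # a hypothetical state index
one_hot = np.identity(16)[s:s+1]    # row s of the 16 x 16 identity matrix
print(one_hot.shape)                # (1, 16)
print(one_hot)                      # all zeros except a 1 in position 5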

With probability e, the greedy action is replaced by a random one; this implements the epsilon-greedy exploration strategy:

            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()

Execute the a[0] action using the env.step() function, obtaining the reward, r, the new state, s1, and the done flag, d:

            s1,r,d,_ = env.step(a[0])

The new state, s1, is fed to the network to obtain its Q-values, Q1; the target Q-value for the chosen action is then set to r + y*maxQ1, following the Q-Learning training rule:

            Q1 = sess.run(Qout,feed_dict=
                          {inputs1:np.identity(16)[s1:s1+1]})
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0,a[0]] = r + y*maxQ1

The network weights are then updated by running the backpropagation step:

            _,W1 = sess.run([updateModel,W],
                            feed_dict=
                            {inputs1:np.identity(16)[s:s+1],nextQ:targetQ})

rAll here accumulates the total reward gained during the episode. Let's recall that the goal of an RL agent is to maximize the total reward that it receives in the long run:

            rAll += r

Update the state of the environment for the next step:

            s = s1
            if d == True:
                # Reduce the chance of a random action as the agent learns
                e = 1./((i/50) + 10)
                break
        jList.append(j)
        rList.append(rAll)

When the computation ends, the percent of successful episodes will be displayed:

print ("Percent of successfulepisodes: " +
str(sum(rList)/num_episodes) + "%")

If we run the model, we should get a result like this, which can be improved by tuning the network parameters:

>>>[2017-01-15 16:56:01,048] Making new env: FrozenLake-v0
Percentage of successful episodes: 0.558%
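Since matplotlib.pyplot was imported at the beginning but has not been used so far, one simple way to inspect the training progress is to plot the per-episode rewards collected in rList. This optional sketch reuses the rList variable from the training loop above; the 100-episode smoothing window is an arbitrary choice:

window = 100
smoothed = [sum(rList[i:i+window])/window for i in range(len(rList)-window)]
plt.plot(smoothed)                  # average reward over a sliding window
plt.xlabel('Episode')
plt.ylabel('Average reward (100-episode window)')
plt.show()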