Q-learning

With the introduction of quality iteration methods came a finite state method called Q-learning, or quality learning. Q-learning applies quality iteration to a given finite state problem in order to determine the best course of action for an agent. The equation we saw in the previous section can now be represented as the following:

Consider the following equation:

Q(s_t, a_t) = Q(s_t, a_t) + α(r_t + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t))

In this equation, the terms are as follows:

  •  s_t: current state
  •  a_t: current action
  •  a: next action (the action considered in the next state, s_{t+1})
  •  r_t: current reward
  •  α: learning rate (alpha)
  •  γ: reward discount factor (gamma)
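To make the update concrete, here is a minimal, stand-alone sketch of a single Q update in Python. The state and action values are illustrative and not taken from Chapter_5_3.py; the table shape and the learning_rate and gamma values mirror the ones used later in this chapter:

import numpy as np

q_table = np.zeros([16, 4])   # 16 states x 4 actions, as in FrozenLake-v0
learning_rate = .1            # alpha
gamma = .9                    # reward discount factor

state, action, reward, next_state = 0, 2, 0.0, 1
# Move Q(state, action) toward the reward plus the discounted best Q value
# available from the next state.
q_table[state, action] += learning_rate * (
    reward + gamma * np.max(q_table[next_state]) - q_table[state, action])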

The Q value is now updated iteratively as the agent roams through its environment. Nothing demonstrates these concepts better than an example. Open up Chapter_5_3.py and follow these steps:

  1. We start with the various imports and set them up as shown in the following code:
from collections import deque
import numpy as np
import os
clear = lambda: os.system('cls') #linux/mac use 'clear'
import time
import gym
from gym import wrappers, logger
  2. These imports just load the basic libraries we need for this example. Remember, you will need to install Gym to run this sample.
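If Gym is not already installed, it is typically available with pip install gym. Note that this sample assumes the classic Gym API (env.step returning four values and the wrappers.Monitor class), so a correspondingly classic version of the library is assumed here.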
  3. Next, we set up a new environment; in this example, we use the basic FrozenLake-v0 sample, a perfect environment to test Q-learning on:
environment = 'FrozenLake-v0'
env = gym.make(environment)
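As a quick sanity check (not part of Chapter_5_3.py), you can reset and render the freshly created environment once to see the text grid before any training happens:

state = env.reset()   # returns the starting state as an integer index
env.render()          # prints the current text grid to the console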
  4. Then we set up the AI environment (env) and a number of other parameters:
outdir = os.path.join('monitor','q-learning-{0}'.format(environment))
env = wrappers.Monitor(env, directory=outdir, force=True)
env.seed(0)
env.is_slippery = False
q_table = np.zeros([env.observation_space.n, env.action_space.n])

#parameters
wins = 0
episodes = 40000
delay = 1

epsilon = .8
epsilon_min = .1
epsilon_decay = .001
gamma = .9
learning_rate = .1
  5. In this section of the code, we set up a number of variables that we will get to shortly. For this sample, we are using a wrapper tool to monitor the environment, which is useful for diagnosing any potential training issues. The other thing to note is the setup of the q_table array, whose shape is defined by the environment's observation_space (state) and action_space (action); spaces can describe arrays and not just vectors. In this particular example, the action_space is a vector, but it could be a multi-dimensional array or tensor. A quick way to confirm the table's shape is shown after this step.
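The following short check (illustrative, not part of Chapter_5_3.py) shows what those spaces amount to for FrozenLake-v0:

print(env.observation_space.n)   # 16 discrete states (a 4 x 4 grid)
print(env.action_space.n)        # 4 discrete actions (left, down, right, up)
print(q_table.shape)             # (16, 4), one Q value per state/action pair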
  6. Pass over the next section of functions and skip to the end, where the training iteration occurs, as shown in the following code:
for episode in range(episodes):
    state = env.reset()
    done = False
    while not done:
        action = act(env.action_space, state)
        next_state, reward, done, _ = env.step(action)
        clear()
        env.render()
        learn(state, action, reward, next_state)
        if done:
            if reward > 0:
                wins += 1
                time.sleep(3*delay)
            else:
                time.sleep(delay)
        state = next_state   # advance to the new state before the next step

print("Goals/Holes: %d/%d" % (wins, episodes - wins))
env.close()
  7. Most of the preceding code is relatively straightforward and should be easy to follow. Look at how the env (environment) uses the action generated by the act function; this is passed to step, which applies the action and advances the agent through the environment. The step function outputs next_state, reward, and done, which we use to determine the optimum Q policy via the learn function (a quick way to inspect the learned policy is shown after these steps).
  8. Before we get into the action and learning functions, run the sample and watch how the agent trains. It may take a while to train, so feel free to return to the book.
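When training finishes, one simple way to inspect what the agent has learned is to take the greedy (highest-valued) action in every state. This is a minimal sketch, not part of Chapter_5_3.py, that assumes it is run after the training loop in the same script so that q_table and np are available:

best_actions = np.argmax(q_table, axis=1)   # highest-valued action per state
print(best_actions.reshape(4, 4))           # laid out as the 4 x 4 FrozenLake grid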

The following is an example of the OpenAI Gym FrozenLake environment running our Q-learning model:



FrozenLake Gym environment

As the sample runs, you will see a simple text output showing the environment. S represents the start, G the goal, F a frozen section, and H a hole. The goal for the agent is to find its way through the environment, without falling into a hole, and reach the goal. Pay special attention to how the agent moves and finds its way around the environment. In the next section, we unravel the learn and act functions and understand the importance of exploration.
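For reference, the default 4 x 4 map that FrozenLake-v0 renders looks like the following; in the actual console output, the agent's current square is highlighted as it moves:

SFFF
FHFH
FFFH
HFFG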
