First DRL with Deep Q-learning

Now that we understand the reinforcement learning process in detail, we can look at adapting our Q-learning model to work with deep learning. This, as you could likely guess, is the culmination of our efforts and where the true power of RL shines. As we learned in earlier chapters, deep learning is essentially a complex system of equations that maps inputs through a non-linear function to generate a trained output.

A neural network is just another method of solving a non-linear equation. We will look at how to use a DNN to solve other equations later, but for now we will focus on using it to solve the Q-learning equation we saw in the previous section.

We will use the CartPole training environment from the OpenAI Gym toolkit. This environment is more or less the standard one used to learn deep Q-learning with a Deep Q-Network (DQN).
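If you want to quickly confirm what this environment exposes, the following short check (an illustrative snippet, not part of Chapter_5_4.py) shows that CartPole provides a four-value state and two discrete actions:

import gym

env = gym.make('CartPole-v1')
print(env.observation_space.shape)  # (4,) - cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)           # 2 - push the cart left or right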

Open up Chapter_5_4.py and follow the next steps to see how we convert our solver to use deep learning:

  1. As usual, we look at the imports and some initial starting parameters, as follows:
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

EPISODES = 1000
  2. Next, this time we are going to create a class to contain the functionality of the DQN agent. The __init__ function is as follows:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0   # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()
  3. Most of the parameters have already been covered, but note a new one called memory, which is a deque collection that holds the last 2,000 steps. This allows us to batch train our neural network in a form of replay mode known as experience replay, as the short example after this step shows.
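Here is a quick illustration (not from the sample code) of how a bounded deque behaves:

from collections import deque

memory = deque(maxlen=3)
for step in range(5):
    memory.append(step)
print(memory)  # deque([2, 3, 4], maxlen=3) - the oldest entries are discarded automatically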
  4. Next, we look at how the neural network model is built with the _build_model function, as follows:
    def _build_model(self):
        # Neural Net for Deep-Q learning Model
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse',
                      optimizer=Adam(lr=self.learning_rate))
        return model
  5. This builds a fairly simple model, compared to others we have already seen, with three dense layers, the last of which outputs a value for each action. The input into this network is the state; see the illustrative snippet after this step.
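To make the shapes concrete, here is an illustrative call (the state values are made up, and model stands for the network returned by _build_model when state_size is 4 and action_size is 2):

state = np.reshape([0.02, 0.01, -0.03, 0.04], [1, 4])  # a single CartPole state
q_values = model.predict(state)                        # shape (1, 2): one estimated Q value per action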
  6. Jump down to the bottom of the file and look at the training iteration loop, shown as follows:
if __name__ == "__main__":
    env = gym.make('CartPole-v1')
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    agent = DQNAgent(state_size, action_size)
    # agent.load("./save/cartpole-dqn.h5")
    done = False
    batch_size = 32

    for e in range(EPISODES):
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        for time in range(500):
            action = agent.act(state)
            env.render()
            next_state, reward, done, _ = env.step(action)
            reward = reward if not done else -10
            next_state = np.reshape(next_state, [1, state_size])
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done:
                print("episode: {}/{}, score: {}, e: {:.2}"
                      .format(e, EPISODES, time, agent.epsilon))
                break
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)

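One method the loop calls that we have not listed yet is agent.act, which selects the action for each step. The body below is a minimal epsilon-greedy sketch of what such a method typically looks like, rather than a verbatim listing from Chapter_5_4.py:

    def act(self, state):
        # Explore: with probability epsilon, choose a random action
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        # Exploit: otherwise choose the action with the highest predicted Q value
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])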
  7. In this sample, our training takes place in a real-time render loop. The important parts of the code are the reshaping of the state and the call to the agent.remember function; the agent.replay call at the end of the inner loop is where the network trains. The remember function is as follows:

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
  8. This function simply stores the state, action, reward, next_state, and done parameters for the replay training. Scroll down further to the replay function, as follows:
    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
  9. The replay function is where the network training occurs. We first define a minibatch, built from a random sample of batch_size previous experiences held in memory. Then, we loop through the minibatch, setting target to the reward and, if the episode is not done, recalculating target from the model's prediction on next_state using the Q-learning equation. After that, we use the model.predict function on the current state to get the network's current output and replace the value for the chosen action with the new target. Finally, we use the model.fit function to backpropagate that target back into the network. After the loop, epsilon is decayed toward epsilon_min, gradually shifting the agent from exploration to exploitation.
    As this section is important, let's reiterate. Note the lines where the variable target is calculated and set. These lines of code may look familiar, as they match the Q-value equation we saw earlier. This target value is the value that should be predicted for the current action; it is the value that is backpropagated for that action, driven by the returned reward, as the worked example after this step shows.
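As a worked example with made-up numbers, suppose an experience has reward = 1.0, gamma = 0.95, action 1 was taken, and the model predicts Q values of [0.4, 0.6] for next_state and [0.5, 0.3] for state:

target = 1.0 + 0.95 * 0.6   # reward + gamma * max Q(next_state) = 1.57
target_f = [[0.5, 0.3]]     # the current prediction for state
target_f[0][1] = target     # the network is then fit toward [[0.5, 1.57]]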
  10. Run the sample and watch the agent train to balance the pole on the cart. The following shows the environment as it is being trained:
CartPole OpenAI Gym environment

CartPole is the typical first environment we use when learning to build our first DRL model. In the next section, we will look at how to use the DQNAgent in other scenarios, along with other models supplied through the Keras-RL API.
