The computational graph and training loop

The core of the algorithm, namely the computational graph and the training (and evaluation) loop, is implemented in the DQN function, which takes the name of the environment and all the other hyperparameters as arguments:

def DQN(env_name, hidden_sizes=[32], lr=1e-2, num_epochs=2000, buffer_size=100000,
        discount=0.99, update_target_net=1000, batch_size=64, update_freq=4,
        frames_num=2, min_buffer_size=5000, test_frequency=20, start_explor=1,
        end_explor=0.1, explor_steps=100000):

    env = make_env(env_name, frames_num=frames_num, skip_frames=True, noop_num=20)
    env_test = make_env(env_name, frames_num=frames_num, skip_frames=True, noop_num=20)
    env_test = gym.wrappers.Monitor(env_test, "VIDEOS/TEST_VIDEOS"+env_name+str(current_milli_time()),
                                    force=True, video_callable=lambda x: x%20==0)

    obs_dim = env.observation_space.shape
    act_dim = env.action_space.n

In the first few lines of the preceding code, two environments are created: one for training and one for testing. gym.wrappers.Monitor is a Gym wrapper that saves the episodes of an environment as videos, while video_callable is a function argument that establishes how often the videos are saved; in this case, every 20 episodes.

Then, we can reset the TensorFlow graph and create placeholders for the observations, the actions, and the target values. This is done with the following lines of code:

    tf.reset_default_graph()

    obs_ph = tf.placeholder(shape=(None, obs_dim[0], obs_dim[1], obs_dim[2]), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')
    y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')

Now, we can create the target and online networks by calling the qnet function that we defined previously. Because the target network has to be updated periodically with the parameters of the online network, we create an operation called update_target_op that assigns every variable of the online network to the corresponding variable of the target network. The assignment is done with the TensorFlow assign method, while tf.group aggregates every element of the update_target list into a single operation. The implementation is as follows:

    with tf.variable_scope('target_network'):
        target_qv = qnet(obs_ph, hidden_sizes, act_dim)
    target_vars = tf.trainable_variables()

    with tf.variable_scope('online_network'):
        online_qv = qnet(obs_ph, hidden_sizes, act_dim)
    train_vars = tf.trainable_variables()

    update_target = [train_vars[i].assign(train_vars[i + len(target_vars)])
                     for i in range(len(train_vars) - len(target_vars))]
    update_target_op = tf.group(*update_target)
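The list comprehension above relies on the fact that tf.trainable_variables() returns the variables in creation order, with the target network's variables first and the online network's right after. If you prefer something more explicit, an equivalent update operation can be built by fetching each scope's variables by name and pairing them up; the following is just a sketch of an alternative, assuming the two scopes defined above:

    # Alternative sketch: build the same update op by pairing the variables of the two scopes
    t_vars = tf.trainable_variables(scope='target_network')
    o_vars = tf.trainable_variables(scope='online_network')
    update_target_op = tf.group(*[t.assign(o) for t, o in zip(t_vars, o_vars)])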

Now that we have defined the placeholders, created the deep neural networks, and defined the target update operation, all that remains is to define the loss function. The loss function is the mean squared error between the target values and the Q-values of the online network (or, equivalently, formula (5.5)). It requires the target values, computed as in formula (5.6) (that is, y = r for terminal transitions and y = r + γ max_a' Q_target(s', a') otherwise), which are passed through the y_ph placeholder, and the Q-values of the online network, Q(s, a). A Q-value depends on the action, a, but since the online network outputs a value for each action, we have to find a way to retrieve only the Q-value of the chosen action while discarding the other action-values. This operation can be achieved by using a one-hot encoding of the action and then multiplying it element-wise by the output of the online network. For example, if there are five possible actions and the chosen action is the fourth one, the one-hot encoding will be [0, 0, 0, 1, 0]. Multiplying it by the five Q-values output by the network zeroes out every entry except the fourth, so summing the resulting vector returns exactly the Q-value of the chosen action. All of this is done in the following three lines of code:

    act_onehot = tf.one_hot(act_ph, depth=act_dim)
    q_values = tf.reduce_sum(act_onehot * online_qv, axis=1)
    v_loss = tf.reduce_mean((y_ph - q_values)**2)
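If you want to convince yourself that the one-hot trick does what we described, a quick NumPy check with made-up values (not taken from the book) looks like this:

import numpy as np

q_out = np.array([1.2, -0.3, 0.5, 2.1, 0.0])  # hypothetical Q-values for 5 actions
a_onehot = np.eye(5)[3]                       # one-hot encoding of action 3: [0, 0, 0, 1, 0]
print(np.sum(q_out * a_onehot))               # prints 2.1, the Q-value of action 3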

To minimize the loss function we just defined, we will use Adam, a variant of SGD:

    v_opt = tf.train.AdamOptimizer(lr).minimize(v_loss)

This concludes the creation of the computation graph. Before going through the main DQN cycle, we have to prepare everything so that we can save the scalars and the histograms. By doing this, we will be able to visualize them later in TensorBoard:

    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, int(now.second))

    mr_v = tf.Variable(0.0)
    ml_v = tf.Variable(0.0)

    tf.summary.scalar('v_loss', v_loss)
    tf.summary.scalar('Q-value', tf.reduce_mean(q_values))
    tf.summary.histogram('Q-values', q_values)

    scalar_summary = tf.summary.merge_all()
    reward_summary = tf.summary.scalar('test_rew', mr_v)
    mean_loss_summary = tf.summary.scalar('mean_loss', ml_v)

    hyp_str = "-lr_{}-upTN_{}-upF_{}-frms_{}".format(lr, update_target_net, update_freq, frames_num)
    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/DQN_'+clock_time+'_'+hyp_str, tf.get_default_graph())

Everything here is quite self-explanatory. The only things that you may question are the mr_v and ml_v variables. These hold the mean test reward and the mean loss, two quantities that we want to track in TensorBoard but that aren't computed by the computational graph itself. For this reason, we have to declare them as variables and feed their values through session.run later. FileWriter is created with a unique name and associated with the default graph.
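For instance, once the session exists, a Python-side statistic can be pushed into TensorBoard through one of these variables with the same pattern that the training loop uses later on; here is a sketch with an illustrative value:

mean_test_reward = 21.0  # any value computed outside the graph
summ = sess.run(reward_summary, feed_dict={mr_v: mean_test_reward})
file_writer.add_summary(summ, step_count)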

We can now define the agent_op function that computes the forward pass on a scaled observation. The observation has already passed through the preprocessing pipeline (built in the environment with the wrappers), but we left the scaling aside:

    def agent_op(o):
        o = scale_frames(o)
        return sess.run(online_qv, feed_dict={obs_ph: [o]})
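scale_frames is defined elsewhere in the chapter; conceptually, it just normalizes the uint8 pixel values of the stacked frames to the [0, 1] range, along these lines (a sketch, not necessarily the book's exact implementation):

import numpy as np

def scale_frames(frames):
    # Convert the stacked uint8 frames to float32 values in [0, 1]
    return np.array(frames, dtype=np.float32) / 255.0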

Then, the session is created, the variables are initialized, and the environment is reset:

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    step_count = 0
    last_update_loss = []
    ep_time = current_milli_time()
    batch_rew = []

    obs = env.reset()

The next move involves instantiating the replay buffer, updating the target network so that it has the same parameters as the online network, and initializing eps_decay, the amount by which eps is decreased at every step. The policy for the epsilon decay is the same as the one adopted in the DQN paper: the decay rate is chosen so that, when applied linearly to the eps variable, eps reaches the terminal value, end_explor, in about explor_steps steps. For example, if you want to decrease epsilon from 1.0 to 0.1 in 1,000 steps, you have to decrement the variable by (1.0 - 0.1) / 1,000 = 9e-4 on each step. All of this is accomplished in the following lines of code:

    buffer = ExperienceBuffer(buffer_size)

    sess.run(update_target_op)

    eps = start_explor
    eps_decay = (start_explor - end_explor) / explor_steps
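As a quick sanity check of this schedule, here is a standalone snippet using the hypothetical numbers from the example above (1.0 to 0.1 in 1,000 steps):

start_explor, end_explor, explor_steps = 1.0, 0.1, 1000
eps_decay = (start_explor - end_explor) / explor_steps  # 9e-4 per step
eps = start_explor
for _ in range(explor_steps):
    if eps > end_explor:
        eps -= eps_decay
print(round(eps, 4))  # ~0.1 after 1,000 steps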

As you may recall, the training loop comprises two inner cycles: the first iterates across the epochs, while the other iterates across each transition of the epoch. The first part of the innermost cycle is quite standard: it selects an action following an ε-greedy behavior policy that uses the online network, takes a step in the environment, adds the new transition to the buffer, and finally, updates the variables:

    for ep in range(num_epochs):
        g_rew = 0
        done = False

        while not done:
            act = eps_greedy(np.squeeze(agent_op(obs)), eps=eps)
            obs2, rew, done, _ = env.step(act)
            buffer.add(obs, rew, act, obs2, done)

            obs = obs2
            g_rew += rew
            step_count += 1

In the preceding code, obs takes the value of the next observation and the cumulative game reward is incremented.
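
eps_greedy is the behavior policy helper defined earlier in the chapter; a minimal version that is consistent with how it is called here could look as follows (just a sketch, not necessarily the book's exact code):

import numpy as np

def eps_greedy(action_values, eps=0.1):
    # With probability eps take a random action, otherwise take the greedy one
    if np.random.uniform(0, 1) < eps:
        return np.random.randint(len(action_values))
    return np.argmax(action_values)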

Then, in the same cycle, eps is decayed, and if certain conditions are met, the online network is trained. These conditions make sure that the buffer has reached a minimal size and that the neural network is trained only once every update_freq steps. To train the online network, first, a minibatch is sampled from the buffer and the target values are calculated. Then, the session is run to minimize the loss function, v_loss, feeding the dictionary with the target values, the actions, and the observations of the minibatch. While the session is running, it also returns v_loss and scalar_summary for statistics purposes. scalar_summary is then added to file_writer so that it is saved in the TensorBoard logging file. Finally, every update_target_net steps, the target network is updated, and a summary with the mean loss is run and added to the TensorBoard logging file. All of this is done by the following snippet of code:

            if eps > end_explor:
                eps -= eps_decay

            if len(buffer) > min_buffer_size and (step_count % update_freq == 0):
                mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)
                mb_trg_qv = sess.run(target_qv, feed_dict={obs_ph: mb_obs2})
                y_r = q_target_values(mb_rew, mb_done, mb_trg_qv, discount)  # Compute the target values
                train_summary, train_loss, _ = sess.run(
                    [scalar_summary, v_loss, v_opt],
                    feed_dict={obs_ph: mb_obs, y_ph: y_r, act_ph: mb_act})

                file_writer.add_summary(train_summary, step_count)
                last_update_loss.append(train_loss)

            if (len(buffer) > min_buffer_size) and (step_count % update_target_net) == 0:
                _, train_summary = sess.run(
                    [update_target_op, mean_loss_summary],
                    feed_dict={ml_v: np.mean(last_update_loss)})
                file_writer.add_summary(train_summary, step_count)
                last_update_loss = []
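
q_target_values is defined earlier in the chapter and implements formula (5.6): the target is y = r for terminal transitions and y = r + discount * max_a Q_target(s', a) otherwise. A vectorized sketch of such a helper might be:

import numpy as np

def q_target_values(mb_rew, mb_done, mb_trg_qv, discount):
    # y = r if the transition is terminal, y = r + discount * max_a Q_target(s', a) otherwise
    rewards = np.array(mb_rew, dtype=np.float32)
    done_mask = np.array(mb_done, dtype=np.float32)
    max_q_next = np.max(mb_trg_qv, axis=1)
    return rewards + (1.0 - done_mask) * discount * max_q_next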

When an epoch terminates, the environment is reset, the total reward of the game is appended to batch_rew, and g_rew is set back to zero. Moreover, every test_frequency epochs, the agent is tested for 10 games, and the statistics are added to file_writer. At the end of the training, the environments and the writer are closed. The code is as follows:

            if done:
                obs = env.reset()
                batch_rew.append(g_rew)
                g_rew = 0

        if ep % test_frequency == 0:
            test_rw = test_agent(env_test, agent_op, num_games=10)
            test_summary = sess.run(reward_summary, feed_dict={mr_v: np.mean(test_rw)})
            file_writer.add_summary(test_summary, step_count)
            print('Ep:%4d Rew:%4.2f, Eps:%2.2f -- Step:%5d -- Test:%4.2f %4.2f' %
                  (ep, np.mean(batch_rew), eps, step_count, np.mean(test_rw), np.std(test_rw)))
            batch_rew = []

    file_writer.close()
    env.close()
    env_test.close()
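
test_agent is also defined earlier in the chapter; it plays a number of games with an almost-greedy policy and returns the total reward of each game. A sketch of such a helper, assuming the eps_greedy helper shown earlier:

import numpy as np

def test_agent(env_test, agent_op, num_games=10, eps=0.05):
    # Play num_games episodes with a mostly greedy policy and collect the returns
    games_r = []
    for _ in range(num_games):
        o, d, game_r = env_test.reset(), False, 0
        while not d:
            a = eps_greedy(np.squeeze(agent_op(o)), eps=eps)
            o, r, d, _ = env_test.step(a)
            game_r += r
        games_r.append(game_r)
    return games_r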

That's it. We can now call the DQN function with the name of the Gym environment and all the hyperparameters:

if __name__ == '__main__':
    DQN('PongNoFrameskip-v4', hidden_sizes=[128], lr=2e-4, buffer_size=100000, update_target_net=1000,
        batch_size=32, update_freq=2, frames_num=2, min_buffer_size=10000)

There's one last note before reporting the results. The environment that's being used here isn't the default Pong-v0, but a modified version of it. The reason for this is that, in the regular version, each action is repeated 2, 3, or 4 times, with the number sampled uniformly. Because we want to skip a fixed number of frames instead, we opted for the version without the built-in skip feature, NoFrameskip, and added the custom MaxAndSkipEnv wrapper.
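
For reference, a MaxAndSkipEnv-style wrapper repeats each action for a fixed number of frames, sums the rewards, and max-pools over the last two raw frames to remove Atari flickering. The following is a minimal sketch of such a wrapper (the chapter builds its own version inside make_env):

import numpy as np
import gym

class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env, skip=4):
        super(MaxAndSkipEnv, self).__init__(env)
        # Buffer for the two most recent raw frames
        self._obs_buffer = np.zeros((2,) + env.observation_space.shape, dtype=np.uint8)
        self._skip = skip

    def step(self, action):
        # Repeat the action `skip` times, summing the rewards,
        # and return the pixel-wise max of the last two frames
        total_reward, done, info = 0.0, False, {}
        for i in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            if i == self._skip - 2:
                self._obs_buffer[0] = obs
            if i == self._skip - 1:
                self._obs_buffer[1] = obs
            total_reward += reward
            if done:
                break
        return self._obs_buffer.max(axis=0), total_reward, done, info

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)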
