Creating the learner's computational graph

All the following code is located inside a function called DAgger, which takes as arguments the hyperparameters that we'll see used throughout the code.
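As a rough orientation, the enclosing function might look like the following skeleton. The argument names hidden_sizes and p_lr are those used in the snippets below; everything else about the signature (the environment name and the number of DAgger iterations, for example) is only an assumption for illustration:

    def DAgger(env_name, hidden_sizes=(16,), p_lr=1e-3, dagger_iterations=8):
        # hypothetical signature: the hyperparameters are consumed by the
        # snippets shown in this section; the rest of the body (environment
        # setup, expert loading, training loop) is built up piece by piece
        ...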

The learner's computational graph is simple, as its only goal is to build a classifier. In our case, there are only two actions to predict: one for doing nothing and one for making the bird flap its wings. We can instantiate two placeholders, one for the input state and one for the ground-truth actions, which are those of the expert. Each action is an integer corresponding to the action taken; with two possible actions, it is either 0 (do nothing) or 1 (fly).

The steps to build such a computational graph are the following:

  1. Create a deep neural network, specifically, a fully connected multilayer perceptron with ReLU activation functions in the hidden layers and a linear activation on the final layer.
  2. For every input state, take the action with the highest value. This is done using the tf.math.argmax(tensor,axis) function with axis=1.
  3. Convert the action placeholder into a one-hot tensor. This is needed because the logits and labels that we'll use in the loss function must have shape [batch_size, num_classes], whereas our labels, named act_ph, have shape [batch_size]. Therefore, we convert them to the desired shape with one-hot encoding. tf.one_hot is the TensorFlow function that does just that.
  4. Create the loss function. We use the softmax cross-entropy loss, a standard loss function for discrete classification with mutually exclusive classes, just like in our case. It is computed between the logits and the one-hot labels with tf.nn.softmax_cross_entropy_with_logits_v2(labels, logits).
  5. Lastly, the mean of the softmax cross-entropy is computed across the batch and minimized using Adam.

These five steps are implemented in the following lines:

    obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')

    p_logits = mlp(obs_ph, hidden_sizes, act_dim, tf.nn.relu, last_activation=None)
    act_max = tf.math.argmax(p_logits, axis=1)
    act_onehot = tf.one_hot(act_ph, depth=act_dim)

    p_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=act_onehot, logits=p_logits))
    p_opt = tf.train.AdamOptimizer(p_lr).minimize(p_loss)
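The mlp helper that builds the multilayer perceptron is defined elsewhere in the code; as a reference, a minimal sketch consistent with the call above could look like the following (only the signature is taken from the snippet, the body is an assumption):

    def mlp(x, hidden_layers, output_layer, activation=tf.nn.relu, last_activation=None):
        # stack fully connected hidden layers with the given activation
        for units in hidden_layers:
            x = tf.layers.dense(x, units=units, activation=activation)
        # final layer with act_dim units and a linear (None) activation
        return tf.layers.dense(x, units=output_layer, activation=last_activation)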

We can then initialize a session and the global variables, and define a function, learner_policy(state). Given a state, this function returns the action to which the learner assigns the highest value (this is the same thing we did for the expert):

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    def learner_policy(state):
        action = sess.run(act_max, feed_dict={obs_ph: [state]})
        return np.squeeze(action)
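For example, given a single observation from the environment (here, env is a hypothetical handle to the Flappy Bird environment created earlier in DAgger), learner_policy returns a scalar action index:

    state = env.reset()              # env is assumed to be created beforehand
    action = learner_policy(state)   # 0 (do nothing) or 1 (fly)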