Workers

The workers' functionality is defined in the worker function, which was previously passed as an argument to mp.Process. Going through all the code would take too much time and space, so here we only explain the core components. As always, the full implementation is available in this book's GitHub repository; if you are interested in studying it in more depth, take the time to examine the code there.

In the first few lines of worker, the computational graph that runs and optimizes the policy is created. Specifically, the policy is a multi-layer perceptron with tanh activations. Adam is used to apply the expected gradient, which is computed following the second term of formula (11.2).
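The exact graph-building code isn't essential for following the algorithm, but a minimal sketch of such a policy network in TensorFlow 1.x could look like the following. Here, obs_dim, act_dim, and lr are illustrative placeholders rather than the repository's actual names or values:

import tensorflow as tf

# Illustrative sizes and learning rate; the real values depend on the environment
obs_dim, act_dim, lr = 8, 4, 0.02

obs_ph = tf.placeholder(tf.float32, shape=[None, obs_dim])
x = tf.layers.dense(obs_ph, 32, activation=tf.tanh)
x = tf.layers.dense(x, 32, activation=tf.tanh)
action = tf.layers.dense(x, act_dim)

# Adam will later be used to apply the estimated gradient to these weights
optimizer = tf.train.AdamOptimizer(lr)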

Then, agent_op(o) and evaluation_on_noise(noise) are defined. The former runs the policy (or candidate solution) to obtain the action for a given state or observation, o, and the latter evaluates a new candidate solution obtained by adding the perturbation noise (which has the same shape as the policy's parameters) to the current policy's parameters.
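To get an intuition of what evaluation_on_noise does, consider the following toy NumPy illustration: a made-up fitness function is evaluated on a perturbed copy of the parameters (toy_fitness, params, and std_noise are stand-ins invented for this sketch, not the book's code). In the real worker, the fitness is the total reward obtained by running the perturbed policy in the environment:

import numpy as np

def toy_fitness(params):
    # Stand-in for the return of an episode played with these parameters
    return -np.sum(params ** 2)

params = np.full(10, 0.5)    # stand-in for the flattened policy weights
std_noise = 0.05             # standard deviation of the perturbation
noise = np.random.normal(size=params.shape)

pos_rew = toy_fitness(params + std_noise * noise)   # candidate built from +noise
neg_rew = toy_fitness(params - std_noise * noise)   # mirrored candidate built from -noise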

Jumping directly to the most interesting part, we create a new session, specifying that it can rely on at most four CPUs, and initialize the global variables. Don't worry if you don't have four CPUs available: setting allow_soft_placement to True tells TensorFlow to fall back automatically to a supported device if the requested one isn't available:

    sess = tf.Session(config=tf.ConfigProto(device_count={'CPU': 4}, allow_soft_placement=True))
    sess.run(tf.global_variables_initializer())

Although up to four CPUs are available, we allocate only one to each worker. When defining the computational graph, we set the device on which the computation will be performed. For example, to specify that a worker has to use only CPU 0, you can put the graph inside a with statement that defines the device to use:

with tf.device("/cpu:0"):
    # graph to compute on CPU 0
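For example, assuming each worker is given an integer index (a hypothetical worker_index argument; the repository may organize this differently), it could pin its graph to its own core like this:

import tensorflow as tf

worker_index = 0  # hypothetical: assumed to be passed to worker() by the main process

with tf.device("/cpu:%d" % worker_index):
    # every op created in this block is placed on that worker's CPU
    dummy = tf.constant(0.0)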

Going back to our implementation, we loop forever, or at least as long as the worker has something to do; this condition is checked later, inside the while loop.

An important thing to note is that, because we perform many calculations on the weights of the neural network, it is much easier to deal with flattened weights. So, for example, instead of dealing with a list of weight tensors with shapes such as [8,32], [32,32], and [32,4], we'll perform the computations on a single one-dimensional array of length 8*32 + 32*32 + 32*4. The functions that perform the conversion from the former to the latter, and vice versa, are defined in TensorFlow (take a look at the full implementation on GitHub if you are interested in knowing how this is done).
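As a rough idea of how this can be done, the following sketch concatenates the trainable variables into a single vector and shows how a flat vector can be sliced back and assigned to the original shapes. The names flat_weights, flat_weights_ph, and assign_ops are illustrative, not the repository's, and the snippet assumes the policy graph has already been built (so tf.trainable_variables() is not empty):

import numpy as np
import tensorflow as tf

variables = tf.trainable_variables()

# Flatten: reshape every weight tensor to 1D and concatenate into a single vector
flat_weights = tf.concat([tf.reshape(v, [-1]) for v in variables], axis=0)

# Unflatten: slice a flat vector back into the original shapes and assign it
flat_weights_ph = tf.placeholder(tf.float32, shape=flat_weights.shape)
assign_ops, start = [], 0
for v in variables:
    size = int(np.prod(v.shape.as_list()))
    assign_ops.append(v.assign(tf.reshape(flat_weights_ph[start:start + size], v.shape)))
    start += size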

Also, before starting the while loop, we retrieve the shape of the flattened agent:

    agent_flatten_shape = sess.run(agent_variables_flatten).shape

    while True:

In the first part of the while loop, the candidates are generated and evaluated. The candidate solutions are built by adding a normal perturbation to the weights; that is, $\theta_t + \sigma\epsilon$. This is done by choosing a new random seed every time, which uniquely determines the perturbation (or noise), $\epsilon$, sampled from a normal distribution. This is a key part of the algorithm because, later, the other workers will have to regenerate the same perturbation from the same seed. After that, the two new offspring (there are two because we are using mirror sampling) are evaluated, and the results are put in the output_queue queue:

        for _ in range(indiv_per_worker):
            seed = np.random.randint(1e7)

            with temp_seed(seed):
                sampled_noise = np.random.normal(size=agent_flatten_shape)

            pos_rew = evaluation_on_noise(sampled_noise)
            neg_rew = evaluation_on_noise(-sampled_noise)

            output_queue.put([[pos_rew, neg_rew], seed])

Note that the following snippet (which we used previously) is just a way to set the NumPy random seed, seed, locally:

with temp_seed(seed):
    ...

Outside the with statement, the seed that's used to generate random values will not be seed anymore.
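The actual implementation of temp_seed is in the repository, but such a context manager is commonly written along the following lines (a sketch, not necessarily the book's exact code): save the global NumPy random state, seed it locally, and restore it on exit:

import contextlib
import numpy as np

@contextlib.contextmanager
def temp_seed(seed):
    state = np.random.get_state()   # save the current global random state
    np.random.seed(seed)            # seed NumPy locally
    try:
        yield
    finally:
        np.random.set_state(state)  # restore the previous state on exit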

The second part of the while loop involves acquiring all the returns and seeds, reconstructing the perturbations from those seeds, computing the stochastic gradient estimate following formula (11.2), and optimizing the policy. The params_queue queue is populated by the main process we saw earlier, which sends the normalized ranks and the seeds of the population generated by the workers in the first phase. The code is as follows:

        batch_return, batch_seed = params_queue.get()
        batch_noise = []

        # reconstruction of the perturbations used to generate the individuals
        for seed in batch_seed:
            with temp_seed(seed):
                sampled_noise = np.random.normal(size=agent_flatten_shape)

            batch_noise.append(sampled_noise)
            batch_noise.append(-sampled_noise)

        # computation of the gradient estimate following formula (11.2)
        vars_grads = np.zeros(agent_flatten_shape)
        for n, r in zip(batch_noise, batch_return):
            vars_grads += n * r

        vars_grads /= len(batch_noise) * std_noise

        sess.run(apply_g, feed_dict={new_weights_ph: -vars_grads})

The last few lines in the preceding code compute the gradient estimate; that is, they calculate the second term of formula (11.2):

$$\frac{1}{n\sigma}\sum_{i=1}^{n} F_i \, \epsilon_i$$

Here, $F_i$ is the normalized rank of the $i$-th candidate and $\epsilon_i$ is its perturbation.

apply_g is the operation that applies the vars_grads gradient (11.3) using Adam. Note that we pass -vars_grads because we want to perform gradient ascent, not gradient descent.
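To make this concrete, here is a sketch of one way apply_g could be defined: each slice of the flat gradient placeholder is paired with the corresponding variable and handed to Adam's apply_gradients. The learning rate and the overall structure are assumptions; the repository may organize this differently, and new_weights_ph is assumed to be the flat-gradient placeholder fed with -vars_grads in the snippet above:

import numpy as np
import tensorflow as tf

variables = tf.trainable_variables()
num_params = sum(int(np.prod(v.shape.as_list())) for v in variables)

# Placeholder fed with the flat (negated) gradient estimate
new_weights_ph = tf.placeholder(tf.float32, shape=[num_params])

optimizer = tf.train.AdamOptimizer(learning_rate=0.02)  # assumed learning rate
grads_and_vars, start = [], 0
for v in variables:
    size = int(np.prod(v.shape.as_list()))
    grads_and_vars.append((tf.reshape(new_weights_ph[start:start + size], v.shape), v))
    start += size

apply_g = optimizer.apply_gradients(grads_and_vars)

Because Adam receives the gradient through apply_gradients, it applies its adaptive learning rates exactly as if the gradient had come from backpropagation.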

That's all for the implementation. Now, we have to apply it to an environment and test it to see how it performs.
