Control a continuous system

Policy gradient algorithms such as REINFORCE and AC, as well as PPO and TRPO, which will be implemented in this chapter, can work with both discrete and continuous action spaces. Migrating from one type of action to the other is fairly simple. In continuous control, instead of computing a probability for each action, the actions are specified through the parameters of a probability distribution. The most common approach is to learn the parameters of a Gaussian (normal) distribution, a very important family of distributions that is parametrized by a mean, μ, and a standard deviation, σ. Examples of Gaussian distributions with different values of these parameters are shown in the following figure:

Figure 7.2. A plot of three Gaussian distributions with different means and standard deviations
For all the color references mentioned in the chapter, please refer to the color images bundle at http://www.packtpub.com/sites/default/files/downloads/9781789131116_ColorImages.pdf.
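
As a reminder, a Gaussian distribution with mean μ and standard deviation σ has the following density; its logarithm is exactly what we will compute later for the sampled actions:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$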

For example, a policy that's represented by a parametric function approximator (such as a deep neural network) can predict the mean and the standard deviation of a normal distribution as a function of the state. The mean can be approximated with a function of the state and, usually, the standard deviation does not depend on the state. In this case, we'll represent the parameterized mean as a function of the state, denoted by μ(s), and the standard deviation as a fixed value, denoted by σ. Moreover, instead of working with the standard deviation directly, it is preferable to work with the logarithm of the standard deviation, log σ.
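
Written as a formula, the continuous-control policy then takes the following form, with a state-dependent mean and a state-independent standard deviation:

$$\pi(a \mid s) = \mathcal{N}\left(a \mid \mu(s), \sigma^2\right)$$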

Wrapping this up, a parametric policy for discrete control can be defined using the following line of code:

p_logits = mlp(obs_ph, hidden_sizes, act_dim, activation=tf.nn.relu, last_activation=None)

mlp is a function that builds a multi-layer perceptron (also called a fully connected neural network) with the hidden layer sizes specified in hidden_sizes, an output of dimension act_dim, and the activations specified by the activation and last_activation arguments (a minimal sketch of such a helper is given a little further below). To turn this into a parametric policy for continuous control, we make the following changes:

p_means = mlp(obs_ph, hidden_sizes, act_dim, activation=tf.tanh, last_activation=None)
log_std = tf.get_variable(name='log_std', initializer=np.zeros(act_dim, dtype=np.float32))

Here, p_means is μ(s) and log_std is log σ.
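
For reference, a minimal version of the mlp helper used in these snippets could look like the following sketch (assuming TensorFlow 1.x dense layers; the actual helper in the book's code base may differ in its details):

import tensorflow as tf

def mlp(x, hidden_sizes, output_size, activation=tf.nn.relu, last_activation=None):
    # Stack the hidden layers, each with the chosen activation.
    for size in hidden_sizes:
        x = tf.layers.dense(x, units=size, activation=activation)
    # Final layer of dimension output_size with the (possibly None) last activation.
    return tf.layers.dense(x, units=output_size, activation=last_activation)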

Furthermore, if all the actions have to take values between -1 and 1, it is better to use a tanh function as the last activation:

p_means = mlp(obs_ph, hidden_sizes, act_dim, activation=tf.tanh, last_activation=tf.tanh)

Then, to sample from this Gaussian distribution and obtain the actions, the standard deviation is multiplied by a noise vector that follows a normal distribution with a mean of 0 and a standard deviation of 1, and the result is added to the predicted mean:

$$a = \mu(s) + \sigma z$$

Here, z is the vector of Gaussian noise, z ∼ N(0, 1), with the same shape as μ(s). This can be implemented in just one line of code:

p_noisy = p_means + tf.random_normal(tf.shape(p_means), 0, 1) * tf.exp(log_std)

Since we are adding noise, we cannot be sure that the values still lie within the action limits, so we have to clip p_noisy in such a way that the action values remain between the minimum and maximum allowed values. The clipping is done in the following line of code:

act_smp = tf.clip_by_value(p_noisy, envs.action_space.low, envs.action_space.high)

In the end, the log probability is computed as follows:

$$\log \pi(a \mid s) = -\frac{1}{2}\sum_{i=1}^{D}\left(\frac{(a_i - \mu_i(s))^2}{\sigma_i^2} + 2\log \sigma_i + \log 2\pi\right)$$

Here, D is the dimensionality of the action space. This formula is computed in the gaussian_log_likelihood function, which returns the log probability. Thus, we can retrieve the log probability as follows:

p_log = gaussian_log_likelihood(act_ph, p_means, log_std)

Here, gaussian_log_likelihood is defined in the following snippet:

def gaussian_log_likelihood(x, mean, log_std):
    # Log density of a diagonal Gaussian, evaluated per dimension and summed
    # over the action dimensions (the small constant avoids division by zero).
    log_p = -0.5 * (np.log(2*np.pi) + (x - mean)**2 / (tf.exp(log_std)**2 + 1e-9) + 2*log_std)
    return tf.reduce_sum(log_p, axis=-1)
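
As a quick sanity check (this is not part of the book's code), you can compare the output of this function against SciPy's reference log-pdf for a diagonal Gaussian; the two should agree up to the small constant added to the denominator:

import numpy as np
import tensorflow as tf
from scipy.stats import norm

x = np.array([[0.3, -1.2]], dtype=np.float32)
mean = np.array([[0.0, -1.0]], dtype=np.float32)
log_std = np.array([-0.5, 0.2], dtype=np.float32)

# TensorFlow value computed by gaussian_log_likelihood.
tf_log_p = gaussian_log_likelihood(tf.constant(x), tf.constant(mean), tf.constant(log_std))
with tf.Session() as sess:
    print(sess.run(tf_log_p))

# SciPy reference: sum of the per-dimension Gaussian log densities.
print(norm.logpdf(x, loc=mean, scale=np.exp(log_std)).sum(-1))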

That's it. Now, you can add continuous control to every PG algorithm and try it out on all sorts of environments with continuous action spaces. As you may recall, in the previous chapter, we implemented REINFORCE and AC on LunarLander. The same game is also available with continuous control and is called LunarLanderContinuous-v2.
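
For instance, you can quickly inspect the action space of the continuous version of the environment (a short illustration using the standard Gym API):

import gym

# LunarLanderContinuous-v2 has a 2-dimensional continuous action space
# (main engine and lateral thrusters), with each component bounded in [-1, 1].
env = gym.make('LunarLanderContinuous-v2')
print(env.action_space)       # for example, Box(2,)
print(env.action_space.low)   # [-1. -1.]
print(env.action_space.high)  # [ 1.  1.]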

With the necessary knowledge to tackle problems with an inherently continuous action space, you are now able to address a broader variety of tasks. Generally speaking, however, these tasks are also more difficult to solve, and the PG algorithms we've learned about so far are too weak and not well suited to hard problems. Thus, in the rest of this chapter and the ones that follow, we'll look at more advanced PG algorithms, starting with the natural policy gradient.
