Implementing REINFORCE

It's time to implement REINFORCE. Here, we provide a bare-bones implementation of the algorithm, without the debugging and monitoring procedures. The complete implementation is available in the GitHub repository, so make sure that you check it out.

The code is divided into three main functions and one class:

  • REINFORCE(env_name, hidden_sizes, lr, num_epochs, gamma, steps_per_epoch): This is the function that contains the main implementation of the algorithm.
  • Buffer: This is a class that is used to temporarily store the trajectories.
  • mlp(x, hidden_layers, output_size, activation, last_activation): This is used to build a multi-layer perceptron in TensorFlow.
  • discounted_rewards(rews, gamma): This computes the discounted reward to go.

We'll first look at the main REINFORCE function, and then implement the supplementary functions and class. 
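All the snippets that follow assume that NumPy, TensorFlow, and Gym are imported at the top of the script, as they are in the complete script in the repository:

import numpy as np
import tensorflow as tf
import gym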

The REINFORCE function is divided into two main parts. In the first part, the computational graph is created, while in the second, the environment is run and the policy is optimized cyclically until a convergence criterion is met.

The REINFORCE function takes as input the name of the environment, env_name; a list with the sizes of the hidden layers, hidden_sizes; the learning rate, lr; the number of training epochs, num_epochs; the discount factor, gamma; and the minimum number of steps per epoch, steps_per_epoch. Formally, the heading of REINFORCE is as follows:

def REINFORCE(env_name, hidden_sizes=[32], lr=5e-3, num_epochs=50, gamma=0.99, steps_per_epoch=100):

At the beginning of REINFORCE(..), the TensorFlow default graph is reset, an environment is created, the placeholders are initialized, and the policy is created. The policy is a fully connected multi-layer perceptron with one output for each action and a tanh activation on each hidden layer. The outputs of the multi-layer perceptron are the unnormalized values of the actions, called logits. All this is done in the following snippet:

def REINFORCE(env_name, hidden_sizes=[32], lr=5e-3, num_epochs=50, gamma=0.99, steps_per_epoch=100):

    tf.reset_default_graph()

    env = gym.make(env_name)
    obs_dim = env.observation_space.shape
    act_dim = env.action_space.n

    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')
    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')

    p_logits = mlp(obs_ph, hidden_sizes, act_dim, activation=tf.tanh)

We can then create an operation that will compute the loss function, and one that will optimize the policy. The code is similar to the code that we saw earlier, in The policy section. The only difference is that now the actions are sampled by tf.random.multinomial, which follows the action distribution that is returned by the policy. This function draws samples from a categorical distribution; in our case, it chooses a single action (depending on the environment, it could be more than one).
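To see the sampling step in isolation, here is a toy snippet (not part of the algorithm; the logits values are made up) that shows how tf.random.multinomial draws an action index from unnormalized log probabilities:

# Toy example: sample from the categorical distribution defined by the logits.
# Action 0 has the largest logit, so it is drawn most of the time.
logits = tf.constant([[2.0, 0.5, 0.1]])
sample = tf.squeeze(tf.random.multinomial(logits, 1))  # a single action index

with tf.Session() as sess:
    print([int(sess.run(sample)) for _ in range(5)])  # for example, [0, 0, 1, 0, 0]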

The following snippet is the implementation of the REINFORCE update:

    act_multn = tf.squeeze(tf.random.multinomial(p_logits, 1))
    actions_mask = tf.one_hot(act_ph, depth=act_dim)
    p_log = tf.reduce_sum(actions_mask * tf.nn.log_softmax(p_logits), axis=1)
    p_loss = -tf.reduce_mean(p_log * ret_ph)
    p_opt = tf.train.AdamOptimizer(lr).minimize(p_loss)

A mask is created over the actions that were chosen during the interaction with the environment and multiplied by log_softmax in order to obtain the log probability of the selected actions, log π(a|s). Then, the full loss function is computed. Be careful: there is a minus sign in front of tf.reduce_mean. We are interested in the maximization of the objective function, but because the optimizer needs a function to minimize, we have to pass it a loss function. The last line optimizes the PG loss function using AdamOptimizer.
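If the masking trick is not immediately clear, the following NumPy sketch (purely illustrative, with made-up numbers) mimics the three TensorFlow operations above:

# Made-up log-softmax outputs for two states, and the actions chosen there
log_probs = np.array([[-0.2, -1.8, -2.5],
                      [-1.1, -0.4, -2.0]])
actions = np.array([0, 1])

mask = np.eye(3)[actions]                 # plays the role of tf.one_hot
p_log = np.sum(mask * log_probs, axis=1)  # plays the role of the masked reduce_sum
print(p_log)                              # [-0.2 -0.4]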

We are now ready to start a session, reset the global variables of the computational graph, and initialize some further variables that we'll use later:

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    step_count = 0
    train_rewards = []
    train_ep_len = []

Then, we create the two inner cycles that will interact with the environment to gather experience and optimize the policy, and print a few statistics: 

    for ep in range(num_epochs):
        obs = env.reset()
        buffer = Buffer(gamma)
        env_buf = []
        ep_rews = []

        while len(buffer) < steps_per_epoch:

            # run the policy
            act = sess.run(act_multn, feed_dict={obs_ph:[obs]})
            # take a step in the environment
            obs2, rew, done, _ = env.step(np.squeeze(act))

            env_buf.append([obs.copy(), rew, act])
            obs = obs2.copy()
            step_count += 1
            ep_rews.append(rew)

            if done:
                # add the full trajectory to the buffer
                buffer.store(np.array(env_buf))
                env_buf = []
                train_rewards.append(np.sum(ep_rews))
                train_ep_len.append(len(ep_rews))
                obs = env.reset()
                ep_rews = []

        obs_batch, act_batch, ret_batch = buffer.get_batch()
        # Policy optimization
        sess.run(p_opt, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch})

        # Print some statistics
        if ep % 10 == 0:
            print('Ep:%d MnRew:%.2f MxRew:%.1f EpLen:%.1f Buffer:%d -- Step:%d --' % (ep, np.mean(train_rewards), np.max(train_rewards), np.mean(train_ep_len), len(buffer), step_count))
            train_rewards = []
            train_ep_len = []

    env.close()

The two cycles follow the usual flow, with the exception that the interaction with the environment stops only once a trajectory has ended and the temporary buffer contains at least steps_per_epoch transitions. Because the buffer grows only when a complete trajectory is stored, the policy is always optimized on full trajectories.

We can now implement the Buffer class that contains the data of the trajectories:

class Buffer():
    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.obs = []
        self.act = []
        self.ret = []

    def store(self, temp_traj):
        if len(temp_traj) > 0:
            self.obs.extend(temp_traj[:,0])
            ret = discounted_rewards(temp_traj[:,1], self.gamma)
            self.ret.extend(ret)
            self.act.extend(temp_traj[:,2])

    def get_batch(self):
        return self.obs, self.act, self.ret

    def __len__(self):
        assert(len(self.obs) == len(self.act) == len(self.ret))
        return len(self.obs)

And finally, we can implement the function that creates the neural network with an arbitrary number of hidden layers:

def mlp(x, hidden_layers, output_size, activation=tf.nn.relu, last_activation=None):
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_size, activation=last_activation)

Here, activation is the non-linear function that is applied to the hidden layers, and last_activation is the non-linearity that is applied to the output layer.
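The last helper from the list at the beginning of this section, discounted_rewards, is implemented in the repository. A minimal sketch that computes the reward to go with a single backward pass over the rewards of a trajectory could look as follows:

def discounted_rewards(rews, gamma):
    # reward to go: rtg[i] = rews[i] + gamma * rtg[i+1]
    rtg = np.zeros_like(rews, dtype=np.float32)
    rtg[-1] = rews[-1]
    for i in reversed(range(len(rews)-1)):
        rtg[i] = rews[i] + gamma * rtg[i+1]
    return rtg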

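As a quick test, the algorithm can be launched on any Gym environment with a discrete action space. The environment name and hyperparameters that follow are only an example, not necessarily the ones used in the repository:

if __name__ == '__main__':
    # CartPole-v0 is just an illustrative choice of a discrete-action environment
    REINFORCE('CartPole-v0', hidden_sizes=[64], lr=8e-3, num_epochs=500, gamma=0.99, steps_per_epoch=1000)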