Implementing ME-TRPO

The code of ME-TRPO is quite long, so in this section we won't give you the full code. Moreover, many parts are not particularly interesting, and all the code concerning TRPO has already been discussed in Chapter 7, TRPO and PPO Implementation. However, if you are interested in the complete implementation, or if you want to play with the algorithm, the full code is available in the GitHub repository of this chapter.

Here, we'll provide an explanation and the implementation of the following:

  • The inner cycle, where the games are simulated and the policy is optimized
  • The function that trains the models

The remaining code is very similar to that of TRPO.

The following steps will guide us through the process of building and implementing the core of ME-TRPO:

  1. Changing the policy: The only change in the interaction procedure with the real environment is the policy. In particular, the policy acts randomly on the first episode but, on the subsequent ones, it samples the actions from a Gaussian distribution with a random standard deviation that is fixed at the start of the algorithm. This change is done by replacing the line act, val = sess.run([a_sampl, s_values], feed_dict={obs_ph:[env.n_obs]}) in the TRPO implementation with the following lines of code:
...
if ep == 0:
    act = env.action_space.sample()
else:
    act = sess.run(a_sampl, feed_dict={obs_ph:[env.n_obs], log_std:init_log_std})
...
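
To make the sampling concrete, the following NumPy sketch shows what evaluating a_sampl with a fixed log standard deviation amounts to. The function name sample_gaussian_action is hypothetical; it assumes that the policy network produces the mean of the action distribution and that init_log_std is the log standard deviation drawn once at the start of the algorithm:

import numpy as np

def sample_gaussian_action(mean_action, init_log_std):
    # mean_action: mean of the Gaussian, computed by the policy network
    # init_log_std: log standard deviation fixed at the start of the algorithm
    std = np.exp(init_log_std)
    return mean_action + std * np.random.randn(*np.shape(mean_action))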
  2. Fitting the deep neural networks: The neural networks learn the model of the environment from the dataset acquired in the preceding step. The dataset is divided into a training set and a validation set, where the validation set is used by the early stopping technique to determine whether it is worth continuing with the training:
...
model_buffer.generate_random_dataset()
train_obs, train_act, _, train_nxt_obs, _ = model_buffer.get_training_batch()
valid_obs, valid_act, _, valid_nxt_obs, _ = model_buffer.get_valid_batch()
print('Log Std policy:', sess.run(log_std))

for i in range(num_ensemble_models):
    train_model(train_obs, train_act, train_nxt_obs, valid_obs, valid_act, valid_nxt_obs, step_count, i)

model_buffer is an instance of the FullBuffer class that contains the samples generated by the environment, and generate_random_dataset creates two partitions for training and validation, which are then returned by calling get_training_batch and get_valid_batch.

In the next lines, each model is trained with the train_model function by passing the datasets, the current number of steps, and the index of the model that has to be trained. num_ensemble_models is the total number of models that populate the ensemble. In the ME-TRPO paper, it is shown that 5 to 10 models are sufficient. The argument, i, establishes which model of the ensemble has to be optimized.
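
FullBuffer is not listed here, but a minimal sketch of the idea behind generate_random_dataset, get_training_batch, and get_valid_batch could look like the following. The class name, the simplified return values, and the 80/20 split are assumptions made for illustration; the implementation in the repository differs in its details:

import numpy as np

class SimpleModelBuffer:
    def __init__(self, valid_fraction=0.2):
        self.obs, self.act, self.nxt_obs = [], [], []
        self.valid_fraction = valid_fraction

    def store(self, ob, ac, nxt_ob):
        # Store one transition gathered from the real environment
        self.obs.append(ob)
        self.act.append(ac)
        self.nxt_obs.append(nxt_ob)

    def generate_random_dataset(self):
        # Shuffle the indices and split them into training and validation partitions
        idx = np.random.permutation(len(self.obs))
        n_valid = int(len(idx) * self.valid_fraction)
        self.valid_idx, self.train_idx = idx[:n_valid], idx[n_valid:]

    def get_training_batch(self):
        i = self.train_idx
        return np.array(self.obs)[i], np.array(self.act)[i], np.array(self.nxt_obs)[i]

    def get_valid_batch(self):
        i = self.valid_idx
        return np.array(self.obs)[i], np.array(self.act)[i], np.array(self.nxt_obs)[i]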

  3. Generating fictitious trajectories in the simulated environments and fitting the policy:
best_sim_test = np.zeros(num_ensemble_models)

for it in range(80):
    obs_batch, act_batch, adv_batch, rtg_batch = simulate_environment(sim_env, action_op_noise, simulated_steps)

    policy_update(obs_batch, act_batch, adv_batch, rtg_batch)

This is repeated up to 80 times, or until the policy stops improving. simulate_environment collects a dataset (made up of observations, actions, advantages, and rewards-to-go) by rolling out the policy in the simulated environment (represented by the learned models). In our case, the policy is represented by the function action_op_noise, which, when given a state, returns an action following the learned policy. The environment, sim_env, is instead a model of the environment that is chosen randomly at each step from among those in the ensemble. The last argument passed to simulate_environment is simulated_steps, which establishes the number of steps to take in the fictitious environments.

Ultimately, the policy_update function does a TRPO step to update the policy with the data collected in the fictitious environments. 
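
To clarify the role of simulate_environment, here is a rough sketch of the rollout it has to perform, leaving out the computation of advantages and rewards-to-go. The function body is an assumption made for illustration, not the code of the repository:

def simulate_environment_sketch(sim_env, policy, simulated_steps):
    # Roll out the policy in the learned (simulated) environment
    obs_batch, act_batch, rew_batch = [], [], []
    obs = sim_env.reset()
    for _ in range(simulated_steps):
        act = policy(obs)                           # action from the learned policy
        nxt_obs, rew, done, _ = sim_env.step(act)   # transition predicted by a learned model
        obs_batch.append(obs)
        act_batch.append(act)
        rew_batch.append(rew)
        obs = sim_env.reset() if done else nxt_obs
    return obs_batch, act_batch, rew_batch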

  4. Implementing the early stopping mechanism and evaluating the policy: The early stopping mechanism prevents the policy from overfitting the learned models of the environment. It works by monitoring the performance of the policy on each model separately. If the policy fails to improve on more than a certain percentage of the models (or if it already reaches a near-maximum reward on most of them), the cycle is terminated, as this is a good indication that the policy has started to overfit. Note that, unlike during training, during testing the policy is evaluated on one model at a time; during training, each trajectory is produced by all the learned models of the environment:
if (it+1) % 5 == 0:
    sim_rewards = []

    for i in range(num_ensemble_models):
        sim_m_env = NetworkEnv(gym.make(env_name), model_op, pendulum_reward, pendulum_done, i+1)
        mn_sim_rew, _ = test_agent(sim_m_env, action_op, num_games=5)
        sim_rewards.append(mn_sim_rew)

    sim_rewards = np.array(sim_rewards)
    if (np.sum(best_sim_test >= sim_rewards) > int(num_ensemble_models*0.7)) \
            or (len(sim_rewards[sim_rewards >= 990]) > int(num_ensemble_models*0.7)):
        break
    else:
        best_sim_test = sim_rewards

The evaluation of the policy is done every five training iterations. For each model of the ensemble, a new object of the NetworkEnv class is instantiated. It provides the same functionality as a real environment but, under the hood, it returns transitions from a learned model of the environment. NetworkEnv does this by inheriting from gym.Wrapper and overriding the reset and step functions. The first parameter of the constructor is a real environment that is used merely to get a real initial state, while model_op is a function that, given a state and an action, produces the next state. Lastly, pendulum_reward and pendulum_done are functions that return the reward and the done flag; these two functions are built around the particular characteristics of the environment.
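
As a reference, a minimal sketch of such a wrapper could look as follows. It assumes that model_op(obs, act, model_idx) returns the predicted next observation and that the reward and done functions take the observation (and action) as input; the actual class in the repository may differ in its signatures:

import gym

class NetworkEnvSketch(gym.Wrapper):
    def __init__(self, env, model_op, reward_fn, done_fn, model_idx):
        super().__init__(env)
        self.model_op = model_op      # predicts the next observation with one learned model
        self.reward_fn = reward_fn    # e.g., pendulum_reward
        self.done_fn = done_fn        # e.g., pendulum_done
        self.model_idx = model_idx    # which model of the ensemble to query

    def reset(self, **kwargs):
        # Only the initial state comes from the real environment
        self.obs = self.env.reset(**kwargs)
        return self.obs

    def step(self, action):
        # The transition is generated by the learned model, not by the real environment
        nxt_obs = self.model_op(self.obs, action, self.model_idx)
        reward = self.reward_fn(nxt_obs, action)
        done = self.done_fn(nxt_obs)
        self.obs = nxt_obs
        return nxt_obs, reward, done, {}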

  5. Training the dynamics model: The train_model function optimizes a model to predict the next state. It is quite simple to understand. We used this function in step 2, when we were training the ensemble of models. train_model is an inner function and takes the arguments that we saw earlier. On each iteration of the outer ME-TRPO loop, we retrain all the models; that is, we train the models starting from their random initial weights rather than resuming from the preceding optimization. Hence, every time train_model is called, and before the training takes place, we restore the initial random weights of the model. The following code snippet restores the weights and computes the loss before and after this operation:
def train_model(tr_obs, tr_act, tr_nxt_obs, v_obs, v_act, v_nxt_obs, step_count, model_idx):
    mb_valid_loss1 = run_model_loss(model_idx, v_obs, v_act, v_nxt_obs)

    model_assign(model_idx, initial_variables_models[model_idx])

    mb_valid_loss = run_model_loss(model_idx, v_obs, v_act, v_nxt_obs)

run_model_loss returns the loss of the current model, and model_assign restores the parameters that are in initial_variables_models[model_idx].

We then train the model, as long as the loss on the validation set improved in the last model_iter iterations. But because the best model may not be the last one, we keep track of the best one and restore its parameters at the end of the training. We also randomly shuffle the dataset and divide it into mini-batches. The code is as follows:

acc_m_losses = []
last_m_losses = []
md_params = sess.run(models_variables[model_idx])
best_mb = {'iter':0, 'loss':mb_valid_loss, 'params':md_params}
it = 0

lb = len(tr_obs)
shuffled_batch = np.arange(lb)
np.random.shuffle(shuffled_batch)

while best_mb['iter'] > it - model_iter:

    # update the model on each mini-batch
    last_m_losses = []
    for idx in range(0, lb, model_batch_size):
        minib = shuffled_batch[idx:min(idx+model_batch_size, lb)]

        _, ml = run_model_opt_loss(model_idx, tr_obs[minib], tr_act[minib], tr_nxt_obs[minib])
        acc_m_losses.append(ml)
        last_m_losses.append(ml)

    # Check if the loss on the validation set has improved
    mb_valid_loss = run_model_loss(model_idx, v_obs, v_act, v_nxt_obs)
    if mb_valid_loss < best_mb['loss']:
        best_mb['loss'] = mb_valid_loss
        best_mb['iter'] = it
        best_mb['params'] = sess.run(models_variables[model_idx])

    it += 1

# Restore the model with the lowest validation loss
model_assign(model_idx, best_mb['params'])

print('Model:{}, iter:{} -- Old Val loss:{:.6f} New Val loss:{:.6f} -- New Train loss:{:.6f}'.format(model_idx, it, mb_valid_loss1, best_mb['loss'], np.mean(last_m_losses)))

run_model_opt_loss is a function that executes the optimizer of the model with the model_idx index.
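
The loss behind run_model_loss and run_model_opt_loss is not listed here. A common choice, and a reasonable assumption for this sketch, is the mean squared error between the predicted and the real next observation, minimized with Adam. The following TensorFlow 1.x snippet shows how such a graph could be defined for a single model of the ensemble; the network size, learning rate, and dimensions are arbitrary example values:

import tensorflow as tf

# Dimensions of the observation and action spaces (example values)
obs_dim, act_dim = 3, 1

m_obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32, name='m_obs')
m_act_ph = tf.placeholder(shape=(None, act_dim), dtype=tf.float32, name='m_act')
m_nxt_obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32, name='m_nxt_obs')

# Small MLP that predicts the next observation from the current observation and action
x = tf.concat([m_obs_ph, m_act_ph], axis=1)
x = tf.layers.dense(x, 120, activation=tf.nn.relu)
pred_nxt_obs = tf.layers.dense(x, obs_dim)

# Mean squared error between the predicted and the real next observations;
# run_model_opt_loss would execute ops analogous to m_opt and m_loss
m_loss = tf.reduce_mean(tf.square(pred_nxt_obs - m_nxt_obs_ph))
m_opt = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(m_loss)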

This concludes the implementation of ME-TRPO. In the next section, we'll see how it performs.
