Implementation

The implementation of ESBAS is straightforward, as it only involves adding a few components. The most substantial part is the definition and optimization of the off-policy algorithms in the portfolio. ESBAS does not constrain the choice of these algorithms. In the paper, both Q-learning and DQN are used. We have decided to use DQN, so as to provide an algorithm that is capable of dealing with more complex tasks, including environments with an RGB state space. We went through DQN in great detail in Chapter 5, Deep Q-Network, and for ESBAS, we'll use the same implementation.

The last thing that we need to specify before going through the implementation is the portfolio's composition. We created a portfolio that is diversified in terms of neural network architecture, but you can try other combinations. For example, you could compose the portfolio of DQN algorithms with different learning rates.
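For instance, a minimal sketch of such an alternative portfolio, assuming the same DQN_optimization constructor that we'll use later in this section (the learning rate values here are purely illustrative), could look like this:

# Hypothetical alternative portfolio: the same network architecture for every
# algorithm, but a different learning rate for each one (values are illustrative)
learning_rates = [1e-2, 1e-3, 1e-4]
dqns = []
for lr_i in learning_rates:
    dqns.append(DQN_optimization(env.observation_space.shape, env.action_space.n, hidden_sizes[0], lr_i, discount))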

The implementation is divided as follows: 

  • The DQN_optimization class builds the computational graph, and optimizes a policy with DQN.
  • The UCB1 class defines the UCB1 algorithm.
  • The ESBAS function implements the main pipeline for ESBAS.

We'll provide the implementation of the last two bullet points, but you can find the full implementation on the GitHub repository of the book: https://github.com/PacktPublishing/Reinforcement-Learning-Algorithms-with-Python.
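Although we won't reproduce the DQN_optimization class here, the rest of the code only relies on a small interface. The following is just a sketch of that interface, reconstructed from how the class is used below (the parameter names and comments are our own shorthand; the method bodies are omitted, so refer to the repository for the actual implementation):

class DQN_optimization:
    def __init__(self, obs_shape, n_actions, hidden_sizes, lr, discount):
        # Builds the online and target Q-networks and the DQN training ops
        ...

    def act(self, obs):
        # Returns the action values predicted for obs (later fed to eps_greedy)
        ...

    def optimize(self, mb_obs, mb_rew, mb_act, mb_obs2, mb_done):
        # Runs one optimization step of the DQN loss on the given minibatch
        ...

    def update_target_network(self):
        # Copies the online network's weights into the target network
        ...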

Let's start by going through ESBAS(..). Besides the hyperparameters of DQN, there's only one additional argument, xi, which represents the ξ hyperparameter used by UCB1. The main outline of the ESBAS function is the same as the pseudocode that was given previously, so we can quickly go through it.

After having defined the function with all of its arguments, we can reset the default graph of TensorFlow and create two Gym environments (one for training, and one for testing). We can then create the portfolio by instantiating a DQN_optimization object for each of the neural network sizes, and appending them to a list:

def ESBAS(env_name, hidden_sizes=[32], lr=1e-2, num_epochs=2000, buffer_size=100000, discount=0.99, render_cycle=100, update_target_net=1000, batch_size=64, update_freq=4, min_buffer_size=5000, test_frequency=20, start_explor=1, end_explor=0.1, explor_steps=100000, xi=16000):
    tf.reset_default_graph()

    env = gym.make(env_name)
    env_test = gym.wrappers.Monitor(gym.make(env_name), "VIDEOS/TEST_VIDEOS"+env_name+str(current_milli_time()), force=True, video_callable=lambda x: x%20==0)

    dqns = []
    for l in hidden_sizes:
        dqns.append(DQN_optimization(env.observation_space.shape, env.action_space.n, l, lr, discount))

Now, we define an inner function, DQNs_update, that trains the policies in the portfolio in the usual DQN way. Keep in mind that all the algorithms in the portfolio are DQNs, and that the only difference between them is the size of their neural networks. The optimization is done by the optimize and update_target_network methods of the DQN_optimization class:

    def DQNs_update(step_counter):
        # Optimize every DQN in the portfolio on the same minibatch
        if len(buffer) > min_buffer_size and (step_counter % update_freq == 0):
            mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)
            for dqn in dqns:
                dqn.optimize(mb_obs, mb_rew, mb_act, mb_obs2, mb_done)

        # Periodically synchronize the target networks
        if len(buffer) > min_buffer_size and (step_counter % update_target_net == 0):
            for dqn in dqns:
                dqn.update_target_network()

As always, we need to initialize a few (self-explanatory) variables, reset the environment, instantiate an ExperienceBuffer object (using the same class that we used in the other chapters), and set up the exploration decay:

    step_count = 0
    batch_rew = []
    episode = 0
    beta = 1

    buffer = ExperienceBuffer(buffer_size)
    obs = env.reset()

    eps = start_explor
    eps_decay = (start_explor - end_explor) / explor_steps

We can finally start the loop that iterates across the epochs. As in the preceding pseudocode, during each epoch, the following things occur:

  1. The policies are trained on the experience buffer
  2. The trajectories are run by the policy that is chosen by UCB1

The first step is done by invoking DQNs_update, which we defined earlier, for the entire length of the epoch (which grows exponentially):

    for ep in range(num_epochs):
        # policies training
        for i in range(2**(beta-1), 2**beta):
            DQNs_update(i)

With regard to the second step, just before the trajectories are run, a new object of the UCB1 class is instantiated and initialized. Then, a while loop runs trajectories until the epoch (of exponentially increasing size) is over; inside it, the UCB1 object chooses which algorithm will run the next trajectory. During the trajectory, the actions are selected by dqns[best_dqn]:

        ucb1 = UCB1(dqns, xi)
        list_bests = []
        beta += 1
        ep_rew = []

        while step_count < 2**beta:
            best_dqn = ucb1.choose_algorithm()
            list_bests.append(best_dqn)

            g_rew = 0
            done = False

            while not done:
                # Epsilon decay
                if eps > end_explor:
                    eps -= eps_decay

                act = eps_greedy(np.squeeze(dqns[best_dqn].act(obs)), eps=eps)
                obs2, rew, done, _ = env.step(act)
                buffer.add(obs, rew, act, obs2, done)

                obs = obs2
                g_rew += rew
                step_count += 1

After each rollout, ucb1 is updated with the RL return that was obtained in the last trajectory. Moreover, the environment is reset, and the reward of the current trajectory is appended to a list in order to keep track of all the rewards:

            ucb1.update(best_dqn, g_rew)

            obs = env.reset()
            ep_rew.append(g_rew)
            g_rew = 0
            episode += 1

That's all for the ESBAS function.

UCB1 is made up of a constructor that initializes the attributes needed to compute (12.3); a choose_algorithm() method that returns the current best algorithm in the portfolio, according to (12.3); and update(idx_algo, traj_return), which updates the average reward of the idx_algo algorithm with the last return obtained, following (12.4). The code is as follows:
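As a reminder, and reconstructing the formulas directly from the code that follows (the numbered equations themselves are not reproduced here), the selection rule (12.3) and the running-average update (12.4) correspond to:

$$k^{*} = \arg\max_{k}\left(\bar{x}_{k} + \sqrt{\frac{\xi \ln n}{n_{k}}}\right), \qquad \bar{x}_{k} \leftarrow \frac{n_{k}\,\bar{x}_{k} + R}{n_{k} + 1}$$

where $n$ is the total number of trajectories run so far, $n_k$ is the number of times algorithm $k$ has been selected, $\bar{x}_k$ is its average return, and $R$ is the return of the last trajectory.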

class UCB1:
    def __init__(self, algos, epsilon):
        self.n = 0                        # total number of trajectories run so far
        self.epsilon = epsilon            # the exploration coefficient (xi)
        self.algos = algos
        self.nk = np.zeros(len(algos))    # how many times each algorithm has been selected
        self.xk = np.zeros(len(algos))    # average return obtained by each algorithm

    def choose_algorithm(self):
        # Pick the algorithm that maximizes the upper confidence bound (12.3)
        return np.argmax([self.xk[i] + np.sqrt(self.epsilon * np.log(self.n) / self.nk[i]) for i in range(len(self.algos))])

    def update(self, idx_algo, traj_return):
        # Incrementally update the average return of the chosen algorithm (12.4)
        self.xk[idx_algo] = (self.nk[idx_algo] * self.xk[idx_algo] + traj_return) / (self.nk[idx_algo] + 1)
        self.nk[idx_algo] += 1
        self.n += 1

With the code at hand, we can now test it on an environment and see how it performs.
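For example, a call along the following lines would run ESBAS with a portfolio of three differently sized networks (the environment name and the hyperparameter values here are purely illustrative, not necessarily the exact configuration used for the experiments in this chapter):

if __name__ == '__main__':
    # Illustrative call: a portfolio of three DQNs that differ only in
    # their hidden layer sizes (values chosen for illustration)
    ESBAS('Acrobot-v1', hidden_sizes=[[64], [16,16], [64,64]], num_epochs=20000)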
