Under the hood of ESBAS

The paper that proposes ESBAS tests the algorithm in both batch and online settings. In the remainder of the chapter, we'll focus primarily on the former. The two algorithms are very similar, and if you are interested in the pure online version, you can find a further explanation of it in the paper. In the true online setting, the AS is renamed sliding stochastic bandit AS (SSBAS), as it learns from a sliding window of the most recent selections. But let's start from the foundations.

The first thing to say about ESBAS is that it is based on the UCB1 strategy, and that it uses this bandit-style selection to choose an off-policy algorithm from the fixed portfolio. In particular, ESBAS can be broken down into three main parts that work as follows:

  1. It cycles across epochs of exponentially increasing size. Inside each epoch, the first thing it does is update all of the off-policy algorithms available in the portfolio, using the data that has been collected up to that point in time (at the first epoch, the dataset is empty). The other thing it does is reset the meta-algorithm.
  2. Then, during the epoch, the meta-algorithm computes the optimistic guess, following formula (12.3), in order to choose the off-policy algorithm (among those in the portfolio) that will control the next trajectory, so as to minimize the total regret. The trajectory is then run with that algorithm, and all of its transitions are collected and added to the dataset that the off-policy algorithms will later use to train their policies.
  3. When a trajectory comes to an end, the meta-algorithm updates the mean reward of that particular off-policy algorithm with the RL return obtained from the environment, and increments its counter. UCB1 uses the average reward and the counter to compute the upper confidence bound, as per formula (12.2); these values determine the next off-policy algorithm that will roll out the next trajectory (a minimal Python sketch of this selection rule follows this list).
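
To make the selection step concrete, here is a minimal Python sketch of a UCB1 pick over the portfolio, assuming the common exploration form x̂_a + ξ sqrt(log(Σ_a' n_a') / n_a); the function name ucb1 and the xi argument are our own and are not taken from the original text.

---------------------------------------------------------------------------------
UCB1 selection (sketch)
---------------------------------------------------------------------------------

import numpy as np

def ucb1(mean_returns, counts, xi=1.0):
    # mean_returns: per-algorithm average RL return (the x values in the text)
    # counts:       per-algorithm number of selections (the n values in the text)
    # xi:           exploration hyperparameter
    mean_returns = np.asarray(mean_returns, dtype=np.float64)
    counts = np.asarray(counts, dtype=np.float64)

    # Try every algorithm at least once before trusting the confidence bound
    if np.any(counts == 0):
        return int(np.argmin(counts))

    total = counts.sum()
    # Upper confidence bound: mean return plus an exploration bonus
    ucb = mean_returns + xi * np.sqrt(np.log(total) / counts)
    return int(np.argmax(ucb))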

To give you a better view of the algorithm, we've also provided the pseudocode of ESBAS in the following code block:

---------------------------------------------------------------------------------
ESBAS
---------------------------------------------------------------------------------

Initialize policy π_a for every algorithm a in the portfolio P
Initialize empty dataset D

for β = 0, 1, 2, ... do
    for a in P do
        Learn policy π_a on D with algorithm a

    Initialize AS variables: n_a = 0 and x̂_a = 0 for every a ∈ P

    for t = 2^β to 2^(β+1) - 1 do
        > Select the best algorithm according to UCB1

        a_max = argmax_a ( x̂_a + ξ sqrt( log(Σ_a' n_a') / n_a ) )

        Generate trajectory τ with policy π_(a_max) and add transitions to D

        > Update the average return and the counter of a_max

        x̂_(a_max) = ( n_(a_max) x̂_(a_max) + R(τ) ) / ( n_(a_max) + 1 )
        n_(a_max) = n_(a_max) + 1                                          (12.4)

Here, ξ is a hyperparameter, R(τ) is the RL return obtained during the trajectory, n_(a_max) is the counter of algorithm a_max, and x̂_(a_max) is its mean return.
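
Putting the pieces together, the following Python sketch mirrors the pseudocode above and reuses the ucb1 helper sketched earlier. It is only an illustration of the loop structure: the portfolio interface (each algorithm exposing a hypothetical train(dataset) method that returns a policy) and the generate_trajectory callable (returning the transitions and the RL return of one episode) are our own assumptions, not part of the original text.

---------------------------------------------------------------------------------
ESBAS loop (sketch)
---------------------------------------------------------------------------------

import numpy as np

def esbas(portfolio, generate_trajectory, num_epochs=10, xi=1.0):
    # portfolio:           list of off-policy algorithms; each is assumed to
    #                      expose train(dataset) -> policy (hypothetical interface)
    # generate_trajectory: callable(policy) -> (transitions, rl_return),
    #                      a hypothetical environment wrapper
    dataset = []

    for beta in range(num_epochs):
        # 1. Re-train every algorithm in the portfolio on the data gathered so far
        policies = [algo.train(dataset) for algo in portfolio]

        # 2. Reset the meta-algorithm (bandit) statistics for the new epoch
        counts = np.zeros(len(portfolio))
        means = np.zeros(len(portfolio))

        # Epochs have exponentially growing length: 2**beta trajectories
        for _ in range(2 ** beta):
            # Select the algorithm with the highest UCB (see the ucb1 sketch above)
            a = ucb1(means, counts, xi)

            # Roll out one trajectory with the selected policy and store the data
            transitions, rl_return = generate_trajectory(policies[a])
            dataset.extend(transitions)

            # Incrementally update the mean return and the counter, as in (12.4)
            means[a] = (counts[a] * means[a] + rl_return) / (counts[a] + 1)
            counts[a] += 1

    return dataset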

As explained in the paper, online AS addresses four practical problems that are inherent to RL algorithms:

  1. Sample efficiency: The diversification of the policies provides an additional source of information that makes ESBAS sample efficient. Moreover, it combines properties from curriculum learning and ensemble learning.
  2. Robustness: The diversification of the portfolio provides robustness against bad algorithms. 
  3. Convergence: ESBAS guarantees the minimization of the regret.
  4. Curriculum learning: AS is able to provide a sort of curriculum strategy, for example, by choosing easier, shallow models at the beginning, and deep models toward the end.