Under the hood of ESBAS

The paper that proposes ESBAS tests the algorithm in both batch and online settings. In the remainder of the chapter, we'll focus primarily on the former. The two algorithms are very similar, and if you are interested in the pure online version, you can find a further explanation of it in the paper. In the true online setting, the AS is renamed sliding stochastic bandit AS (SSBAS), as it learns from a sliding window of the most recent selections. But let's start from the foundations.

The first thing to say about ESBAS is that it is based on the UCB1 strategy, and that it uses this bandit-style selection to choose an off-policy algorithm from the fixed portfolio. In particular, ESBAS can be broken down into three main parts that work as follows:

  1. It cycles across epochs of exponentially increasing size. Inside each epoch, the first thing it does is update all of the off-policy algorithms available in the portfolio, using the data that has been collected up to that point in time (at the first epoch, the dataset is empty). The other thing it does is reset the meta-algorithm.
  2. Then, during the epoch, the meta-algorithm computes the optimistic guess, following formula (12.3), in order to choose the off-policy algorithm (among those in the portfolio) that will control the next trajectory, so as to minimize the total regret. The trajectory is then run with that algorithm, and all of its transitions are collected and added to the dataset that the off-policy algorithms will later use to train their policies.
  3. When a trajectory comes to an end, the meta-algorithm updates the mean reward of that particular off-policy algorithm with the RL return obtained from the environment, and increments its counter. UCB1 uses the average reward and the counter to compute the upper confidence bound, as per formula (12.2); these values determine the next off-policy algorithm that will roll out the next trajectory (a minimal Python sketch of this selection rule follows this list).
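
To make the selection step concrete, here is a minimal Python sketch of a UCB1 pick over the portfolio, assuming the common exploration form x̂_a + ξ sqrt(log(Σ_a' n_a') / n_a); the function name ucb1 and the xi argument are our own and are not taken from the original text.

---------------------------------------------------------------------------------
UCB1 selection (sketch)
---------------------------------------------------------------------------------

import numpy as np

def ucb1(mean_returns, counts, xi=1.0):
    # mean_returns: per-algorithm average RL return (the x values in the text)
    # counts:       per-algorithm number of selections (the n values in the text)
    # xi:           exploration hyperparameter
    mean_returns = np.asarray(mean_returns, dtype=np.float64)
    counts = np.asarray(counts, dtype=np.float64)

    # Try every algorithm at least once before trusting the confidence bound
    if np.any(counts == 0):
        return int(np.argmin(counts))

    total = counts.sum()
    # Upper confidence bound: mean return plus an exploration bonus
    ucb = mean_returns + xi * np.sqrt(np.log(total) / counts)
    return int(np.argmax(ucb))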

To give you a better view of the algorithm, we've also provided the pseudocode of ESBAS in the following code block:

---------------------------------------------------------------------------------
ESBAS
---------------------------------------------------------------------------------

Initialize policy π_a for every algorithm a in the portfolio P
Initialize empty dataset D

for β = 0, 1, 2, ... do
    for a in P do
        Learn policy π_a on D with algorithm a

    Initialize AS variables: n_a = 0 and x̂_a = 0 for every a ∈ P

    for t = 2^β to 2^(β+1) - 1 do
        > Select the best algorithm according to UCB1

        a_max = argmax_a ( x̂_a + ξ sqrt( log(Σ_a' n_a') / n_a ) )

        Generate trajectory τ with policy π_(a_max) and add transitions to D

        > Update the average return and the counter of a_max

        x̂_(a_max) = ( n_(a_max) x̂_(a_max) + R(τ) ) / ( n_(a_max) + 1 )
        n_(a_max) = n_(a_max) + 1                                          (12.4)

Here, ξ is a hyperparameter, R(τ) is the RL return obtained during the trajectory, n_(a_max) is the counter of algorithm a_max, and x̂_(a_max) is its mean return.
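
Putting the pieces together, the following Python sketch mirrors the pseudocode above and reuses the ucb1 helper sketched earlier. It is only an illustration of the loop structure: the portfolio interface (each algorithm exposing a hypothetical train(dataset) method that returns a policy) and the generate_trajectory callable (returning the transitions and the RL return of one episode) are our own assumptions, not part of the original text.

---------------------------------------------------------------------------------
ESBAS loop (sketch)
---------------------------------------------------------------------------------

import numpy as np

def esbas(portfolio, generate_trajectory, num_epochs=10, xi=1.0):
    # portfolio:           list of off-policy algorithms; each is assumed to
    #                      expose train(dataset) -> policy (hypothetical interface)
    # generate_trajectory: callable(policy) -> (transitions, rl_return),
    #                      a hypothetical environment wrapper
    dataset = []

    for beta in range(num_epochs):
        # 1. Re-train every algorithm in the portfolio on the data gathered so far
        policies = [algo.train(dataset) for algo in portfolio]

        # 2. Reset the meta-algorithm (bandit) statistics for the new epoch
        counts = np.zeros(len(portfolio))
        means = np.zeros(len(portfolio))

        # Epochs have exponentially growing length: 2**beta trajectories
        for _ in range(2 ** beta):
            # Select the algorithm with the highest UCB (see the ucb1 sketch above)
            a = ucb1(means, counts, xi)

            # Roll out one trajectory with the selected policy and store the data
            transitions, rl_return = generate_trajectory(policies[a])
            dataset.extend(transitions)

            # Incrementally update the mean return and the counter, as in (12.4)
            means[a] = (counts[a] * means[a] + rl_return) / (counts[a] + 1)
            counts[a] += 1

    return dataset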

As explained in the paper, online AS addresses four practical problems that are inherent to RL algorithms:

  1. Sample efficiency: The diversification of the policies provides an additional source of information that makes ESBAS sample efficient. Moreover, it combines properties from curriculum learning and ensemble learning.
  2. Robustness: The diversification of the portfolio provides robustness against bad algorithms. 
  3. Convergence: ESBAS guarantees the minimization of the regret.
  4. Curriculum learning: AS is able to provide a sort of curriculum strategy, for example, by choosing easier, shallow models at the beginning, and deep models toward the end.