Epochal stochastic bandit algorithm selection

The main use of exploration strategies in reinforcement learning is to help the agent explore the environment. We saw this use case in DQN with ε-greedy, and in other algorithms with the injection of additional noise into the policy. However, exploration strategies can be used in other ways. So, to better grasp the exploration concepts presented so far, and to introduce an alternative use case for these algorithms, we will present and develop an algorithm called ESBAS. This algorithm was introduced in the paper Reinforcement Learning Algorithm Selection.

ESBAS is a meta-algorithm for online algorithm selection (AS) in the context of reinforcement learning. It uses exploration methods in order to choose the best algorithm to employ during a trajectory, so as to maximize the expected reward.
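
To make the idea of bandit-driven selection concrete, here is a minimal sketch in which a UCB1 bandit picks one algorithm from a fixed portfolio before each trajectory and is then updated with the return of that trajectory. This is not the ESBAS algorithm itself: the epochal structure is left out, UCB1 is only one possible choice of bandit, and the names UCB1Selector, run_selection, and mean_returns, as well as the Gaussian stand-in for a real rollout, are assumptions made for illustration.

```python
import math
import random


class UCB1Selector:
    """UCB1 bandit whose arms are the candidate algorithms of a portfolio."""

    def __init__(self, n_arms, c=2.0):
        self.c = c                    # exploration constant (assumed value)
        self.counts = [0] * n_arms    # trajectories run with each algorithm
        self.means = [0.0] * n_arms   # running mean return of each algorithm

    def select(self):
        # Try every algorithm once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.c * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return scores.index(max(scores))

    def update(self, arm, episode_return):
        # Incremental update of the mean return for the chosen algorithm.
        self.counts[arm] += 1
        self.means[arm] += (episode_return - self.means[arm]) / self.counts[arm]


def run_selection(mean_returns, n_episodes=2000, seed=0):
    """Simulate the selection loop; returns how often each arm was chosen."""
    rng = random.Random(seed)
    selector = UCB1Selector(len(mean_returns))
    for _ in range(n_episodes):
        arm = selector.select()
        # Hypothetical stand-in for running one trajectory with the selected
        # algorithm: a noisy return drawn around that algorithm's mean.
        episode_return = rng.gauss(mean_returns[arm], 0.1)
        selector.update(arm, episode_return)
    return selector.counts


if __name__ == "__main__":
    print(run_selection([0.2, 0.5, 0.8]))
```

In this toy setting, the selector should end up assigning most trajectories to the algorithm with the highest average return, while still sampling the others occasionally. In a real setup, the simulated return would be replaced by an actual rollout of the chosen algorithm's policy in the environment.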

In order to better explain ESBAS, we'll first explain what algorithm selection is and how it can be used in machine learning and reinforcement learning. Then, we'll focus on ESBAS, and give a detailed description of its inner workings, while also providing its pseudocode. Finally, we'll implement ESBAS and test it on an environment called Acrobot. 
