Developing the ESBAS Algorithm

By now, you are capable of approaching RL problems in a systematic and concise way. You are able to design and develop RL algorithms specifically for the problem at hand and get the most from the environment. Moreover, in the previous two chapters, you learned about algorithms that go beyond RL, but that can be used to solve the same set of tasks. 

At the beginning of this chapter, we'll present a dilemma that we have already encountered in many of the previous chapters; namely, the exploration-exploitation dilemma. We have already presented potential solutions for the dilemma throughout the book (such as the ε-greedy strategy), but we want to give you a more comprehensive outlook on the problem, and a clearer view of the algorithms that solve it. Many of them, such as the upper confidence bound (UCB) algorithm, are more sophisticated and perform better than the simple heuristics that we have used so far, such as the ε-greedy strategy. We'll illustrate these strategies on a classic problem, known as the multi-armed bandit problem. Despite being a simple tabular game, we'll use it as a starting point to then illustrate how these strategies can also be employed on non-tabular and more complex tasks.
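To make the comparison between the two strategies concrete, here is a minimal sketch (not the code developed later in the chapter) of ε-greedy and UCB action selection on a toy multi-armed bandit. The arm payoffs, the exploration constants, and the function names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hidden payoff of each arm (assumed)
n_arms = len(true_means)

counts = np.zeros(n_arms)                 # number of pulls per arm
values = np.zeros(n_arms)                 # running mean reward per arm

def epsilon_greedy(epsilon=0.1):
    """Explore a random arm with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(n_arms))
    return int(np.argmax(values))

def ucb(t, c=2.0):
    """Pick the arm with the highest upper confidence bound."""
    if np.any(counts == 0):               # play every arm once first
        return int(np.argmin(counts))
    return int(np.argmax(values + c * np.sqrt(np.log(t) / counts)))

for t in range(1, 1001):
    arm = ucb(t)                           # or: epsilon_greedy()
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

print("estimated arm values:", values.round(2))
```

The difference between the two is in how they explore: ε-greedy explores uniformly at random with a fixed probability, while UCB directs exploration toward arms whose value estimates are still uncertain, which is the idea we'll build on throughout the chapter.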

This introduction to the exploration-exploitation dilemma offers a general overview of the main methods that many recent RL algorithms employ in order to solve very hard exploration environments. We'll also provide a broader view of the applicability of this dilemma when solving other kinds of problems. To demonstrate this, we'll develop a meta-algorithm called epochal stochastic bandit algorithm selection, or ESBAS, which tackles the problem of online algorithm selection in the context of RL. ESBAS does this by using the ideas and strategies that emerged from the multi-armed bandit problem to select, on each episode, the RL algorithm that maximizes the expected return.

The following topics will be covered in this chapter:

  • Exploration versus exploitation
  • Approaches to exploration
  • Epochal stochastic bandit algorithm selection