Exploration versus exploitation

The exploration-exploitation dilemma, or exploration-exploitation problem, affects many important domains. Indeed, it is not restricted to the RL context, but applies to everyday life as well. The idea behind this dilemma is to decide whether it is better to exploit the best solution known so far, or whether it's worth trying something new. Let's say you are buying a new book. You could either choose a title from your favorite author, or buy a book of the same genre that Amazon is suggesting to you. In the first case, you are confident about what you're getting, but by selecting the second option, you don't know what to expect. In the latter case, however, you could be pleasantly surprised, and end up reading a book that is even better than the one written by your favorite author.

This conflict between exploiting what you have already learned and exploring new options at some risk is also very common in reinforcement learning. The agent may have to sacrifice a short-term reward and explore a new part of the state space in order to achieve a higher long-term reward in the future.

All this may not sound new to you. In fact, we started dealing with this problem when we developed the first RL algorithm. Up until now, we have primarily adopted simple heuristics, such as the ε-greedy strategy, or followed a stochastic policy to decide whether to explore or exploit. Empirically, these strategies work very well, but there are other techniques that can achieve theoretically optimal performance.
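As a refresher, the following is a minimal sketch of ε-greedy action selection over a tabular Q-function. The array shape, the state indexing, and the eps value are illustrative assumptions, not the exact code used in the earlier chapters:

```python
import numpy as np

def eps_greedy(Q, state, eps=0.1):
    """Epsilon-greedy action selection over a tabular Q-function.

    Q is assumed to be a NumPy array of shape (n_states, n_actions).
    With probability eps, pick a random action (explore);
    otherwise, pick the action with the highest estimated value (exploit).
    """
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])  # explore: random action
    return int(np.argmax(Q[state]))           # exploit: greedy action
```

The single eps parameter is what makes this a heuristic: it fixes the amount of exploration in advance, regardless of how uncertain the agent actually is about each action, which is exactly the limitation the algorithms in this chapter try to overcome.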

In this chapter, we'll start with an explanation of the exploration-exploitation dilemma from the ground up, and introduce some exploration algorithms that achieve near-optimal performance on tabular problems. We'll also show how the same strategies can be adapted to non-tabular and more complex tasks.

For an RL algorithm, one of the most challenging Atari games to solve is Montezuma's Revenge, rendered in the following screenshot. The objective of the game is to score points by gathering jewels and killing enemies. The main character has to find all the keys needed to navigate the rooms of the labyrinth, and gather the tools required to move around, while avoiding obstacles. The sparse reward, the long time horizon, and the partial rewards that are not correlated with the end goal make the game very challenging for every RL algorithm. Indeed, these characteristics make Montezuma's Revenge one of the best environments for testing exploration algorithms:

Screenshot of Montezuma's Revenge

Let's start from the ground up, in order to give a complete overview of this area.
