Exploration complexity

We saw how UCB, and in particular UCB1, can reduce the overall regret and achieve optimal convergence on the multi-armed bandit problem with a relatively simple algorithm. However, the multi-armed bandit is a simple, stateless task.
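As a quick refresher, a minimal version of the UCB1 selection rule could look like the following sketch. The helper name, the NumPy-based implementation, and the exploration constant c are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

def ucb1_action(q_values, counts, t, c=1.0):
    """Pick the arm maximizing Q(a) + c * sqrt(ln(t) / N(a)).

    q_values: empirical mean reward of each arm
    counts:   number of times each arm has been pulled
    t:        total number of pulls so far
    """
    counts = np.asarray(counts, dtype=float)
    # Play every arm at least once before applying the confidence bound
    if np.any(counts == 0):
        return int(np.argmin(counts))
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(np.asarray(q_values) + bonus))
```

The bonus term shrinks as an arm is pulled more often, so rarely tried arms keep getting a chance until the uncertainty about their value is small.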

So, how does UCB perform on more complex tasks? To answer this question, we can, at the risk of oversimplifying, group problems into three main categories:

  • Stateless problems: The multi-armed bandit is an instance of this category. Exploration in such cases can be handled with a more sophisticated algorithm, such as UCB1.
  • Small-to-medium tabular problems: As a general rule, exploration can still be handled with more advanced mechanisms, but in some cases, the overall benefit is small and not worth the additional complexity.
  • Large non-tabular problems: We are now in much more complex environments. In these settings, the picture isn't yet well defined, and researchers are still actively working to find the best exploration strategy. The reason is that as the complexity increases, optimal methods such as UCB become intractable; for example, UCB cannot deal with problems with continuous states. However, we don't have to throw everything away, and we can use the exploration algorithms studied in the multi-armed bandit context as inspiration. In fact, there are many approaches that approximate optimal exploration methods and also work well in continuous environments. For example, count-based approaches such as UCB have been adapted to problems with infinite states by assigning similar counts to similar states (a minimal sketch of this idea follows this list). Algorithms of this kind have also achieved significant improvements in very difficult environments, such as Montezuma's Revenge. Still, in the majority of RL contexts, the additional complexity that these approaches involve isn't worth it, and simpler random strategies such as ε-greedy work just fine.
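To make the idea of "similar counts for similar states" concrete, here is a minimal sketch of a count-based exploration bonus that maps a continuous state to a discrete grid cell so that nearby states share a visit count. The class name, the grid-hashing scheme, and the 1/sqrt(N) bonus scale are assumptions chosen for illustration, not the exact method used by any specific algorithm:

```python
import numpy as np
from collections import defaultdict

class CountBonus:
    """Count-based exploration bonus for continuous states.

    Nearby states fall into the same discrete cell, so they share a
    visit count, and the agent receives an intrinsic reward that is
    proportional to 1 / sqrt(N(cell)).
    """
    def __init__(self, cell_size=0.1, beta=0.1):
        self.cell_size = cell_size   # coarseness of the discretization
        self.beta = beta             # scale of the intrinsic reward
        self.counts = defaultdict(int)

    def _cell(self, state):
        # Simple grid hashing: similar states map to the same cell
        return tuple(np.floor(np.asarray(state) / self.cell_size).astype(int))

    def bonus(self, state):
        cell = self._cell(state)
        self.counts[cell] += 1
        return self.beta / np.sqrt(self.counts[cell])

# Usage sketch: add the bonus to the environment reward before learning
# total_reward = reward + bonus_model.bonus(next_state)
```

The bonus is large for rarely visited regions of the state space and decays as they are visited more often, mimicking the confidence bound that UCB uses for discrete arms.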
It's also worth noting that, although we only outlined count-based approaches to exploration such as UCB1, there are two other sophisticated ways to deal with exploration that also achieve optimal regret. The first is called posterior sampling (Thompson sampling is an example), and is based on sampling from a posterior distribution over the value of each action; the second is called information gain, and relies on an internal measure of uncertainty obtained by estimating entropy.
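As an illustration of posterior sampling, the following is a minimal Thompson sampling sketch for a Bernoulli bandit with Beta priors. The function name and the simulated-bandit setup are assumptions made for the example:

```python
import numpy as np

def thompson_sampling(bandit_probs, n_steps=1000, seed=0):
    """Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors.

    bandit_probs: true success probability of each arm (used only to
                  simulate rewards in this example)
    """
    rng = np.random.default_rng(seed)
    n_arms = len(bandit_probs)
    alpha = np.ones(n_arms)  # prior successes + 1
    beta = np.ones(n_arms)   # prior failures + 1
    total_reward = 0.0
    for _ in range(n_steps):
        # Sample a plausible mean reward for each arm from its posterior
        samples = rng.beta(alpha, beta)
        arm = int(np.argmax(samples))
        reward = float(rng.random() < bandit_probs[arm])
        # Update the Beta posterior of the chosen arm
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        total_reward += reward
    return total_reward
```

Instead of adding an explicit bonus, the algorithm explores because arms with uncertain posteriors occasionally produce high samples and therefore keep being selected until their uncertainty shrinks.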