Approaches to exploration

Put simply, the multi-armed bandit problem, and in general every exploration problem, can be tackled either with random strategies or with smarter techniques. The best-known algorithm in the first category is ε-greedy, whereas optimistic exploration, such as UCB, and posterior exploration, such as Thompson sampling, belong to the second category. In this section, we'll look in particular at the ε-greedy and UCB strategies.
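As a rough illustration of the two action-selection rules, here is a minimal sketch for a stationary multi-armed bandit. The function names, the use of NumPy, and the UCB1 exploration constant are our own illustrative choices, not something prescribed by the text:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a random arm; otherwise pick the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random arm
    return int(np.argmax(q_values))               # exploit: current best estimate

def ucb1_action(q_values, counts, t):
    """Pick the arm maximizing the optimistic estimate Q(a) + sqrt(2*ln(t) / N(a))."""
    counts = np.asarray(counts, dtype=float)
    # Play every arm at least once first, to avoid division by zero in the bonus
    if np.any(counts == 0):
        return int(np.argmin(counts))
    bonus = np.sqrt(2.0 * np.log(t) / counts)
    return int(np.argmax(np.asarray(q_values) + bonus))
```

Note how the UCB bonus shrinks as an arm is pulled more often, so uncertainty, rather than pure chance, drives exploration.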

It's all about balancing risk and reward. But how can we measure the quality of an exploration algorithm? Through regret. Regret is defined as the opportunity lost in one step; that is, the regret, $L_t$, at time $t$, is as follows:

$$L_t = \mathbb{E}\left[V^* - Q(a_t)\right]$$

Here, $V^*$ denotes the optimal value, and $Q(a_t)$ the action-value of the action taken at time $t$.

Thus, the goal is to find a trade-off between exploration and exploitation by minimizing the total regret accumulated over all the actions:

$$L_T = \mathbb{E}\left[\sum_{t=1}^{T} \left(V^* - Q(a_t)\right)\right]$$

Note that minimizing the total regret is equivalent to maximizing the cumulative reward. We'll use this idea of regret to compare how well exploration algorithms perform.
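To make the comparison concrete, here is a small sketch of how cumulative regret could be tracked on a simulated Bernoulli bandit. The arm probabilities, the sample-average update, and the seed are illustrative assumptions; setting ε = 1 turns ε-greedy into a purely random baseline:

```python
import numpy as np

def run_epsilon_greedy(true_means, epsilon=0.1, steps=1000, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit and return the cumulative regret."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    q = np.zeros(n_arms)          # sample-average estimates Q(a)
    counts = np.zeros(n_arms)     # N(a), number of pulls per arm
    v_star = max(true_means)      # optimal value V*
    total_regret = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))        # explore
        else:
            a = int(np.argmax(q))                # exploit
        reward = float(rng.random() < true_means[a])
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]      # incremental sample average
        total_regret += v_star - true_means[a]   # expected per-step regret V* - Q(a_t)
    return total_regret

# Hypothetical arm probabilities, chosen only for illustration
arms = [0.2, 0.5, 0.75]
print("epsilon-greedy regret:", run_epsilon_greedy(arms, epsilon=0.1))
print("purely random regret:", run_epsilon_greedy(arms, epsilon=1.0))
```

Over the same number of steps, the smarter strategy should accumulate noticeably less regret than the random baseline, which is exactly the comparison we'll rely on in what follows.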
