Approaches to exploration

Put simply, the multi-armed bandit problem, and in general every exploration problem, can be tackled either with random strategies or with smarter techniques. The best-known algorithm in the first category is ε-greedy, whereas optimistic exploration, such as UCB, and posterior exploration, such as Thompson sampling, belong to the second category. In this section, we'll look in particular at the ε-greedy and UCB strategies.
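As a rough illustration of the two action-selection rules, here is a minimal sketch for a stationary multi-armed bandit. The function names, the use of NumPy, and the UCB1 exploration constant are our own illustrative choices, not something prescribed by the text:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a random arm; otherwise pick the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random arm
    return int(np.argmax(q_values))               # exploit: current best estimate

def ucb1_action(q_values, counts, t):
    """Pick the arm maximizing the optimistic estimate Q(a) + sqrt(2*ln(t) / N(a))."""
    counts = np.asarray(counts, dtype=float)
    # Play every arm at least once first, to avoid division by zero in the bonus
    if np.any(counts == 0):
        return int(np.argmin(counts))
    bonus = np.sqrt(2.0 * np.log(t) / counts)
    return int(np.argmax(np.asarray(q_values) + bonus))
```

Note how the UCB bonus shrinks as an arm is pulled more often, so uncertainty, rather than pure chance, drives exploration.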

It's all about balancing risk and reward. But how can we measure the quality of an exploration algorithm? Through regret. Regret is defined as the opportunity lost in one step; that is, the regret, $L_t$, at time $t$, is as follows:

$$L_t = \mathbb{E}\left[V^* - Q(a_t)\right]$$

Here, $V^*$ denotes the optimal value, and $Q(a_t)$ the action-value of the action taken at time $t$.

Thus, the goal is to find a trade-off between exploration and exploitation by minimizing the total regret accumulated over all the actions:

$$L_T = \mathbb{E}\left[\sum_{t=1}^{T} \left(V^* - Q(a_t)\right)\right]$$

Note that minimizing the total regret is equivalent to maximizing the cumulative reward. We'll use this idea of regret to compare how well exploration algorithms perform.
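To make the comparison concrete, here is a small sketch of how cumulative regret could be tracked on a simulated Bernoulli bandit. The arm probabilities, the sample-average update, and the seed are illustrative assumptions; setting ε = 1 turns ε-greedy into a purely random baseline:

```python
import numpy as np

def run_epsilon_greedy(true_means, epsilon=0.1, steps=1000, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit and return the cumulative regret."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    q = np.zeros(n_arms)          # sample-average estimates Q(a)
    counts = np.zeros(n_arms)     # N(a), number of pulls per arm
    v_star = max(true_means)      # optimal value V*
    total_regret = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))        # explore
        else:
            a = int(np.argmax(q))                # exploit
        reward = float(rng.random() < true_means[a])
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]      # incremental sample average
        total_regret += v_star - true_means[a]   # expected per-step regret V* - Q(a_t)
    return total_regret

# Hypothetical arm probabilities, chosen only for illustration
arms = [0.2, 0.5, 0.75]
print("epsilon-greedy regret:", run_epsilon_greedy(arms, epsilon=0.1))
print("purely random regret:", run_epsilon_greedy(arms, epsilon=1.0))
```

Over the same number of steps, the smarter strategy should accumulate noticeably less regret than the random baseline, which is exactly the comparison we'll rely on in what follows.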
