Exploration and exploitation

The exploration-exploitation trade-off is another problem that apparently originated in gambling, even though its real applications range from the allocation of funding to research projects to self-driving cars. The traditional formulation is the multi-armed bandit problem, which refers to an imaginary slot machine with one or more arms. In the simplified model, sequential plays of each arm generate i.i.d. rewards from an unknown, arm-specific distribution, and the rewards are independent across the arms. The goal is to maximize the reward (for example, the amount of money won) and to minimize the learning loss, that is, the amount spent on the arms with a less-than-optimal winning rate, under an agreed-upon arm-selection policy. The obvious trade-off is between exploration, in search of the arm that produces the best return, and exploitation of the best-known arm, a trade-off captured by the regret:

$$ R_N \;=\; \max_{i=1,\dots,K} \sum_{t=1}^{N} X_{i,t} \;-\; \sum_{t=1}^{N} X_{I_t,t} $$

The pseudo-regret is then the difference between the optimal arm's expected cumulative reward and the expected reward of the arms actually played:

$$ \bar{R}_N \;=\; N\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{N} \mu_{I_t}\right], \qquad \mu^{*} = \max_{i=1,\dots,K} \mu_i $$

Here, \(I_t\) is the arm selected at round \(t\) of the \(N\) trials, \(X_{i,t}\) is the reward of arm \(i\) at round \(t\), and \(\mu_i\) is the mean reward of arm \(i\). The multi-armed bandit problem was extensively studied in the 1930s and again during the early 2000s, with applications in finance and ad tech. While, due to the stochastic nature of the problem, it is in general not possible to bound the expected regret better than \(O(\sqrt{N})\), the pseudo-regret can be controlled so that it is bounded by \(O(\log N)\) (Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems by Sébastien Bubeck and Nicolò Cesa-Bianchi, http://arxiv.org/pdf/1204.5721.pdf).
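To make these definitions concrete, here is a minimal Python sketch that computes the pseudo-regret of a pure-exploration (random) policy on simulated Bernoulli arms; the arm probabilities and the policy are purely illustrative assumptions, not taken from the text:

```python
import random

# Hypothetical Bernoulli arms: the success probabilities are illustrative
# and, in the real setting, unknown to the player.
probs = [0.20, 0.35, 0.50, 0.45, 0.30]
mu_star = max(probs)                   # mean reward of the optimal arm
N = 10_000

pulled_mean_total = 0.0
for t in range(N):
    i = random.randrange(len(probs))   # random (pure-exploration) policy
    pulled_mean_total += probs[i]      # accumulates mu_{I_t}

pseudo_regret = N * mu_star - pulled_mean_total
print(f"Pseudo-regret after {N} rounds: {pseudo_regret:.1f}")
```

For the random policy, the pseudo-regret grows linearly in \(N\); the strategies discussed next are designed to bring this growth down toward the logarithmic bound.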

One of the most common strategies used in practice is the \(\varepsilon\)-greedy strategy, where the best-known arm is chosen with probability \(1-\varepsilon\) and one of the other arms, uniformly at random, with the remaining probability \(\varepsilon\). The drawback of this approach is that we might spend a lot of exploration resources on arms that will never provide any reward. The Upper Confidence Bound (UCB) strategy improves on \(\varepsilon\)-greedy by choosing the arm with the largest estimated return plus some multiple or fraction of the standard deviation of the return estimate. The approach suffers from the approximations made to estimate the mean and the standard deviation, and it requires recomputing the estimates and the best arm before each successive pull, which might be a scalability problem.
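The following sketch shows both selection rules, assuming the per-arm pull counts and reward sums are tracked by the caller; the function names and the value \(\varepsilon = 0.1\) are illustrative, and the UCB variant shown is the classic UCB1 rule, whose \(\sqrt{2\ln t / n_i}\) exploration bonus stands in for an explicit standard-deviation estimate:

```python
import math
import random

def eps_greedy(counts, sums, eps=0.1):
    """Explore a random arm with probability eps; otherwise exploit the
    arm with the best empirical mean (probability 1 - eps)."""
    if min(counts) == 0 or random.random() < eps:
        return random.randrange(len(counts))
    return max(range(len(counts)), key=lambda i: sums[i] / counts[i])

def ucb1(counts, sums, t):
    """Pick the arm maximizing empirical mean plus a confidence width."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # play every arm once before using the bonus
    return max(range(len(counts)),
               key=lambda i: sums[i] / counts[i]
                             + math.sqrt(2.0 * math.log(t) / counts[i]))
```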

Finally, the Thompson sampling strategy draws a random sample from the Beta-Bernoulli posterior estimate of each arm's reward probability and plays the arm with the largest sampled value, that is, the arm that is optimal under the sampled parameters; the posterior counts are updated directly from the observed data, so no parameter recomputation is needed between pulls. Although the specific numbers may depend on the assumptions, one available comparison of the performance of these strategies is provided in the following diagram:

Figure 02-3. Simulation results for different exploration-exploitation strategies for K = 5 one-armed bandits.

Figure 02-3 shows simulation results for these strategies (taken from the Rich Relevance website at http://engineering.richrelevance.com/recommendations-thompson-sampling). The Random strategy allocates arms at random and corresponds to pure exploration. The Naive strategy is random up to a certain threshold and then switches to pure exploitation. The Upper Confidence Bound (UCB) strategy is run with a 95% confidence level, and UCB1 is a modification of UCB that takes the log-normality of the distributions into account. Finally, the Thompson sampling strategy draws a random sample from the actual posterior distribution to optimize the regret.
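A minimal sketch of the Beta-Bernoulli version of Thompson sampling follows; the uniform Beta(1, 1) prior and the function names are assumptions made for illustration:

```python
import random

def thompson_select(successes, failures):
    """Draw one sample from each arm's Beta posterior (Beta(1, 1) prior)
    and play the arm with the largest sampled success probability."""
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

def thompson_update(arm, reward, successes, failures):
    """Incremental Beta-Bernoulli posterior update: bump a single count."""
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1
```

In a simulation loop, thompson_select chooses the arm, the environment returns a 0/1 reward, and thompson_update adjusts the two counts; nothing else needs to be recomputed between pulls.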

Exploration/exploitation models are known to be very sensitive to initial conditions and outliers, particularly on the low-response side: one can spend an enormous number of trials on arms that are essentially dead.

Further improvements to these strategies are possible by estimating better priors from additional information, such as location, or by using that information to limit the set of \(K\) arms to explore, but these aspects are more domain-specific (as in personalization or online advertising).
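For example, with Thompson sampling one simple way to use such side information is to seed each arm's Beta prior from a historical response rate for the user's segment; this sketch and all of its numbers are hypothetical:

```python
import random

def seeded_prior(historical_rate, strength=20.0):
    """Return Beta pseudo-counts (alpha, beta) whose mean equals
    historical_rate, weighted as `strength` pseudo-observations."""
    return historical_rate * strength, (1.0 - historical_rate) * strength

alpha0, beta0 = seeded_prior(0.05)   # e.g., a 5% historical response rate
# Posterior sample after observing, say, 3 successes and 40 failures:
sample = random.betavariate(alpha0 + 3, beta0 + 40)
```

A stronger prior reduces early exploration of arms that the side information already marks as unpromising.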
