Multi-armed bandit

The multi-armed bandit problem is the classic RL problem used to illustrate the exploration-exploitation dilemma. In this dilemma, an agent has to choose repeatedly from a fixed set of resources in order to maximize the expected reward. The name multi-armed bandit comes from a gambler playing multiple slot machines, each paying out a stochastic reward drawn from a different probability distribution. The gambler has to learn the best strategy in order to achieve the highest long-term reward.

This situation is illustrated in the following diagram. In this particular example, the gambler (the ghost) has to choose one of the five slot machines, all with different and unknown reward probabilities, in order to win the highest amount of money:

Example of a five-armed bandit problem
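
The following is a minimal sketch of such an environment, assuming five arms whose hidden payout rates are illustrative values picked for this example; the gambler never sees these means and only observes the sampled rewards:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hidden mean payout of each slot machine (illustrative values; unknown to the gambler).
true_means = np.array([0.2, 0.5, 0.8, 0.3, 0.6])

def pull(arm: int) -> float:
    """Pull one arm and return a stochastic reward (Gaussian around that arm's true mean)."""
    return float(rng.normal(loc=true_means[arm], scale=1.0))

# Pulling the same arm twice gives different rewards; only the long-run
# average reveals which machine actually pays best.
print(pull(2), pull(2))
```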

If you are wondering how the multi-armed bandit problem relates to more interesting tasks such as Montezuma's Revenge, the answer is that both come down to deciding whether, in the long run, the highest reward is obtained by trying new behaviors (pulling a new arm) or by continuing to do the best thing found so far (pulling the best-known arm). The main difference between the multi-armed bandit and Montezuma's Revenge is that, in the latter, the state of the agent changes at every step. In the multi-armed bandit problem, there is only one state and no sequential structure, meaning that past actions do not influence the future.

So, how can we find the right balance between exploration and exploitation in the multi-armed bandit problem?
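
One common way to strike this balance is an epsilon-greedy policy: with a small probability, the agent pulls a random arm (exploration); otherwise, it pulls the arm with the highest estimated value (exploitation). The following is only a sketch under assumed values; the payout means, the exploration rate, and the number of pulls are illustrative choices, and the reward draw mirrors the environment sketch shown earlier:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
true_means = np.array([0.2, 0.5, 0.8, 0.3, 0.6])  # hidden payout rates (illustrative)
n_arms = len(true_means)
epsilon = 0.1                                     # exploration rate (illustrative)

counts = np.zeros(n_arms)      # how many times each arm was pulled
estimates = np.zeros(n_arms)   # running average reward per arm

for step in range(1000):
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))            # explore: pick a random arm
    else:
        arm = int(np.argmax(estimates))            # exploit: pick the best arm found so far
    reward = rng.normal(true_means[arm], 1.0)      # stochastic payout from that machine
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean update

print("Estimated arm values:", estimates.round(2))
print("Best arm found:", int(np.argmax(estimates)))
```

With enough pulls, the running estimates tend to concentrate around the true means, so the greedy choice increasingly selects the best machine while the occasional random pull keeps checking the other arms.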
