The ε-greedy strategy

We have already explored the ideas behind the ε-greedy strategy and implemented it to aid exploration in algorithms such as Q-learning and DQN. It is a very simple approach, and yet it achieves very good performance on non-trivial tasks as well. This is the main reason behind its widespread use in many deep RL algorithms.

To refresh your memory, ε-greedy takes the best action most of the time, but from time to time, it selects a random action. The probability of choosing a random action is dictated by the ε value, which ranges from 0 to 1. That is, with probability 1-ε, the algorithm will exploit the best action, and with probability ε, it will explore the alternatives by selecting an action at random.
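
As a minimal sketch of this selection rule (the function name eps_greedy_action and the use of NumPy here are illustrative assumptions, not code taken from this chapter), it could be written as follows:

import numpy as np

def eps_greedy_action(q_values, eps, rng):
    """Return a random action with probability eps, otherwise the greedy action."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: action with the highest estimate

# Example usage with made-up action-value estimates
rng = np.random.default_rng(0)
action = eps_greedy_action(q_values=np.array([0.2, 0.5, 0.1]), eps=0.1, rng=rng)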

In the multi-armed bandit problem, the action values are estimated from past experience, by averaging the rewards obtained by taking those actions:

Q_t(a) = \frac{\sum_{i=1}^{t} r_i \, \mathbb{1}[a_i = a]}{N_t(a)}

In the preceding equation, N_t(a) is the number of times that action a has been picked up to time t, and \mathbb{1}[a_i = a] is a Boolean indicator of whether action a was chosen at time i. The bandit then acts according to the ε-greedy algorithm: it explores by choosing a random action, or exploits by picking the action with the highest Q value.
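
The sample average above can also be maintained incrementally, which avoids storing the full reward history. The following sketch is only an illustration, assuming a NumPy-based agent; the class name EpsGreedyBandit, its parameters, and the incremental update Q <- Q + (r - Q) / N are choices made here, not taken from the text.

import numpy as np

class EpsGreedyBandit:
    """Minimal eps-greedy bandit agent with sample-average value estimates."""

    def __init__(self, n_actions, eps=0.1, seed=0):
        self.eps = eps                        # exploration probability (kept fixed in this sketch)
        self.rng = np.random.default_rng(seed)
        self.q = np.zeros(n_actions)          # Q_t(a): running average of rewards per action
        self.n = np.zeros(n_actions)          # N_t(a): number of times each action was picked

    def act(self):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.q)))  # explore: random action
        return int(np.argmax(self.q))                   # exploit: highest estimated value

    def update(self, action, reward):
        # Incremental form of the sample average: Q <- Q + (r - Q) / N
        self.n[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n[action]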

A drawback of ε-greedy is that its expected total regret grows linearly with the number of timesteps, whereas the optimal expected total regret grows only logarithmically. This means that the ε-greedy strategy isn't optimal.

A simple way to approach optimality is to use an ε value that decays over time. In this way, the weight of exploration progressively vanishes, until only greedy actions are chosen. Indeed, in deep RL algorithms, ε-greedy is almost always combined with a linear or exponential decay of ε.
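
For example, two common schedules could look like the following sketch; the function names, start and end values, and decay constants are arbitrary illustrations, not values prescribed by any particular algorithm.

def linear_eps(step, eps_start=1.0, eps_end=0.01, decay_steps=10_000):
    """Linearly anneal eps from eps_start to eps_end over decay_steps, then hold."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def exponential_eps(step, eps_start=1.0, eps_end=0.01, decay_rate=0.999):
    """Exponentially decay eps toward eps_end."""
    return eps_end + (eps_start - eps_end) * (decay_rate ** step)

At each training step, the current ε is recomputed from the step counter and passed to the action-selection routine.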

That being said, ε and its decay rate are difficult to choose, and there are other strategies that solve the multi-armed bandit problem optimally.
