The multi-armed bandit

The diagram we saw earlier describes the full RL problem, the form we will use for most of the rest of this book. However, we often teach a simpler, one-step variation of this problem called the multi-armed bandit. The name refers to the armed bandit, a nickname for the Vegas slot machine, and nothing more nefarious. We use these simpler scenarios to explain the basics of RL in the form of a one-step, or one-state, problem. 

In the case of the multi-armed bandit, picture a fictional Vegas slot machine with several arms, where each arm awards a different prize, but any given arm always pays out the same prize. The agent's goal in this scenario is to figure out which arm to pull every time. We can model this with an equation such as the one shown here:


V(a) = V(a) + α * (r − V(a))

In this equation:
  • V = vector of values, one for each action (1, 2, 3, 4)
  • a = action (the arm pulled)
  • α = alpha = learning rate
  • r = reward
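As a quick worked example, take the gold arm from the code that follows (reward r = 3), a learning rate of α = 0.9, and a starting value of 0. The first two updates give V = 0 + 0.9 × (3 − 0) = 2.7, and then V = 2.7 + 0.9 × (3 − 2.7) = 2.97, so the value converges rapidly toward the arm's reward of 3.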

This equation updates the value, V, for each action the agent takes. Each time an arm is pulled, the difference between the reward received and the current value is multiplied by the learning rate and added back into the value, so the value is gradually nudged toward the reward. These calculated values can be used to determine which arm to pull, but first the agent needs to pull each arm at least once. Let's quickly model this in code so that, as game/simulation programmers, we can see how it works. Open the Chapter_5_1.py code and follow these steps:

  1. The code for this exercise is as follows:
alpha = .9   # learning rate
# each arm is a [name, reward] pair; the reward for an arm never changes
arms = [['bronze', 1], ['gold', 3], ['silver', 2], ['bronze', 1]]
v = [0, 0, 0, 0]   # value estimate for each arm, initialized to zero

for i in range(10):               # repeat the full sweep 10 times
    for a in range(len(arms)):    # pull every arm once per sweep
        print('pulling arm ' + arms[a][0])
        v[a] = v[a] + alpha * (arms[a][1] - v[a])   # value update from the equation

print(v)   # final value estimates, one per arm

  2. This code creates the required setup variables: the arms (bronze, gold, silver, and bronze, each paired with its reward) and the value vector v (all zeros). Then, the code loops through a number of iterations (10) in which each arm is pulled and its value, v, is updated based on the equation. Note that the reward is taken from the arm itself, which is the term arms[a][1].
  3. Run the example, and you will see output for each arm pull, followed by the final value for each action, or in this case an arm pull; with these settings, each value converges toward its arm's reward.

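With these rewards, the learned values end up very close to [1, 3, 2, 1], so choosing the best arm is just a matter of taking the largest value. The following short sketch, added here for illustration and not part of Chapter_5_1.py, repeats the learning loop and then makes that greedy choice:

alpha = .9
arms = [['bronze', 1], ['gold', 3], ['silver', 2], ['bronze', 1]]
v = [0, 0, 0, 0]

# learn the values exactly as in the example above
for i in range(10):
    for a in range(len(arms)):
        v[a] = v[a] + alpha * (arms[a][1] - v[a])

# greedy choice: pull the arm with the highest learned value
best = max(range(len(v)), key=lambda a: v[a])
print('best arm: ' + arms[best][0])   # prints 'best arm: gold'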
As we saw, with a simple equation, we were able to model the multi-armed bandit problem and arrive at a solution that will allow an agent to consistently pull the correct arm. This sets the foundation for RL, and in the next section, we take the next step and look at contextual bandits.
