Contextual bandits

We can now extend the single multi-armed bandit problem to a problem with multiple multi-armed bandits, each with its own set of arms. This introduces context, or state, into the equation: each bandit defines its own context/state, and we now evaluate quality in terms of both state and action. Our modified equation is shown here:

Q(s, a) = Q(s, a) + alpha * (r - Q(s, a))

Consider the terms in the equation:
  •  Q = table/matrix of quality values, for example:
         [1, 2, 3, 4]
         [2, 3, 4, 5]
         [4, 2, 1, 4]
  •  s = state
  •  a = action
  •  alpha = learning rate
  •  r = reward
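
To see how this update behaves, here is a quick worked example. This is only a sketch, assuming a single gold arm with a reward of 3, a starting quality of 0, and an alpha of 0.9, which matches the sample code that follows:

# Worked update for one state/action pair: Q(s,a) = Q(s,a) + alpha * (r - Q(s,a))
alpha = 0.9
q = 0.0       # starting quality value for this state/action pair
reward = 3    # assumed reward for pulling the gold arm

q = q + alpha * (reward - q)   # first pull:  0 + 0.9 * (3 - 0)     = 2.7
q = q + alpha * (reward - q)   # second pull: 2.7 + 0.9 * (3 - 2.7) = 2.97
print(q)                       # each update moves the quality closer to the reward of 3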

Let's open up Chapter_5_2.py and observe the following steps:

  1. Open the code up, as follows, and follow the changes made from the previous sample:
import random

# learning rate for the quality update
alpha = .9
# four bandits (states), each with four arms and a fixed reward per arm
bandits = [[['bronze', 1], ['gold', 3], ['silver', 2], ['bronze', 1]],
           [['bronze', 1], ['gold', 3], ['silver', 2], ['bronze', 1]],
           [['bronze', 1], ['gold', 3], ['silver', 2], ['bronze', 1]],
           [['bronze', 1], ['gold', 3], ['silver', 2], ['bronze', 1]]]
# quality table: one row per bandit (state), one column per arm (action)
q = [[0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

for i in range(10):
    for b in range(len(bandits)):
        # pick a random arm to pull on this bandit
        arm = random.randint(0, 3)
        print('pulling arm {0} on bandit {1}'.format(arm, b))
        # Q(s,a) = Q(s,a) + alpha * (r - Q(s,a))
        q[b][arm] = q[b][arm] + alpha * (bandits[b][arm][1] - q[b][arm])

print(q)
  2. This code sets up a number of multi-armed bandits, each with its own set of arms. It then loops through a number of iterations, and on each iteration it also loops through every bandit. For each bandit, it picks a random arm to pull and updates that arm's quality value.
  3. Run the sample and look at the output of q. Note how, even though the arms are pulled at random, the quality values converge so that the gold arm, the arm with the highest reward, consistently ends up with the highest value for every bandit.
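
If you want to see which arm the learned quality values favor, you could add a few lines like the following after the loop. This is a hypothetical addition, not part of Chapter_5_2.py; it assumes the q and bandits lists from the sample above:

for b in range(len(q)):
    # for each bandit (state), pick the arm with the highest learned quality value
    best_arm = max(range(len(q[b])), key=lambda a: q[b][a])
    print('bandit {0}: best arm is {1} ({2})'.format(b, best_arm, bandits[b][best_arm][0]))

Once every arm has been pulled at least once, the gold arm (index 1) should come out on top for every bandit, since it carries the highest reward.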

Feel free to play around with this sample some more and look to the exercises for additional inspiration. We will expand on the complexity of our RL problems when we discuss Q-Learning. However, before we get to that section, we will take a quick diversion and look at setting up the OpenAI Gym in order to conduct more RL experiments.
