Q-learning and exploration

One problem we face with policy iteration models such as Q-learning is the trade-off between exploration and exploitation. The Q-learning equation assumes that the agent always picks the action with the maximum quality value, and we refer to this as exploitation (exploiting the model). The problem is that exploitation can corner an agent into a solution that only chases the best short-term rewards. Instead, we need to allow the agent some flexibility to explore the environment and learn on its own. We do this by introducing a decaying exploration factor into the training. Let's see how this looks by again opening up the Chapter_5_3.py example:

  1. Scroll down to the act and is_explore functions as shown:
def is_explore():
    global epsilon, epsilon_decay, epsilon_min
    # decay epsilon toward its minimum, then roll for exploration
    epsilon = max(epsilon - epsilon_decay, epsilon_min)
    if np.random.rand() < epsilon:
        return True
    else:
        return False

def act(action_space, state):
    # 0 - Left, 1 - Down, 2 - Right, 3 - Up
    global q_table
    if is_explore():
        # explore: sample a random action from the action space
        return action_space.sample()
    else:
        # exploit: pick the action with the highest Q value for this state
        return np.argmax(q_table[state])

  2. Note that the act function first tests whether the agent wants or needs to explore by calling is_explore(). In the is_explore function, we can see that the global epsilon value is decayed on each iteration by epsilon_decay down to a global minimum value, epsilon_min. When the agent starts training, its exploration epsilon is high, making it more likely to explore. Over time, as training progresses, epsilon decreases. We do this with the assumption that, over time, the agent will need to explore less and less. This trade-off between exploration and exploitation is quite important and something to understand with respect to the size of the environment's state space. We will see this trade-off explored more throughout this book.
    Note that when the agent does explore, it simply selects a random action from the action space.
  3. Finally, we get to the learn function. This is where the Q value is calculated, as follows:
def learn(state, action, reward, next_state):
    # Q(s, a) += alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    global q_table
    q_value = gamma * np.amax(q_table[next_state])  # discounted best future value
    q_value += reward                               # add the immediate reward
    q_value -= q_table[state, action]               # temporal-difference error
    q_value *= learning_rate                        # scale by the learning rate (alpha)
    q_value += q_table[state, action]               # add back the old estimate
    q_table[state, action] = q_value                # store the updated Q value
  4. Here, the update equation is broken into simple incremental steps, but this is where the value the agent will later exploit is calculated. A short worked example follows this list.
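To make the arithmetic concrete, here is a minimal sketch that runs the same incremental update with made-up numbers and checks that it matches the familiar one-line form of the Q-learning update. The hyperparameter values and the tiny sample table, state, action, and reward below are assumptions chosen purely for illustration; they are not taken from Chapter_5_3.py:

import numpy as np

# assumed values and a tiny 2-state, 2-action table, for illustration only
learning_rate, gamma = 0.1, 0.9
q_table = np.array([[0.2, 0.0],
                    [0.5, 0.3]])
state, action, reward, next_state = 0, 0, 1.0, 1

# incremental form, as in the learn function above
q_value = gamma * np.amax(q_table[next_state])   # 0.9 * 0.5 = 0.45
q_value += reward                                # 1.45
q_value -= q_table[state, action]                # 1.45 - 0.2 = 1.25 (the TD error)
q_value *= learning_rate                         # 0.125
q_value += q_table[state, action]                # 0.2 + 0.125 = 0.325
print(q_value)                                   # 0.325

# equivalent one-line form of the update
one_line = q_table[state, action] + learning_rate * (
    reward + gamma * np.amax(q_table[next_state]) - q_table[state, action])
print(np.isclose(q_value, one_line))             # True

The incremental version simply spreads the same update over several lines, which makes it easier to see the temporal-difference error before it is scaled by the learning rate.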

Keep the agent running until it finishes. We have just completed our first full reinforcement learning problem, albeit one with a finite state space. In the next section, we greatly expand our horizons and look at deep learning combined with reinforcement learning.
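Before moving on, here is a minimal sketch of how functions like act and learn can be tied together in a training loop. This is not the code from Chapter_5_3.py; the environment (Gym's FrozenLake, whose actions match the Left/Down/Right/Up comment above), the hyperparameter values, the episode count, and the classic Gym reset/step API (a single observation from reset and a 4-tuple from step) are all assumptions made to keep the sketch self-contained. Adjust the environment name and the reset/step handling for newer Gym or Gymnasium releases:

import numpy as np
import gym  # classic Gym API assumed; newer versions return extra values from reset/step

# assumed hyperparameter values, for illustration only
epsilon, epsilon_decay, epsilon_min = 1.0, 0.001, 0.01
learning_rate, gamma = 0.1, 0.9

env = gym.make("FrozenLake-v1")   # a small finite-state environment; swap in your own
q_table = np.zeros((env.observation_space.n, env.action_space.n))

def is_explore():
    global epsilon
    epsilon = max(epsilon - epsilon_decay, epsilon_min)
    return np.random.rand() < epsilon

def act(action_space, state):
    if is_explore():
        return action_space.sample()      # explore: random action
    return np.argmax(q_table[state])      # exploit: best known action

def learn(state, action, reward, next_state):
    q_table[state, action] += learning_rate * (
        reward + gamma * np.amax(q_table[next_state]) - q_table[state, action])

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        action = act(env.action_space, state)
        next_state, reward, done, _ = env.step(action)
        learn(state, action, reward, next_state)
        state = next_state

print(q_table)    # the learned Q values after training

The loop follows the same pattern as the chapter example: choose an action (exploring or exploiting), step the environment, update the Q table, and repeat until the episode ends.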
