Q-learning and exploration

One problem we face with policy iteration models such as Q-learning is the trade-off between exploration and exploitation. The Q-learning equation assumes that the agent always picks the action with the maximum quality value, and we refer to this as exploitation (exploiting the model). The problem is that exploitation can corner an agent into a solution that only chases the best short-term rewards. Instead, we need to allow the agent some flexibility to explore the environment and learn on its own. We do this by introducing a decaying exploration factor into the training. Let's see how this looks by again opening up the Chapter_5_3.py example:

  1. Scroll down to the act and is_explore functions as shown:
def is_explore():
    global epsilon, epsilon_decay, epsilon_min
    # decay epsilon toward its minimum, then roll for exploration
    epsilon = max(epsilon - epsilon_decay, epsilon_min)
    if np.random.rand() < epsilon:
        return True
    else:
        return False

def act(action_space, state):
    # 0 - Left, 1 - Down, 2 - Right, 3 - Up
    global q_table
    if is_explore():
        # explore: sample a random action from the action space
        return action_space.sample()
    else:
        # exploit: pick the action with the highest Q value for this state
        return np.argmax(q_table[state])

  2. Note that the act function first tests whether the agent wants or needs to explore by calling is_explore(). In the is_explore function, we can see that the global epsilon value is decayed on each iteration by epsilon_decay down to a global minimum value, epsilon_min. When the agent starts training, its exploration epsilon is high, making it more likely to explore. Over time, as training progresses, epsilon decreases. We do this with the assumption that, over time, the agent will need to explore less and less. This trade-off between exploration and exploitation is quite important and something to understand with respect to the size of the environment's state space. We will see this trade-off explored more throughout this book.
    Note that when the agent does explore, it simply selects a random action from the action space.
  3. Finally, we get to the learn function. This is where the Q value is calculated, as follows:
def learn(state, action, reward, next_state):
    # Q(s, a) += alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    global q_table
    q_value = gamma * np.amax(q_table[next_state])  # discounted best future value
    q_value += reward                               # add the immediate reward
    q_value -= q_table[state, action]               # temporal-difference error
    q_value *= learning_rate                        # scale by the learning rate (alpha)
    q_value += q_table[state, action]               # add back the old estimate
    q_table[state, action] = q_value                # store the updated Q value
  4. Here, the update equation is broken into simple incremental steps, but this is where the value the agent will later exploit is calculated. A short worked example follows this list.
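To make the arithmetic concrete, here is a minimal sketch that runs the same incremental update with made-up numbers and checks that it matches the familiar one-line form of the Q-learning update. The hyperparameter values and the tiny sample table, state, action, and reward below are assumptions chosen purely for illustration; they are not taken from Chapter_5_3.py:

import numpy as np

# assumed values and a tiny 2-state, 2-action table, for illustration only
learning_rate, gamma = 0.1, 0.9
q_table = np.array([[0.2, 0.0],
                    [0.5, 0.3]])
state, action, reward, next_state = 0, 0, 1.0, 1

# incremental form, as in the learn function above
q_value = gamma * np.amax(q_table[next_state])   # 0.9 * 0.5 = 0.45
q_value += reward                                # 1.45
q_value -= q_table[state, action]                # 1.45 - 0.2 = 1.25 (the TD error)
q_value *= learning_rate                         # 0.125
q_value += q_table[state, action]                # 0.2 + 0.125 = 0.325
print(q_value)                                   # 0.325

# equivalent one-line form of the update
one_line = q_table[state, action] + learning_rate * (
    reward + gamma * np.amax(q_table[next_state]) - q_table[state, action])
print(np.isclose(q_value, one_line))             # True

The incremental version simply spreads the same update over several lines, which makes it easier to see the temporal-difference error before it is scaled by the learning rate.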

Keep the agent running until it finishes. We have just completed our first full reinforcement learning problem, albeit one with a finite state space. In the next section, we greatly expand our horizons and look at deep learning combined with reinforcement learning.
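Before moving on, here is a minimal sketch of how functions like act and learn can be tied together in a training loop. This is not the code from Chapter_5_3.py; the environment (Gym's FrozenLake, whose actions match the Left/Down/Right/Up comment above), the hyperparameter values, the episode count, and the classic Gym reset/step API (a single observation from reset and a 4-tuple from step) are all assumptions made to keep the sketch self-contained. Adjust the environment name and the reset/step handling for newer Gym or Gymnasium releases:

import numpy as np
import gym  # classic Gym API assumed; newer versions return extra values from reset/step

# assumed hyperparameter values, for illustration only
epsilon, epsilon_decay, epsilon_min = 1.0, 0.001, 0.01
learning_rate, gamma = 0.1, 0.9

env = gym.make("FrozenLake-v1")   # a small finite-state environment; swap in your own
q_table = np.zeros((env.observation_space.n, env.action_space.n))

def is_explore():
    global epsilon
    epsilon = max(epsilon - epsilon_decay, epsilon_min)
    return np.random.rand() < epsilon

def act(action_space, state):
    if is_explore():
        return action_space.sample()      # explore: random action
    return np.argmax(q_table[state])      # exploit: best known action

def learn(state, action, reward, next_state):
    q_table[state, action] += learning_rate * (
        reward + gamma * np.amax(q_table[next_state]) - q_table[state, action])

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        action = act(env.action_space, state)
        next_state, reward, done, _ = env.step(action)
        learn(state, action, reward, next_state)
        state = next_state

print(q_table)    # the learned Q values after training

The loop follows the same pattern as the chapter example: choose an action (exploring or exploiting), step the environment, update the Q table, and repeat until the episode ends.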
