Q-learning

Now, we come to the real meat of our system: the Q-value function, Q(s, a). This gives the cumulative expected reward for taking a given action, a, in a given state, s. We are, of course, interested in finding the optimal Q-value function. This means that on top of any given (s, a) pair, we have trainable parameters (the weights and biases of our DQN) that we update as we train our network. These parameters allow us to define an optimal policy, that is, a function that, for any given state, chooses the best of the actions available to the agent. This yields an optimal Q-value function, one that theoretically tells our agent the best course of action at any step. A bad football analogy might be the Q-value function as the coach yelling instructions into the rookie agent's ear.
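Formally, the optimal Q-value function is usually written via the Bellman optimality equation (a standard formulation, stated here for reference rather than taken from this text):

Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]

Here, r is the immediate reward, \gamma is the discount factor, and s' is the state that follows s after taking action a.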

So, when written in pseudocode, our quest for an optimal policy looks like this:

OptimalPolicy(state) = argmax_action Q(state, action, theta)

Here, theta refers to the trainable parameters of our DQN.
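In runnable terms, this greedy policy is just an argmax over the network's outputs (a sketch; optimal_policy is an illustrative name, and model stands for the DQN we define shortly):

import numpy as np

def optimal_policy(state, model):
    # state is a batch of one observation, shape (1, state_size).
    # Query the DQN for the Q-value of every action, then act greedily.
    q_values = model.predict(state, verbose=0)  # shape: (1, action_size)
    return int(np.argmax(q_values[0]))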

So, what is a DQN? Let's now examine the structure of our network in detail and, more importantly, how it is used. Here, we will bring in our Q-value functions and use our neural network to estimate the expected reward of each action available in a given state.
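As a minimal sketch, assuming a Keras-style implementation (build_dqn, the two hidden layers, and their widths are illustrative choices, not necessarily the author's exact architecture), such a network maps a state vector to one Q-value per action:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_dqn(state_size, action_size, learning_rate=0.001):
    # Input: a state vector; output: one estimated Q-value per action.
    model = Sequential([
        Dense(24, input_dim=state_size, activation='relu'),
        Dense(24, activation='relu'),
        Dense(action_size, activation='linear'),  # raw Q-values, unsquashed
    ])
    # Mean squared error against the Bellman targets, optimized with Adam.
    model.compile(loss='mse', optimizer=Adam(learning_rate=learning_rate))
    return model

The linear output layer matters here: Q-values are unbounded regression targets, so no activation should squash them.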

Like the networks we have covered so far, there are a number of hyperparameters we set upfront:

  • Gamma (the discount factor of future rewards, for example, 0.95)
  • Epsilon (the exploration-versus-exploitation rate; starting at 1.0 skews the agent fully toward exploration)
  • Epsilon decay (the rate at which the agent shifts toward exploiting learned knowledge over time, for example, 0.995)
  • Epsilon minimum (the floor below which epsilon stops decaying, for example, 0.01)
  • Learning rate (we still set this explicitly, even though the Adaptive Moment Estimation (Adam) optimizer adapts its step sizes as training progresses)
  • State size
  • Action size
  • Batch size (in powers of two; start with 32 and tune your way from there)
  • Number of episodes
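Collected in code, these settings might look as follows (a sketch; the variable names, and any values not given above, are illustrative assumptions):

gamma = 0.95            # discount factor for future rewards
epsilon = 1.0           # initial exploration rate: start fully exploratory
epsilon_decay = 0.995   # multiplicative shift toward exploitation
epsilon_min = 0.01      # floor below which epsilon stops decaying
learning_rate = 0.001   # base step size for Adam (assumed value)
state_size = 4          # e.g., CartPole's four-number observation (assumed)
action_size = 2         # e.g., CartPole's left/right actions (assumed)
batch_size = 32         # replay minibatch size; a power of two
n_episodes = 1000       # number of training episodes (assumed value)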

We also need a fixed sequential memory for the experience replay feature, sizing it at 2,000 entries.
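A double-ended queue with a fixed maximum length fits this requirement: once the buffer is full, the oldest experiences are discarded as new ones arrive (a sketch; remember is an illustrative helper name):

from collections import deque

memory = deque(maxlen=2000)  # fixed-size, sequential replay buffer

def remember(state, action, reward, next_state, done):
    # Store one transition for later random sampling during replay.
    memory.append((state, action, reward, next_state, done))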
