Deep neural networks and Q-learning

The Q-learning algorithm, as we saw in Chapter 4, Q-Learning and SARSA Applications, has many qualities that enable its application in many real-world contexts. A key ingredient of this algorithm is that it uses the Bellman equation to learn the Q-function. The Bellman equation, as used by Q-learning, updates a Q-value from the values of the subsequent state-action pairs. This means that the algorithm can learn at every step, without waiting for the trajectory to complete. Also, every state or state-action pair has its own value stored in a lookup table, which saves and retrieves the corresponding entries. Because it is designed in this way, Q-learning converges to the optimal values as long as all the state-action pairs are repeatedly sampled. Furthermore, the method uses two policies: a non-greedy behavior policy to gather experience from the environment (for example, ε-greedy) and a greedy target policy that always follows the maximum Q-value.
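To make this concrete, the following is a minimal sketch of the tabular update just described, written in plain Python. The hyperparameter values and the assumption that states are hashable keys are illustrative choices, not prescriptions from this book:

```python
import random
from collections import defaultdict

n_actions = 4
# Lookup table: one list of action values per state (states must be hashable keys).
Q = defaultdict(lambda: [0.0] * n_actions)

def epsilon_greedy(state, epsilon=0.1):
    """Behavior policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])

def q_learning_update(state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One Bellman backup: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    # The target policy is greedy: it bootstraps from the best Q-value of the next state,
    # so the update can be applied at every step, without waiting for the episode to end.
    target = reward if done else reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])
```

Note how the behavior policy (epsilon_greedy) and the target policy (the max inside the update) are distinct, which is what makes Q-learning an off-policy algorithm.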

Maintaining a tabular representation of values is impractical and, in some cases, simply not feasible. That's because most problems have a very large number of states and actions. For example, images (even small ones) have more possible states than there are atoms in the universe. You can easily guess that, in this situation, tables cannot be used. Besides the prohibitive amount of memory that storing such a table would require, only a few states would ever be visited more than once, making it extremely difficult to learn the Q-function or the V-function. Thus, we want to generalize across states. In this case, generalization means that we are not only interested in the precise value of a state, V(s), but also in the values of similar, nearby states. If a state has never been visited, we can approximate its value with that of a nearby state. Generally speaking, the concept of generalization is incredibly important in all of machine learning, including reinforcement learning.
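One common way to obtain this kind of generalization is to replace the lookup table with a parametric function approximator, such as a neural network, that maps an observation to one Q-value per action. The following is a minimal sketch, assuming PyTorch and a vector-valued observation; the layer sizes and dimensions are placeholders chosen for illustration:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Parametric replacement for the lookup table: Q(s; theta) returns one value per action.

    Because all states share the same weights, similar inputs produce similar outputs,
    so the network can produce a sensible value even for states it has never visited.
    """
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Usage: Q-values for a single observation (dimensions here are purely illustrative).
q_net = QNetwork(obs_dim=8, n_actions=4)
q_values = q_net(torch.randn(1, 8))  # shape (1, 4): one Q-value per action
```

The memory cost is now fixed by the number of network parameters rather than by the number of states, which is what makes this approach viable for huge or continuous state spaces.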

The concept of generalization is also fundamental in circumstances where the agent doesn't have a complete view of the environment. In this case, the full state of the environment is hidden from the agent, which has to make decisions based solely on a restricted representation of the environment. This representation is known as an observation. For example, think about a humanoid agent dealing with basic interactions in the real world. Obviously, it doesn't have a view of the complete state of the universe and of all its atoms. It only has a limited viewpoint, that is, an observation, which is perceived by its sensors (such as video cameras). For this reason, the humanoid agent should generalize from what's happening around it and behave accordingly.
