Summary

In this chapter, we introduced a new family of RL algorithms that learn from experience gathered by interacting with the environment. These methods differ from dynamic programming in that they can learn a value function, and consequently a policy, without relying on a model of the environment.

Initially, we saw that Monte Carlo methods are a simple way to learn from samples of the environment, but because they need a complete trajectory before they can update their estimates, they are not applicable to many real problems. To overcome this drawback, bootstrapping can be combined with Monte Carlo methods, giving rise to so-called temporal difference (TD) learning. Thanks to bootstrapping, these algorithms can learn online (one step at a time) and reduce the variance while still converging to optimal policies. We then studied two one-step, tabular, model-free TD methods, namely SARSA and Q-learning. SARSA is on-policy because it updates an action value using the next action selected by the current policy (the behavior policy). Q-learning, instead, is off-policy because it estimates the action value of the greedy policy while collecting experience with a different policy (the behavior policy). This difference makes Q-learning slightly more robust and efficient than SARSA.
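
To make the distinction concrete, here is a minimal sketch of the two one-step updates on a tabular action-value table; the names (Q, alpha, gamma) and the NumPy representation are illustrative choices, not the chapter's exact implementation:

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: the target uses a_next, the action actually chosen
    # by the behavior policy in s_next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: the target uses the greedy (max) action in s_next,
    # regardless of what the behavior policy will actually do next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

The only change between the two functions is the bootstrap target, which is exactly what makes SARSA on-policy and Q-learning off-policy.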

Every TD method needs to explore the environment in order to learn it well and find the optimal policy. Exploration is the responsibility of the behavior policy, which occasionally has to act non-greedily, for example, by following an ε-greedy policy.
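
For illustration, an ε-greedy behavior policy over a tabular Q array might be sketched as follows (a minimal example assuming a NumPy table, not the chapter's exact code):

import numpy as np

def eps_greedy(Q, s, eps=0.1):
    # With probability eps, explore with a random action;
    # otherwise, exploit the current estimate by acting greedily.
    n_actions = Q.shape[1]
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))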

We implemented both SARSA and Q-learning and applied them to a tabular game called Taxi. We saw that both converge to the optimal policy with similar results.
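
As a rough sketch of how such an experiment can be put together, the following trains tabular Q-learning on Taxi; the environment ID ('Taxi-v3') and the classic Gym reset/step API are assumptions that may need adjusting to the Gym version you use:

import gym
import numpy as np

env = gym.make('Taxi-v3')  # assumed environment ID
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(5000):
    s = env.reset()  # classic Gym API: reset returns the observation
    done = False
    while not done:
        # epsilon-greedy behavior policy
        if np.random.rand() < eps:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)  # classic 4-tuple step API
        # Q-learning (off-policy) one-step update
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

Swapping the update line for the SARSA target (and selecting a_next with the same ε-greedy policy) turns this loop into the on-policy variant.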

The Q-learning algorithm is key in RL because it is model-free and off-policy. Moreover, through careful design, it can be adapted to work with very complex and high-dimensional games. All of this is possible thanks to the use of function approximators such as deep neural networks. In the next chapter, we'll elaborate on this and introduce a deep Q-network that can learn to play Atari games directly from pixels.
