Intrinsic reward

A viable alternative is to develop a reward function that is intrinsic to the agent, meaning that it is determined exclusively by the agent's own beliefs about the world. This approach is close to the way newborns learn: they employ a purely explorative paradigm to navigate the world without any immediate benefit, yet the knowledge they acquire may prove useful later in life.

The intrinsic reward is a sort of exploration bonus that is based on an estimate of the novelty of a state: the more unfamiliar a state is, the higher the intrinsic reward. The agent is therefore incentivized to explore new parts of the environment. It may have become clear by now that the intrinsic reward can be used as an alternative exploration strategy. In fact, many algorithms combine it with the extrinsic reward (that is, the usual reward returned by the environment) to boost exploration in environments with very sparse rewards, such as Montezuma's Revenge. However, although the methods used to estimate the intrinsic reward are very similar to the exploration strategies we studied in Chapter 12, Developing ESBAS Algorithm (those strategies were still tied to the extrinsic reward), here we concentrate only on pure unsupervised exploration methods.
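
As a rough illustration of how the two signals are combined, the following minimal Python sketch adds a scaled exploration bonus to the environment reward. The compute_intrinsic_reward function and the beta coefficient are hypothetical placeholders used only for this example, not part of any specific library:

```python
# Minimal sketch: combine the extrinsic reward with a scaled intrinsic bonus.
# `compute_intrinsic_reward` and `beta` are illustrative placeholders.

def total_reward(extrinsic_reward, next_state, compute_intrinsic_reward, beta=0.01):
    """Return the reward the agent is actually trained on.

    beta scales the exploration bonus relative to the environment reward;
    setting extrinsic_reward to 0 recovers the pure curiosity-driven setting.
    """
    intrinsic_reward = compute_intrinsic_reward(next_state)
    return extrinsic_reward + beta * intrinsic_reward
```

Dropping the extrinsic term entirely gives the pure curiosity-driven setting discussed in the rest of this section.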

Two primary curiosity-driven strategies that provide a reward for unfamiliar states and explore the environment efficiently are count-based and dynamics-based strategies (both are sketched in the code that follows this list):

  • Count-based strategies (also known as visitation count strategies) aim to count, or estimate, how many times each state has been visited and encourage the exploration of states with low visitation counts by assigning a high intrinsic reward to them.
  • Dynamics-based strategies train a dynamics model of the environment alongside the agent's policy and compute the intrinsic reward from the prediction error, the prediction uncertainty, or the prediction improvement. The underlying idea is that, by fitting a model to the states that have been visited, new and unfamiliar states will have a higher uncertainty or estimation error. These values are then used to compute the intrinsic reward and incentivize the exploration of unknown states.
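
To make the distinction concrete, here is a minimal, self-contained sketch of both families of bonus. It is only an illustration under simplifying assumptions: the class names are made up, the count-based version discretizes continuous states by rounding, and the dynamics-based version uses a tiny linear forward model trained by stochastic gradient descent in place of the neural network that would normally be used:

```python
import numpy as np
from collections import defaultdict


class CountBasedBonus:
    """Count-based exploration bonus: 1 / sqrt(N(s)).

    Discretizing continuous states by rounding is an illustrative assumption.
    """

    def __init__(self, discretize=lambda s: tuple(np.round(s, 1))):
        self.counts = defaultdict(int)
        self.discretize = discretize

    def __call__(self, state):
        key = self.discretize(np.asarray(state, dtype=np.float64))
        self.counts[key] += 1
        # Rarely visited states have a low count and thus a high bonus.
        return 1.0 / np.sqrt(self.counts[key])


class DynamicsBonus:
    """Dynamics-based exploration bonus: squared error of a forward model.

    A tiny linear model f(s, a) -> s' trained by SGD stands in for the
    neural network that would be used in practice.
    """

    def __init__(self, state_dim, action_dim, lr=1e-2):
        self.W = np.zeros((state_dim, state_dim + action_dim))
        self.lr = lr

    def __call__(self, state, action, next_state):
        x = np.concatenate([state, action])
        prediction_error = next_state - self.W @ x
        # Transitions the model has not learned yet produce a large error
        # and therefore a large intrinsic reward.
        bonus = float(np.sum(prediction_error ** 2))
        # One gradient step, so familiar transitions become less rewarding.
        self.W += self.lr * np.outer(prediction_error, x)
        return bonus


# Quick usage example with a random transition:
count_bonus = CountBasedBonus()
dyn_bonus = DynamicsBonus(state_dim=3, action_dim=1)
s, a, s_next = np.zeros(3), np.array([0.5]), np.random.randn(3)
print(count_bonus(s))            # 1.0 on the first visit
print(dyn_bonus(s, a, s_next))   # large at first, shrinks as the model fits
```

In both cases, the returned bonus plays the role of the intrinsic reward: it can be added to the environment reward as in the earlier sketch, or used on its own for pure curiosity-driven training.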

What happens if we apply only curiosity-driven approaches to the usual environments? The paper Large-Scale Study of Curiosity-Driven Learning addressed this question and found that, on Atari games, pure curiosity-driven agents can learn and master the tasks without any external reward. Furthermore, the authors noted that, on Roboschool, walking behavior emerged purely from these unsupervised, intrinsic-reward-based algorithms. They also suggested that these findings are due to the way the environments were designed: in human-designed environments (such as games), the extrinsic reward is often aligned with the objective of seeking novelty. Nonetheless, even in environments that are not gamified, pure curiosity-driven unsupervised approaches are able to explore and learn about the environment entirely on their own, without any supervision. Alternatively, RL algorithms can benefit from a huge boost in exploration, and consequently in performance, by combining the intrinsic reward with the extrinsic reward.
