Curiosity Learning

Up until now, we have considered only the extrinsic or external rewards an agent may receive in an environment. The Hallway example, for instance, gives a +1 external reward when the agent reaches the correct goal, and a -1 external reward if it reaches the wrong one. However, real animals, ourselves included, can also learn from internal motivations, or by using an internal reward function. A great example of this is a baby (a cat, a human, or whatever) that has an obvious natural motivation to be curious through play. Playing provides the baby with an internal or intrinsic reward, while the act itself yields a negative external or extrinsic reward: the baby is expending energy, yet it plays on and on in order to learn more general information about its environment. This, in turn, allows it to explore more of the environment and ultimately attain some very difficult goal, such as hunting, or going to work.

This form of internal or intrinsic reward modeling falls into a subclass of RL called Motivated Reinforcement Learning. As you may well imagine, this whole arc of learning could have huge applications in gaming, from creating NPCs to building more believable opponents that are actually motivated by some personality trait or emotion. Imagine having a computer opponent that can get angry, or even show compassion. Of course, we are a long way from getting there, but in the interim, Unity has added an intrinsic reward system to model agent curiosity, called Curiosity Learning.

Curiosity Learning (CL) was first developed by researchers at the University of California, Berkeley, in a paper called Curiosity-Driven Exploration by Self-Supervised Prediction, which you can find at https://pathak22.github.io/noreward-rl/. The paper describes a system for solving sparse-reward problems using forward and inverse neural networks. The authors called the system an Intrinsic Curiosity Module (ICM), with the intent for it to be used as a layer or module on top of other RL systems. This is exactly what Unity did, and they have added it as a module to ML-Agents.

The Lead Researcher at Unity, Dr. Arthur Juliani, has an excellent blog post on their implementation that can be found at https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/.

The ICM works with two networks. An inverse network is trained on the agent's current and next observations: it encodes both observations and predicts the action that was taken between those two states. A forward network is then trained on the current observation's encoding and the action, and it predicts the encoding of the next observation. The difference between the real and predicted encodings of the next observation is then taken; the bigger the difference, the bigger the surprise, and the larger the intrinsic reward. A diagram extracted from Dr. Juliani's blog is shown as follows, describing how this works:



Inner workings of the Curiosity Learning Module

The diagram depicts the two models, forward and inverse, with their layers in blue, the blue lines showing the network flow, the green box representing the intrinsic reward calculation, and the reward output shown as the green dotted lines.
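
To make the idea concrete, the following is a minimal sketch of an ICM-style module in PyTorch. This is not Unity's implementation; the layer sizes, the feature dimension, and the 0.5 scaling on the prediction error are illustrative assumptions, but the structure mirrors the description above: an encoder, an inverse model that predicts the action, and a forward model whose prediction error becomes the intrinsic reward.

```python
# Minimal ICM-style sketch (illustrative assumptions, not Unity's code).
import torch
import torch.nn as nn

class ICM(nn.Module):
    def __init__(self, obs_dim, act_dim, feat_dim=64):
        super().__init__()
        # Encoder: maps raw observations to a learned feature space
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))
        # Inverse model: predicts the action taken between two encoded states
        self.inverse = nn.Sequential(
            nn.Linear(feat_dim * 2, 128), nn.ReLU(),
            nn.Linear(128, act_dim))
        # Forward model: predicts the next encoding from current encoding + action
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))

    def forward(self, obs, next_obs, action_onehot):
        phi = self.encoder(obs)            # encode current observation
        phi_next = self.encoder(next_obs)  # encode next observation
        # Inverse prediction: which action connected the two states?
        pred_action = self.inverse(torch.cat([phi, phi_next], dim=1))
        # Forward prediction: what should the next encoding look like?
        pred_phi_next = self.forward_model(
            torch.cat([phi, action_onehot], dim=1))
        # Intrinsic reward: the forward model's surprise (prediction error)
        intrinsic_reward = 0.5 * (pred_phi_next - phi_next.detach()).pow(2).sum(dim=1)
        return pred_action, pred_phi_next, intrinsic_reward
```

In training, the inverse model would typically be fit with a cross-entropy loss against the true action and the forward model with a mean-squared error against the real next encoding, so the encoder learns to focus on features the agent can actually influence.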

Well, that's enough theory; it's time to see how CL works in practice. Fortunately, Unity provides a well-developed example environment that features this new module, called Pyramids. Let's open Unity and follow the next exercise to see this environment in action:

  1. Open the Pyramid scene from the Assets | ML-Agents | Examples | Pyramids | Scenes folder.
  2. Select AreaPB(1) to AreaPB(15) in the Hierarchy window and then deactivate these objects in the Inspector window.
  3. Leave the scene in Player mode. For the first time, we want you to play the scene on your own and figure out the goal. Even if you have read the blog or played the scene before, try again, but this time, think about what reward functions would need to be in place.
  4. Press Play in the editor and start playing the game in Player mode. If you have not played the game before or don't know the premise, don't be surprised if it takes you a while to solve the puzzle.

Now, for those of you who didn't read or play ahead, here is the premise. The scene starts with the agent randomly placed in an area of rooms containing stone pyramids, one of which has a switch. The goal of the agent is to activate the switch, which then spawns a pyramid of sand boxes with a large gold box on top. The switch turns from red to green when it is activated. After the pyramid appears, the agent needs to knock the pyramid over and retrieve the gold box. It is certainly not the most complex of puzzles, but it does require a bit of exploration and curiosity.

Imagine if we tried to model this form of curiosity, or need to explore, with a set of reward functions. We would need a reward function for activating the switch, moving between rooms, knocking over blocks, and, of course, getting the gold box. Then we would have to determine the value of each of those objectives, perhaps using some form of Inverse Reinforcement Learning (IRL). With Curiosity Learning, however, we can define the reward function for just the end goal of getting the box (+1), plus perhaps a small negative step reward (-0.0001), and then use intrinsic curiosity rewards to let the agent learn the remaining steps. Quite a clever trick, and we will see how this works in the next section.
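
Before moving on, here is a minimal sketch (not Unity's actual Pyramids code) of how such a sparse extrinsic reward can be combined with a per-step curiosity bonus. The function name, the step penalty, and the curiosity_strength weighting are illustrative assumptions; ML-Agents exposes a similar strength setting as a training hyperparameter.

```python
# Sketch of combining a sparse extrinsic reward with a curiosity bonus.
# All names and values are illustrative assumptions.
def total_reward(reached_gold_box, intrinsic_reward,
                 step_penalty=0.0001, curiosity_strength=0.02):
    # Sparse extrinsic signal: +1 only when the gold box is retrieved,
    # with a tiny negative reward every step to discourage endless wandering.
    extrinsic = (1.0 if reached_gold_box else 0.0) - step_penalty
    # The weighted curiosity bonus provides a learning signal on all the
    # steps where the extrinsic reward is effectively zero.
    return extrinsic + curiosity_strength * intrinsic_reward
```

During most of an episode the extrinsic term is just the tiny step penalty, so the curiosity term dominates and pushes the agent toward states its forward model cannot yet predict, which is exactly what leads it to the switch and the pyramid.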
