
At each timestep, that is, after each action taken by the agent, the environment sends back a number that indicates how good that action was. This number is called the reward. As we have already mentioned, the end goal of the agent is to maximize the cumulative reward obtained during its interaction with the environment.
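The interaction described above can be sketched as a short loop. This is only a minimal illustration: `ToyEnv`, its reward values (-1 per step, +10 at the goal), and the `always_right` policy are hypothetical and are not taken from any library.

```python
class ToyEnv:
    """Hypothetical 1-D corridor: the agent starts at 0 and must reach position 3."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is +1 (move right) or -1 (move left); position cannot go below 0
        self.position = max(0, self.position + action)
        done = self.position == 3
        # Assumed reward scheme: +10 on reaching the goal, -1 for every other step
        reward = 10.0 if done else -1.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=50):
    """Run one episode and return the cumulative reward the agent collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent decides
        state, reward, done = env.step(action)  # the environment answers with a reward
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return 1


print(run_episode(ToyEnv(), always_right))  # 8.0
```

The quantity the agent tries to maximize is `total_reward`, not any single `reward` value.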

In the literature, the reward is assumed to be part of the environment, but that's not strictly true in reality. The reward can come from the agent too, but never from its decision-making part. For this reason, and to simplify the formulation, the reward is always treated as being sent from the environment.

The reward is the only supervision signal injected into the RL cycle, so it is essential to design it correctly in order to obtain an agent with good behavior. If the reward has flaws, the agent may find them and exploit them, ending up with incorrect behavior. For example, CoastRunners is a boat-racing game in which the goal is to finish ahead of the other players. Along the route, the boats are rewarded for hitting targets. Researchers at OpenAI trained an agent with RL to play it. They found that, instead of heading to the finish line as fast as possible, the trained boat drove in circles to capture re-populating targets while crashing and catching fire. In this way, the boat found a way to maximize the total reward without acting as expected. This behavior was due to an incorrect balance between short-term and long-term rewards.
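The imbalance can be made concrete with some back-of-the-envelope arithmetic. All of the numbers below (episode length, reward per target, finish bonus, lap length) are invented for illustration and are not taken from the game or from the OpenAI experiment.

```python
# Hypothetical reward design, for illustration only
EPISODE_STEPS = 100   # assumed maximum episode length
TARGET_REWARD = 1     # assumed reward for each target hit
FINISH_BONUS = 10     # assumed reward for crossing the finish line


def return_finish(targets_on_route=3):
    # Race straight to the finish, hitting only the targets that lie on the route.
    return targets_on_route * TARGET_REWARD + FINISH_BONUS


def return_loop(steps_per_lap=4):
    # Circle around one re-populating target for the whole episode, never finishing.
    laps = EPISODE_STEPS // steps_per_lap
    return laps * TARGET_REWARD


print(return_finish())  # 13
print(return_loop())    # 25
```

Under these assumed values, circling earns almost twice the return of finishing the race, so an agent that maximizes cumulative reward would prefer to circle. Raising the finish bonus or stopping targets from re-populating would change which behavior is optimal.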

The reward can appear with different frequencies depending on the environment. A reward that is given frequently is called a dense reward; if it is given only a few times during a game, or only at its end, it is called a sparse reward. In the latter case, it can be very difficult for an agent to encounter the reward at all and to find the optimal actions.
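The difference between the two can be sketched with two reward functions for the same task. The task (reaching a goal position on a line) and both functions are hypothetical examples, not a standard definition.

```python
GOAL = 5  # hypothetical goal position on a 1-D line


def dense_reward(position):
    # Feedback at every step: the closer to the goal, the higher the reward.
    return -abs(GOAL - position)


def sparse_reward(position):
    # Feedback only on reaching the goal; every other step returns 0.
    return 1 if position == GOAL else 0


trajectory = [0, 1, 2, 3, 4, 5]
dense = [dense_reward(p) for p in trajectory]
sparse = [sparse_reward(p) for p in trajectory]
print(dense)   # [-5, -4, -3, -2, -1, 0]
print(sparse)  # [0, 0, 0, 0, 0, 1]
```

With the dense version, every step tells the agent whether it is getting closer. With the sparse version, an agent that never happens to reach position 5 only ever sees zeros and has nothing to learn from.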

Imitation learning and inverse RL are two powerful techniques that deal with the absence of a reward in the environment. Imitation learning uses expert demonstrations to learn a mapping from states to actions. Inverse RL, on the other hand, deduces the reward function from an expert's optimal behavior. Imitation learning and inverse RL will be studied in Chapter 10, Imitation Learning with the DAgger Algorithm.
