Return

When running a policy in an MDP, the sequence of states and actions $(S_0, A_0, S_1, A_1, \dots)$ is called a trajectory or rollout, and is denoted by $\tau$. Along each trajectory, a sequence of rewards is collected as a result of the actions taken. A function of these rewards is called the return, and in its most simplified version it is defined as follows:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T \qquad (1)$$
At this point, the return can be analyzed separately for trajectories with infinite and finite horizons. This distinction is needed because, when the interaction with the environment does not terminate, the sum presented previously always has an infinite value. This situation is problematic because it doesn't provide any useful information. Such tasks are called continuing tasks and need another formulation of the return. The best solution is to give more weight to short-term rewards while giving less importance to those in the distant future. This is accomplished by using a value between 0 and 1 called the discount factor, denoted with the symbol $\lambda$. Thus, the return $G$ can be reformulated as follows:

$$G_t = R_{t+1} + \lambda R_{t+2} + \lambda^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \lambda^k R_{t+k+1} \qquad (2)$$
This formula can be viewed as a way to prefer actions that are closer in time over those that will be encountered in the distant future. Take this example: you win the lottery and you can decide when you would like to collect the prize. You would probably prefer to collect it within a few days rather than in a few years. $\lambda$ is the value that defines how long you are willing to wait to collect the prize. If $\lambda = 1$, that means that you are not bothered about when you collect the prize. If $\lambda = 0$, that means that you want it immediately.
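To make the effect of the discount factor concrete, here is a minimal Python sketch (not from the book) that computes the discounted return of formula (2) for a list of rewards collected along a trajectory; the reward values and the choices of $\lambda$ are illustrative assumptions.

```python
def discounted_return(rewards, discount=0.99):
    """Compute G = sum_k discount**k * rewards[k] for a finite list of rewards."""
    return sum(discount ** k * r for k, r in enumerate(rewards))

# Illustrative rewards collected along a trajectory (assumed values).
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]

print(discounted_return(rewards, discount=1.0))  # 6.0: every reward counts equally
print(discounted_return(rewards, discount=0.9))  # ~4.09: distant rewards are worth less
print(discounted_return(rewards, discount=0.0))  # 0.0: only the immediate reward matters
```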

In the case of trajectories with a finite horizon, meaning trajectories with a natural ending, the tasks are called episodic (the name derives from the term episode, which is another word for trajectory). In episodic tasks, the original formula (1) works, but it is nevertheless preferable to use a variation of it with the discount factor:

$$G_t = R_{t+1} + \lambda R_{t+2} + \dots + \lambda^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \lambda^k R_{t+k+1} \qquad (3)$$
With a finite but long horizon, the use of a discount factor increases the stability of the algorithms, since rewards far in the future are only partially taken into account. In practice, discount factor values between 0.9 and 0.999 are used.

A trivial but very useful decomposition of formula (3) is the definition of the return in terms of the return at timestep $t + 1$:

$$G_t = R_{t+1} + \lambda \left( R_{t+2} + \lambda R_{t+3} + \dots \right) \qquad (4)$$

When simplifying the notation, it becomes the following:

$$G_t = R_{t+1} + \lambda G_{t+1} \qquad (5)$$
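This recursive form is what most implementations rely on: starting from the end of an episode and moving backwards, each return can be computed from the one that follows it. Below is a minimal Python sketch of this idea, assuming the rewards of a finished episode are already stored in a list (the reward values are made up for illustration).

```python
def returns_from_rewards(rewards, discount=0.99):
    """Compute G_t = R_{t+1} + discount * G_{t+1} for every timestep, scanning backwards."""
    returns = [0.0] * len(rewards)
    future_return = 0.0  # the return after the terminal step is 0
    for t in reversed(range(len(rewards))):
        future_return = rewards[t] + discount * future_return
        returns[t] = future_return
    return returns

# Illustrative rewards of a short episode (assumed values).
print(returns_from_rewards([0.0, 0.0, 1.0], discount=0.9))
# ~[0.81, 0.9, 1.0]: each entry is the discounted return from that timestep onwards
```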
Then, using the return notation, we can define the goal of RL as finding an optimal policy, $\pi^*$, that maximizes the expected return: $\pi^* = \operatorname{argmax}_{\pi} \mathbb{E}_{\tau \sim \pi}[G(\tau)]$, where $\mathbb{E}$ is the expected value of a random variable.
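In practice, this expected return can rarely be computed exactly, so it is estimated by averaging the return over many rollouts of the policy. The following sketch shows such a Monte Carlo estimate; `env` and `policy` are hypothetical placeholders with a Gym-like `reset()`/`step()` interface, not a specific library API.

```python
def estimate_expected_return(env, policy, episodes=100, discount=0.99):
    """Monte Carlo estimate of the expected return: average G(tau) over many rollouts."""
    total = 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        episode_return, weight = 0.0, 1.0
        while not done:
            action = policy(state)                  # hypothetical policy: state -> action
            state, reward, done = env.step(action)  # hypothetical Gym-like step
            episode_return += weight * reward
            weight *= discount
        total += episode_return
    return total / episodes
```

Comparing this estimate across different policies is, conceptually, how one policy is judged better than another.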
