User experience

Now, the first thing we need to clarify is how to sample from the environment, and how to interact with it to get usable information about its dynamics:

Figure 4.1. A trajectory that starts from an initial state and runs until termination

The simplest way to do this is to execute the current policy until the end of the episode. You would end up with a trajectory like the one shown in figure 4.1. Once the episode terminates, the return can be computed for each visited state by propagating the sum of the rewards backward along the trajectory, $G_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_T$. Repeating this process multiple times (that is, running multiple trajectories) yields multiple return values for every state. These returns are then averaged for each state to obtain the expected returns, and the expected returns computed in this way form an approximation of the value function. The execution of a policy until a terminal state is reached is called a trajectory or an episode. The more trajectories that are run, the more returns are observed, and by the law of large numbers, the average of these estimates converges to the expected value.
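To make this concrete, here is a minimal sketch of this averaging procedure. It assumes an environment with a Gym-style interface (reset() returning the initial state and step(action) returning (next_state, reward, done, info)); the names sample_episode and mc_value_estimation are illustrative, not taken from the book.

from collections import defaultdict

def sample_episode(env, policy):
    """Execute the policy until a terminal state, collecting (state, reward) pairs."""
    episode = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        episode.append((state, reward))
        state = next_state
    return episode

def mc_value_estimation(env, policy, n_episodes=1000, gamma=0.99):
    """Approximate V(s) by averaging the returns observed after each visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode(env, policy)
        G = 0.0
        # Walk the trajectory backward, accumulating the discounted sum of rewards
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_count[state] += 1
    # The average of the observed returns is the estimate of the value function
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

Each state's estimate is simply the average of the returns observed after its visits, which is exactly the averaging described above.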

Like DP, the algorithms that learn a policy by direct interaction with the environment rely on the concepts of policy evaluation and policy improvement. Policy evaluation is the act of estimating the value function of a policy, while policy improvement uses the estimates made in the previous phase to improve the policy.
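The following sketch shows how these two phases can be expressed with the sampled estimates, again assuming a Gym-style environment; mc_q_estimation and improve_policy are illustrative names, and the action-value function Q is estimated instead of V so that the greedy improvement step does not require a model of the environment.

import numpy as np
from collections import defaultdict

def mc_q_estimation(env, policy, n_actions, n_episodes=500, gamma=0.99):
    """Policy evaluation: estimate Q(s, a) by averaging sampled returns."""
    q_sum = defaultdict(lambda: np.zeros(n_actions))
    q_count = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(n_episodes):
        state, done, episode = env.reset(), False, []
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        G = 0.0
        # Propagate the discounted sum of rewards backward along the trajectory
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            q_sum[state][action] += G
            q_count[state][action] += 1
    return {s: q_sum[s] / np.maximum(q_count[s], 1) for s in q_sum}

def improve_policy(Q, n_actions):
    """Policy improvement: act greedily with respect to the current estimates."""
    def policy(state):
        if state in Q:
            return int(np.argmax(Q[state]))
        return np.random.randint(n_actions)  # unseen state: fall back to a random action
    return policy

Alternating the two functions (evaluate the current policy, then act greedily on the resulting estimates) is the same evaluation-improvement cycle used in DP, except that the estimates now come from sampled experience rather than from the model.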
