Results on RoboschoolInvertedPendulum

The performance graph is shown in the following diagram:

The reward is plotted as a function of the number of interactions with the real environment. After 900 real steps, about 15 games, the agent reaches the top score of 1,000. Over the course of training, the policy was updated 15 times and learned from a total of 750,000 simulated steps, an average of about 50,000 simulated steps per update. From a computational point of view, the algorithm trained for about 2 hours on a mid-range computer.
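To make the step budget concrete, the following is a minimal sketch of how a model-based loop might split its real and simulated interactions, assuming the 900 real steps are spread evenly over the 15 policy updates. The helper functions are placeholders introduced here for illustration, not the chapter's actual implementation:

```python
import numpy as np

def collect_real_data(n_steps):
    # Placeholder: gather (state, action, next state) transitions
    # from the real environment.
    return np.zeros((n_steps, 3))

def train_dynamics_model(data):
    # Placeholder: fit a model of the environment's dynamics.
    return lambda state, action: state

def update_policy_on_model(model, n_sim_steps):
    # Placeholder: improve the policy on rollouts simulated by the
    # learned model, returning the number of simulated steps used.
    return n_sim_steps

real_steps, sim_steps = 0, 0
for update in range(15):                   # 15 policy updates in total
    data = collect_real_data(n_steps=60)   # 15 x 60 = 900 real steps
    real_steps += len(data)
    model = train_dynamics_model(data)
    sim_steps += update_policy_on_model(model, n_sim_steps=50_000)

print(f"real steps: {real_steps}, simulated steps: {sim_steps}")
# real steps: 900, simulated steps: 750000
```

The key property this sketch highlights is the ratio between the two budgets: each real-environment step supports hundreds of simulated steps, which is what makes the approach sample-efficient with respect to the real environment.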

We noted that the results are highly variable: training with different random seeds can produce very different performance curves. This is also true for model-free algorithms, but here the differences are more pronounced. One reason for this may be the different data collected in the real environment.
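A common way to quantify this variability is to repeat training with several random seeds and compare the resulting curves. Below is a minimal, self-contained sketch of that protocol; train_agent is a hypothetical stand-in that returns a synthetic reward curve and would be replaced by the actual training run:

```python
import numpy as np

def train_agent(seed, n_real_steps=900):
    # Hypothetical stand-in for a full training run: returns a
    # reward curve indexed by real-environment steps. Swap in the
    # actual algorithm to reproduce the experiment.
    rng = np.random.default_rng(seed)
    return np.clip(np.cumsum(rng.normal(2.0, 5.0, n_real_steps)), 0, 1000)

seeds = [0, 1, 2, 3, 4]
curves = np.stack([train_agent(s) for s in seeds])

# Summarize the spread across seeds at the end of training.
final = curves[:, -1]
print(f"final reward: {final.mean():.0f} +/- {final.std():.0f} "
      f"over {len(seeds)} seeds")
```

Reporting the mean and standard deviation over several seeds, rather than a single curve, gives a more honest picture of performance when the variance between runs is this high.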
