The core

In the paper, a version of ES is used that maximizes the average objective value, as follows:

It does this by searching over a population, , that's parameterized by  with stochastic gradient ascent.  is the objective function (or fitness function) while  is the parameters of the actor. In our problems,  is simply the stochastic return that's obtained by the agent with  in the environment.

The population distribution, ,is a multivariate Gaussian with a mean, , and fixed standard deviation, , as follows:

From here, we can define the step update by using the stochastic gradient estimate, as follows:

With this update, we can estimate the stochastic gradient (without performing backpropagation) using the results of the episodes from the population. We can update the parameters using one of the well-known update methods, such as Adam or RMSProp as well.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset