A brief recap of RL

At the start, a policy is initialized randomly and used to interact with the environment, either for a given number of steps or for entire trajectories, in order to collect data. At each interaction, the state visited, the action taken, and the reward obtained are recorded. Together, this information fully describes the agent's influence on the environment. Then, to improve the policy, the backpropagation algorithm computes the gradient of a loss function with respect to each weight of the network, so that the predictions move toward a better estimate. These gradients are then applied by a stochastic gradient descent (SGD) optimizer. This process of gathering data from the environment and optimizing the neural network with SGD is repeated until a convergence criterion is met.
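As a concrete illustration of this loop, here is a minimal sketch of one data-gathering/optimization cycle in Python. It assumes PyTorch and Gymnasium as the libraries, CartPole-v1 as the environment, and a simple REINFORCE-style loss; the network size, learning rate, and discount value are illustrative choices, not specifics taken from this text.

```python
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

GAMMA = 0.99         # discount factor (illustrative value)
NUM_ITERATIONS = 50  # illustrative number of gather-then-optimize cycles

env = gym.make("CartPole-v1")
# Randomly initialized policy network mapping states to action logits.
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.Tanh(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

for _ in range(NUM_ITERATIONS):
    # 1) Interact with the environment for an entire trajectory,
    #    recording each state visited, action taken, and reward obtained.
    states, actions, rewards = [], [], []
    obs, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = torch.distributions.Categorical(logits=logits).sample()
        next_obs, reward, terminated, truncated, _ = env.step(action.item())
        states.append(obs)
        actions.append(action.item())
        rewards.append(float(reward))
        obs, done = next_obs, terminated or truncated

    # 2) Turn the recorded rewards into discounted returns
    #    (the temporal credit assignment discussed below).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # 3) Backpropagation computes the gradient of the loss with respect
    #    to every weight, and the SGD optimizer applies those gradients.
    logits = policy(torch.as_tensor(np.array(states), dtype=torch.float32))
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(
        torch.as_tensor(actions)
    )
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Each pass through the outer loop is one cycle of gathering data and one SGD update; in practice, the loop runs until a convergence criterion is met.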

There are two important things to note here that will be useful in the following discussion:

  • Temporal credit assignment: Because RL algorithms optimize the policy at each step, they need a way to attribute the quality of an outcome to the individual states and actions that produced it. This is done by assigning a value to each state-action pair. Moreover, a discount factor is used to reduce the influence of temporally distant actions and to give more weight to the most recent ones (as formalized after this list). Discounting helps solve the problem of assigning credit to actions, but it also introduces inaccuracies into the system.
  • Exploration: To maintain a degree of exploration over actions, RL algorithms inject additional noise into the policy. How the noise is injected depends on the algorithm, but usually the actions are sampled from a stochastic distribution. As a result, an agent that finds itself in the same situation twice may take different actions, leading to two different paths. This strategy also encourages exploration in deterministic environments: by deviating from its previous path each time, the agent may discover different, and potentially better, solutions. Because this additional noise asymptotically tends to 0, the agent is eventually able to converge to a better, final deterministic policy (see the sketch after this list).
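For reference, the discounted return that underlies this kind of credit assignment is typically written as follows, where $\gamma \in [0, 1)$ is the discount factor and $r_{t+k}$ is the reward received $k$ steps after time $t$ (a standard formulation, not one given in this excerpt):

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}$$

Rewards far in the future are multiplied by increasingly small powers of $\gamma$, which is both what makes credit assignment tractable and what introduces the inaccuracies mentioned above.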
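To make the exploration point concrete, the following is a minimal sketch of Gaussian exploration noise whose scale decays toward 0 over training. The continuous action space, the initial scale of 0.5, and the linear decay schedule are illustrative assumptions, not choices prescribed by this text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(mean_action, step, total_steps):
    """Add Gaussian noise to the policy's mean action.

    The noise scale shrinks linearly toward 0 as training progresses,
    so the policy becomes effectively deterministic at the end.
    """
    noise_scale = 0.5 * (1.0 - step / total_steps)
    return mean_action + rng.normal(scale=noise_scale, size=mean_action.shape)

# Early in training, the same state can yield different actions...
early = [sample_action(np.zeros(2), step=0, total_steps=1000) for _ in range(2)]
# ...while near the end of training, the noise has almost vanished.
late = sample_action(np.zeros(2), step=999, total_steps=1000)
print(early, late)
```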

But are backpropagation, temporal credit assignment, and stochastic actions actually prerequisites for learning and building complex policies?
