There are several further extensions of AC algorithms, and many tips and tricks to keep in mind when designing them:
- Architectural design: In our implementation, we used two distinct neural networks, one for the critic and one for the actor. It's also possible to design a single network that shares the main hidden layers while keeping the two heads distinct. This architecture can be more difficult to tune, but because the actor and the critic learn from a common representation, it reduces the parameter count and generally makes the algorithm more efficient (see the first sketch after this list).
- Parallel environments: A widely adopted technique for decreasing the variance is to collect experience from multiple environments in parallel. The A3C (Asynchronous Advantage Actor-Critic) algorithm updates the global parameters asynchronously. In contrast, its synchronous version, called A2C (Advantage Actor-Critic), waits for all of the parallel actors to finish before updating the global parameters. Parallelizing the agents yields more decorrelated experience, since each copy explores a different part of the environment (see the second sketch after this list).
- Batch size: Compared with other RL algorithms (especially off-policy ones), policy gradient and AC methods need large batches. Thus, if the algorithm doesn't stabilize after you have tuned the other hyperparameters, consider using a larger batch size.
- Learning rate: Tuning the learning rate by hand is very tricky, so make sure that you use a more advanced stochastic gradient descent optimizer, such as Adam or RMSprop (see the last snippet after this list).
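
As a reference for the first point, here is a minimal sketch of a shared-trunk actor-critic network. It is written in PyTorch purely for illustration (the implementation in this chapter may use a different framework), and the class name, layer sizes, and activation choices are hypothetical, not the chapter's actual architecture:

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor-critic network with shared hidden layers and two distinct heads."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Hidden layers shared by the actor and the critic.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # Distinct heads: action logits for the actor, a scalar value for the critic.
        self.actor_head = nn.Linear(hidden, n_actions)
        self.critic_head = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        return self.actor_head(features), self.critic_head(features).squeeze(-1)
```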
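For the parallel-environments point, the sketch below steps several copies of an environment in lockstep, which is the synchronous (A2C-style) collection scheme. It assumes the Gymnasium library and the CartPole-v1 environment as illustrative stand-ins; a real agent would sample actions from its policy rather than at random:

```python
import gymnasium as gym

N_ENVS, N_STEPS = 8, 5  # illustrative values

# Several copies of the environment, stepped synchronously in lockstep.
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(N_ENVS)]
)

obs, _ = envs.reset(seed=0)
for _ in range(N_STEPS):
    actions = envs.action_space.sample()  # stand-in for sampling from the policy
    obs, rewards, terminated, truncated, _ = envs.step(actions)
    # rewards has shape (N_ENVS,): one transition per environment, all
    # collected before a single, synchronous update of the global parameters.
envs.close()
```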
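Finally, for the learning-rate point, switching to an adaptive optimizer is a one-line change in PyTorch. The learning rates below are illustrative only, and the parameter groups reuse the hypothetical SharedActorCritic module sketched above:

```python
import torch

net = SharedActorCritic(obs_dim=4, n_actions=2)  # the sketch shown earlier

# Adam maintains a per-parameter adaptive step size, which makes training far
# less sensitive to the exact learning rate than plain SGD. Parameter groups
# also allow the critic head to learn at a different rate than the actor head.
optimizer = torch.optim.Adam([
    {"params": net.trunk.parameters(), "lr": 3e-4},
    {"params": net.actor_head.parameters(), "lr": 3e-4},
    {"params": net.critic_head.parameters(), "lr": 1e-3},
])
```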