There are several further extensions of AC algorithms, and many tips and tricks to keep in mind when designing them:
- Architectural design: In our implementation, we used two distinct neural networks, one for the critic and one for the actor. It's also possible to design a single network that shares the main hidden layers while keeping the two heads distinct. This architecture can be more difficult to tune, but because the actor and the critic learn from a common representation, it reduces the parameter count and generally makes the algorithm more efficient (see the first sketch after this list).
- Parallel environments: A widely adopted technique for decreasing the variance is to collect experience from multiple environments in parallel. The A3C (Asynchronous Advantage Actor-Critic) algorithm updates the global parameters asynchronously. In contrast, its synchronous version, called A2C (Advantage Actor-Critic), waits for all of the parallel actors to finish before updating the global parameters. Parallelizing the agents yields more decorrelated experience, since each copy explores a different part of the environment (see the second sketch after this list).
- Batch size: Compared with other RL algorithms (especially off-policy ones), policy gradient and AC methods need large batches. Thus, if the algorithm doesn't stabilize after you have tuned the other hyperparameters, consider using a larger batch size.
- Learning rate: Tuning the learning rate by hand is very tricky, so make sure that you use a more advanced stochastic gradient descent optimizer, such as Adam or RMSprop (see the last snippet after this list).
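
As a reference for the first point, here is a minimal sketch of a shared-trunk actor-critic network. It is written in PyTorch purely for illustration (the implementation in this chapter may use a different framework), and the class name, layer sizes, and activation choices are hypothetical, not the chapter's actual architecture:

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor-critic network with shared hidden layers and two distinct heads."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Hidden layers shared by the actor and the critic.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # Distinct heads: action logits for the actor, a scalar value for the critic.
        self.actor_head = nn.Linear(hidden, n_actions)
        self.critic_head = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        return self.actor_head(features), self.critic_head(features).squeeze(-1)
```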
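For the parallel-environments point, the sketch below steps several copies of an environment in lockstep, which is the synchronous (A2C-style) collection scheme. It assumes the Gymnasium library and the CartPole-v1 environment as illustrative stand-ins; a real agent would sample actions from its policy rather than at random:

```python
import gymnasium as gym

N_ENVS, N_STEPS = 8, 5  # illustrative values

# Several copies of the environment, stepped synchronously in lockstep.
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(N_ENVS)]
)

obs, _ = envs.reset(seed=0)
for _ in range(N_STEPS):
    actions = envs.action_space.sample()  # stand-in for sampling from the policy
    obs, rewards, terminated, truncated, _ = envs.step(actions)
    # rewards has shape (N_ENVS,): one transition per environment, all
    # collected before a single, synchronous update of the global parameters.
envs.close()
```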
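Finally, for the learning-rate point, switching to an adaptive optimizer is a one-line change in PyTorch. The learning rates below are illustrative only, and the parameter groups reuse the hypothetical SharedActorCritic module sketched above:

```python
import torch

net = SharedActorCritic(obs_dim=4, n_actions=2)  # the sketch shown earlier

# Adam maintains a per-parameter adaptive step size, which makes training far
# less sensitive to the exact learning rate than plain SGD. Parameter groups
# also allow the critic head to learn at a different rate than the actor head.
optimizer = torch.optim.Adam([
    {"params": net.trunk.parameters(), "lr": 3e-4},
    {"params": net.actor_head.parameters(), "lr": 3e-4},
    {"params": net.critic_head.parameters(), "lr": 1e-3},
])
```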