We applied AC to LunarLander-v2, the same environment we used to test REINFORCE. It is an episodic game and, as such, doesn't fully showcase the main strengths of the AC algorithm. Nonetheless, it provides a good testbed, and you are free to try the algorithm in other environments.
We call the AC function with the following hyperparameters:
AC('LunarLander-v2', hidden_sizes=[64], ac_lr=4e-3, cr_lr=1.5e-2, gamma=0.99, steps_per_epoch=100, num_epochs=8000)
The resulting plot that shows the total reward accumulated in the training epochs is as follows:
You can see that AC learns faster than REINFORCE, as shown in the following plot. However, it is less stable: after about 200,000 steps, the performance declines slightly, though fortunately it resumes improving afterward:
In this configuration, the AC algorithm updates the actor and the critic every 100 steps. In theory, you could use a smaller steps_per_epoch but, usually, that makes the training more unstable. Using more steps per epoch can stabilize the training, but the actor learns more slowly. It's all about finding a good trade-off and good learning rates.
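To make the trade-off concrete, here is a minimal sketch of how steps_per_epoch controls the update cadence. The environment interaction and the update step are stand-ins, not the actual AC implementation from the chapter; the point is only to show how many actor/critic updates a given number of environment steps produces:

```python
# Hedged sketch: steps_per_epoch sets how often the actor and critic
# are updated. train() is a hypothetical counter, not the chapter's
# AC function.

def count_updates(total_steps, steps_per_epoch):
    """Count actor/critic updates for a given update cadence."""
    buffer, n_updates = [], 0
    for step in range(total_steps):
        buffer.append(step)            # stand-in for (state, action, reward)
        if len(buffer) == steps_per_epoch:
            n_updates += 1             # one actor + one critic update
            buffer.clear()             # start collecting the next epoch
    return n_updates

# With the chapter's setting of 100 steps per epoch, 200,000 environment
# steps correspond to 2,000 updates; halving steps_per_epoch doubles the
# update frequency, which speeds up learning at the cost of noisier,
# higher-variance gradient estimates.
print(count_updates(200_000, 100))  # 2000
print(count_updates(200_000, 50))   # 4000
```

Shorter epochs mean each update is computed from fewer transitions, so the advantage estimates are noisier; longer epochs average over more data but delay each policy improvement.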