I'm using PPO with an action-mask and I'm encountering a weird phenomenon.
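By "action mask" I mean the usual approach of forcing invalid actions' logits down to an effectively -inf value before sampling; here's a minimal PyTorch sketch of that idea (not my actual code), assuming a discrete action space:

```python
import torch
from torch.distributions import Categorical

def masked_distribution(logits: torch.Tensor, action_mask: torch.Tensor) -> Categorical:
    # Invalid actions get a huge negative logit, so their probability is ~0 after softmax.
    masked_logits = torch.where(action_mask.bool(), logits, torch.full_like(logits, -1e8))
    return Categorical(logits=masked_logits)

logits = torch.randn(5)                  # fake policy logits for 5 actions
mask = torch.tensor([1, 0, 1, 1, 0])     # 1 = legal action, 0 = illegal
action = masked_distribution(logits, mask).sample()   # only ever samples legal actions
```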
At first during training, the entropy loss decreases (I interpret this as less exploration, more exploitation, and more "certainty" about the policy) and my mean reward per episode increases. This is all exactly what I would expect.
Then, at a certain point, the entropy loss continues to decrease; HOWEVER, performance now starts consistently decreasing as well. I've set up my code to halve the learning rate whenever this happens (I've read that adaptively annealing the learning rate can help PPO), but the problem persists.
I don't understand why this would happen, either on a conceptual level or on a practical one. Any ideas, insights, and advice would be greatly appreciated!
I run my model for ~75K training steps before checking its entropy and performance.
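Concretely, the check-and-halve loop looks roughly like this. It's a simplified sketch, not my exact code: `train_for_steps`, `evaluate_mean_reward`, `model`, and `optimizer` are placeholders for my actual training helper, evaluation helper, policy, and the policy's underlying PyTorch optimizer.

```python
CHECK_INTERVAL = 75_000          # ~75K training steps between checks
NUM_CHECKS = 40                  # however many checks fit the training budget
lr = 0.005                       # initial learning rate
prev_reward = float("-inf")

for _ in range(NUM_CHECKS):
    # Placeholder helpers: run PPO updates for one interval, then roll out some
    # evaluation episodes and return the mean episodic reward.
    train_for_steps(model, steps=CHECK_INTERVAL)
    mean_reward = evaluate_mean_reward(model, n_episodes=20)

    if mean_reward < prev_reward:
        # Performance dropped since the last check: halve the learning rate.
        lr *= 0.5
        for group in optimizer.param_groups:   # the policy's PyTorch optimizer
            group["lr"] = lr

    prev_reward = mean_reward
```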
Here are all the parameters of my model (with a rough code sketch of the same setup after the list):
Learning rate: 0.005, set to decrease by 1/2 every time performance drops during a check
Gamma: 0.975
Batch Size: 2048
Rollout Buffer Size: 4 parallel environments x 16,384 n_steps = ~65,500
n_epochs: 2
Network size: Both networks (actor and critic) are 352 x 352
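Just to make those settings concrete, here they are expressed as an sb3_contrib `MaskablePPO` call (illustrative sketch only; `make_my_masked_vec_env` is a placeholder for however the four parallel masked environments get built, and the `net_arch` line assumes "352 x 352" means two 352-unit hidden layers per network):

```python
from sb3_contrib import MaskablePPO

# Placeholder: however the 4 parallel action-masked environments actually get built.
env = make_my_masked_vec_env(n_envs=4)

model = MaskablePPO(
    "MlpPolicy",
    env,
    learning_rate=0.005,   # halved whenever a check shows a performance drop
    gamma=0.975,
    n_steps=16_384,        # per env: 4 envs x 16,384 steps = 65,536-step rollout buffer
    batch_size=2048,
    n_epochs=2,
    policy_kwargs=dict(
        # Two 352-unit hidden layers for both the actor and the critic.
        net_arch=dict(pi=[352, 352], vf=[352, 352]),
    ),
    verbose=1,
)
```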