I'm using PPO with an action-mask and I'm encountering a weird phenomenon.
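By "action mask" I mean the usual approach of forcing invalid actions' logits down to an effectively -inf value before sampling; here's a minimal PyTorch sketch of that idea (not my actual code), assuming a discrete action space:

```python
import torch
from torch.distributions import Categorical

def masked_distribution(logits: torch.Tensor, action_mask: torch.Tensor) -> Categorical:
    # Invalid actions get a huge negative logit, so their probability is ~0 after softmax.
    masked_logits = torch.where(action_mask.bool(), logits, torch.full_like(logits, -1e8))
    return Categorical(logits=masked_logits)

logits = torch.randn(5)                  # fake policy logits for 5 actions
mask = torch.tensor([1, 0, 1, 1, 0])     # 1 = legal action, 0 = illegal
action = masked_distribution(logits, mask).sample()   # only ever samples legal actions
```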
At first during training, the entropy loss decreases (I interpret this as less exploration, more exploitation, and more "certainty" about the policy) and my mean reward per episode increases. This is all exactly what I would expect.
Then, at a certain point, the entropy loss continues to decrease; HOWEVER, performance now starts consistently decreasing as well. I've set up my code to halve the learning rate whenever this happens (I've read that adaptively annealing the learning rate can help PPO), but the problem persists.
I don't understand why this would happen, either on a conceptual level or on a practical one. Any ideas, insights, and advice would be greatly appreciated!
I run my model for ~75K training steps before checking its entropy and performance.
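Concretely, the check-and-halve loop looks roughly like this. It's a simplified sketch, not my exact code: `train_for_steps`, `evaluate_mean_reward`, `model`, and `optimizer` are placeholders for my actual training helper, evaluation helper, policy, and the policy's underlying PyTorch optimizer.

```python
CHECK_INTERVAL = 75_000          # ~75K training steps between checks
NUM_CHECKS = 40                  # however many checks fit the training budget
lr = 0.005                       # initial learning rate
prev_reward = float("-inf")

for _ in range(NUM_CHECKS):
    # Placeholder helpers: run PPO updates for one interval, then roll out some
    # evaluation episodes and return the mean episodic reward.
    train_for_steps(model, steps=CHECK_INTERVAL)
    mean_reward = evaluate_mean_reward(model, n_episodes=20)

    if mean_reward < prev_reward:
        # Performance dropped since the last check: halve the learning rate.
        lr *= 0.5
        for group in optimizer.param_groups:   # the policy's PyTorch optimizer
            group["lr"] = lr

    prev_reward = mean_reward
```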
Here are all the parameters of my model (with a rough code sketch of the same setup after the list):
Learning rate: 0.005, set to decrease by 1/2 every time performance drops during a check
Gamma: 0.975
Batch Size: 2048
Rollout Buffer Size: 4 parallel environments x 16,384 n_steps = ~65,500
n_epochs: 2
Network size: Both networks (actor and critic) are 352 x 352
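Just to make those settings concrete, here they are expressed as an sb3_contrib `MaskablePPO` call (illustrative sketch only; `make_my_masked_vec_env` is a placeholder for however the four parallel masked environments get built, and the `net_arch` line assumes "352 x 352" means two 352-unit hidden layers per network):

```python
from sb3_contrib import MaskablePPO

# Placeholder: however the 4 parallel action-masked environments actually get built.
env = make_my_masked_vec_env(n_envs=4)

model = MaskablePPO(
    "MlpPolicy",
    env,
    learning_rate=0.005,   # halved whenever a check shows a performance drop
    gamma=0.975,
    n_steps=16_384,        # per env: 4 envs x 16,384 steps = 65,536-step rollout buffer
    batch_size=2048,
    n_epochs=2,
    policy_kwargs=dict(
        # Two 352-unit hidden layers for both the actor and the critic.
        net_arch=dict(pi=[352, 352], vf=[352, 352]),
    ),
    verbose=1,
)
```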