Timeline for Reinforcement Learning with PPO - entropy loss dropping, but so is performance. Why?
Current License: CC BY-SA 4.0
5 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Jan 9, 2023 at 12:23 | answer added | Kostya | | timeline score: 3 |
| Jan 9, 2023 at 11:40 | comment added | Андрей Петров | | Does the agent's policy actually change in response to the environment? In my case, with insufficient training under the SGD optimizer, the entropy of the behavior decreased — the probability of some actions grew significantly relative to others — but those probabilities did not depend on the environment at all. Switching to the Adam optimizer solved the problem. |
| Sep 20, 2022 at 14:28 | comment added | Vladimir Belik | | @mikkola Yes, the agent can definitely get rewards through random actions. I understand your overall point, but I guess I don't understand why the agent would converge on the wrong actions rather than the right ones, given that both are available, since the reward is initially high. |
| Sep 20, 2022 at 14:01 | comment added | mikkola | | Is this an environment where the agent can stumble upon the "right solution" by trying things until something works? Conceptually, it could be that while the policy is still stochastic enough (high entropy), the actions that "solve" the environment are chosen after enough tries. But during your training, entropy decreases quickly enough that the policy becomes close to deterministic (low entropy) before the agent has had a chance to reliably learn the right actions, instead converging on the "wrong" ones and choosing them consistently. |
| Sep 19, 2022 at 20:13 | question asked | Vladimir Belik | CC BY-SA 4.0 | |
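The standard mitigation for the premature entropy collapse mikkola describes is to raise the entropy-bonus coefficient in the PPO objective, which penalizes the policy for becoming deterministic too quickly. A minimal sketch in plain Python (the function names `categorical_entropy` and `ppo_loss` are illustrative, not from any particular library):

```python
import math

def categorical_entropy(probs):
    """Shannon entropy of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ppo_loss(ratio, advantage, entropy, clip_eps=0.2, ent_coef=0.01):
    """Clipped PPO surrogate minus an entropy bonus (single sample).

    ratio     -- pi_new(a|s) / pi_old(a|s)
    advantage -- estimated advantage for the taken action
    entropy   -- entropy of the current policy at this state

    A larger ent_coef keeps entropy (and exploration) higher for longer,
    giving the agent more chances to find the "right" actions before
    the policy turns near-deterministic.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    policy_loss = -min(unclipped, clipped)  # PPO maximizes the surrogate
    return policy_loss - ent_coef * entropy

# A uniform two-action policy has maximal entropy, log(2):
h = categorical_entropy([0.5, 0.5])
# With the same ratio/advantage, a larger ent_coef lowers the loss,
# i.e. the optimizer is rewarded for staying stochastic:
loss_small = ppo_loss(1.0, 1.0, h, ent_coef=0.01)
loss_large = ppo_loss(1.0, 1.0, h, ent_coef=0.1)
```

In practice (e.g. in common PPO implementations) this coefficient is a tunable hyperparameter; if entropy crashes while reward is still improving only by luck, increasing it is usually the first knob to try.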