Timeline for Reinforcement Learning with PPO - entropy loss dropping, but so is performance. Why?
Current License: CC BY-SA 4.0
5 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Jan 9, 2023 at 12:23 | answer added | Kostya | | timeline score: 3 |
| Jan 9, 2023 at 11:40 | comment added | Андрей Петров | | Does the agent's policy actually change in response to the environment? In my case, with insufficient training under the SGD optimizer, the entropy of the behavior decreased — the probability of some actions grew significantly relative to others — but those probabilities did not depend on the environment at all. Switching to the Adam optimizer solved the problem. |
| Sep 20, 2022 at 14:28 | comment added | Vladimir Belik | | @mikkola Yes, the agent can definitely get rewards through random actions. I understand your overall point, but I guess I don't understand why the agent would converge on the wrong actions rather than the right ones, given that both are available, since the reward is initially high. |
| Sep 20, 2022 at 14:01 | comment added | mikkola | | Is this an environment where the agent can stumble upon the "right solution" by trying things until something works? Conceptually, it could be that while the policy is still stochastic enough (high entropy), the actions that "solve" the environment are chosen after enough tries. But during your training, entropy decreases quickly enough that the policy becomes close to deterministic (low entropy) before the agent has had a chance to reliably learn the right actions, instead converging on the "wrong" ones and choosing them consistently. |
| Sep 19, 2022 at 20:13 | question asked | Vladimir Belik | CC BY-SA 4.0 | |
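The standard mitigation for the premature entropy collapse mikkola describes is to raise the entropy-bonus coefficient in the PPO objective, which penalizes the policy for becoming deterministic too quickly. A minimal sketch in plain Python (the function names `categorical_entropy` and `ppo_loss` are illustrative, not from any particular library):

```python
import math

def categorical_entropy(probs):
    """Shannon entropy of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ppo_loss(ratio, advantage, entropy, clip_eps=0.2, ent_coef=0.01):
    """Clipped PPO surrogate minus an entropy bonus (single sample).

    ratio     -- pi_new(a|s) / pi_old(a|s)
    advantage -- estimated advantage for the taken action
    entropy   -- entropy of the current policy at this state

    A larger ent_coef keeps entropy (and exploration) higher for longer,
    giving the agent more chances to find the "right" actions before
    the policy turns near-deterministic.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    policy_loss = -min(unclipped, clipped)  # PPO maximizes the surrogate
    return policy_loss - ent_coef * entropy

# A uniform two-action policy has maximal entropy, log(2):
h = categorical_entropy([0.5, 0.5])
# With the same ratio/advantage, a larger ent_coef lowers the loss,
# i.e. the optimizer is rewarded for staying stochastic:
loss_small = ppo_loss(1.0, 1.0, h, ent_coef=0.01)
loss_large = ppo_loss(1.0, 1.0, h, ent_coef=0.1)
```

In practice (e.g. in common PPO implementations) this coefficient is a tunable hyperparameter; if entropy crashes while reward is still improving only by luck, increasing it is usually the first knob to try.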