5 events
Jan 9, 2023 at 12:23 answer added Kostya timeline score: 3
Jan 9, 2023 at 11:40 comment added Андрей Петров Does the agent's policy actually respond to changes in the environment? In my case, insufficient training with the SGD optimizer caused the entropy of the behavior to collapse: the probability of some actions grew much larger than that of others, yet those probabilities did not depend on the environment at all. Switching to the Adam optimizer solved the problem.
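The optimizer swap described in the comment above can matter because SGD and Adam scale their steps very differently. Below is a minimal sketch (not code from this thread) of the two update rules for a single scalar parameter, showing that with a tiny gradient plain SGD barely moves while Adam's normalized step stays usefully large:

```python
import math

def sgd_step(theta, grad, lr=0.01):
    # Plain SGD: the step is directly proportional to the raw gradient.
    return theta - lr * grad

def adam_step(theta, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: the step is normalized by running first/second moment estimates,
    # so it stays well-scaled even when gradients are tiny or noisy.
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
grad = 1e-4  # a very small gradient, as in a nearly flat loss region
theta_sgd = sgd_step(1.0, grad)
theta_adam = adam_step(1.0, grad, state)
print(theta_sgd, theta_adam)  # Adam moves the parameter far more than SGD here
```

With these (hypothetical) learning rates, SGD moves the parameter by lr * grad = 1e-6, while Adam's first step is close to its full lr of 1e-3, which is one plausible reason an under-trained SGD run can stall in a poorly-scaled region.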
Sep 20, 2022 at 14:28 comment added Vladimir Belik @mikkola Yes, the agent can definitely get rewards through random actions. I understand your overall point, but I guess I don't understand why the agent would converge on the wrong actions rather than the right ones, given that both are available while the entropy is initially high?
Sep 20, 2022 at 14:01 comment added mikkola Is this an environment where the agent can stumble upon the "right solution" by trying things out until something works? Conceptually, it could be that while the policy is still stochastic enough (high entropy), the actions that "solve" the environment are chosen after enough tries. But during your training, entropy decreases quickly enough that the policy becomes nearly deterministic (low entropy) before the agent has had a chance to reliably learn the right actions, so it converges on the "wrong" ones and chooses them consistently.
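The entropy collapse described in the comment above can be illustrated with a toy calculation (a hedged sketch, not code from the question): as the logits of a softmax policy sharpen toward one action, the entropy of the action distribution falls from its uniform maximum and exploration effectively stops.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of action logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    # Shannon entropy in nats; high entropy = near-uniform, exploratory policy.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Simulate training progressively sharpening one action's logit.
for scale in [0.0, 1.0, 3.0, 6.0]:
    probs = softmax([scale, 0.0, 0.0, 0.0])
    print(f"scale={scale:.1f}  entropy={entropy(probs):.3f}  max_p={max(probs):.3f}")
```

At scale 0.0 the four-action policy is uniform (entropy ln 4), and entropy drops monotonically as the logit grows; if this happens before the "right" actions are discovered, the policy locks in whatever it happens to prefer. This is why many policy-gradient implementations add an entropy bonus term to the loss to slow the collapse.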
Sep 19, 2022 at 20:13 history asked Vladimir Belik CC BY-SA 4.0