
I've been working through the Sutton and Barto RL text, implementing a number of the algorithms and running them in the OpenAI Gym. One phenomenon I come across quite regularly is that an agent that, at some point during training, appears to be making good progress toward learning a plausible state-value or action-value function will "catastrophically forget" what it has learned and subsequently never recover.

To make this more concrete, here's the (smoothed) reward history of my implementation of a semi-gradient expected SARSA agent with linear function approximation and binary features running on the MountainCar environment.

[Figure: smoothed reward history over training episodes]

Problem Details

  • The problem has a bounded, continuous state space and a discrete, 3-action action space.
  • The learner receives a reward of $-1$ at each timestep. The episode ends when the car reaches the top of the hill, or after 200 timesteps without success.
  • Prior to learning, I tile-coded the state space using an 8x8 grid with 8 overlapping tilings, generated using Sutton's own tiles.py script.
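For reference, the tiling scheme above can be sketched as follows. This is an illustrative stand-in for Sutton's tiles.py (which uses hashing); the uniform offsets, bounds, and index layout here are my own assumptions:

```python
def tile_indices(pos, vel, n_tilings=8, grid=8,
                 lo=(-1.2, -0.07), hi=(0.6, 0.07)):
    """Return one active tile index per tiling for a MountainCar state.

    Each tiling is an 8x8 grid shifted by a fraction of one tile width,
    so the 8 active binary features jointly resolve the state more
    finely than any single grid. Total features: 8 * 8 * 8 = 512.
    """
    # scale each state dimension to [0, 1]
    sx = (pos - lo[0]) / (hi[0] - lo[0])
    sy = (vel - lo[1]) / (hi[1] - lo[1])
    active = []
    for t in range(n_tilings):
        off = t / (n_tilings * grid)  # per-tiling offset
        ix = min(int((sx + off) * grid), grid - 1)
        iy = min(int((sy + off) * grid), grid - 1)
        active.append(t * grid * grid + ix * grid + iy)
    return active
```

With binary features like these, computing Q(s, a) reduces to summing 8 weights, one per tiling.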

Learner Details

  • The learner is an implementation of the semi-gradient expected SARSA algorithm with linear function approximation for the Q value.
  • The agent begins with all Q weights initialized to 0.
  • The learning rate for the agent was set at $\frac{1}{10} \times (\text{# tiles})^{-1}$, where $\text{# tiles}$ in this case was $8 \times 8 \times 8 = 512$
  • During learning, the agent selected its actions using an $\epsilon$-soft policy, where $\epsilon$ is set to 0.10
  • The agent's temporal discount factor is set to $\gamma = 0.95$ (though I realize it could have been 1 given the episodic nature of the task)
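Putting the details above together, the per-step update might look like the following sketch. The function names and structure are my own; the constants mirror the values listed above, and `active` is assumed to be the list of active tile indices for a state:

```python
import numpy as np

N_ACTIONS = 3
N_TILES = 8 * 8 * 8            # 512 binary features
ALPHA = 0.1 / N_TILES          # learning rate from the question
GAMMA, EPS = 0.95, 0.10

def q(w, active, a):
    # linear Q: sum of the weights on the active binary tiles
    return w[a, active].sum()

def expected_q(w, active):
    # expectation of Q(s', .) under the eps-soft behaviour policy
    qs = np.array([q(w, active, a) for a in range(N_ACTIONS)])
    probs = np.full(N_ACTIONS, EPS / N_ACTIONS)
    probs[qs.argmax()] += 1.0 - EPS
    return float(probs @ qs)

def update(w, active, a, r, next_active, done):
    # the gradient of a linear Q w.r.t. w is the binary feature
    # vector itself, so the semi-gradient step only touches the
    # weights of the active tiles for the taken action
    target = r if done else r + GAMMA * expected_q(w, next_active)
    w[a, active] += ALPHA * (target - q(w, active, a))
```

Starting from all-zero weights, a single non-terminal step with reward $-1$ moves the 8 active weights by $-\alpha$ each.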

My Question

One possibility I want to rule out is that my implementation of the learning agent is incorrect. To that end I am curious to know whether others experience this kind of "forgetting" behavior (even on simple problems like this), and if so, how it might be reduced.

Comment (Apr 17, 2021): @StuBernis I am interested in studying your code.

1 Answer

The same thing was happening to me with a deep Q-network on the cart-pole problem. Keeping a replay memory of past (S, A, R, S′) transitions and sampling it to form mini-batches alongside the new observations helped a lot to reduce catastrophic forgetting. Reducing the step size once the agent had improved by a certain amount also helped.
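A minimal sketch of the replay memory described above (the capacity and interface are my own assumptions, not taken from the answer):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of past transitions. Sampling uniformly mixes
    old and new experience in each mini-batch, which breaks the
    temporal correlation that drives catastrophic forgetting."""

    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)  # oldest entries evicted first

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # sample without replacement; cap at the current buffer size
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))
```

Each learning step would then draw a batch with `buffer.sample(32)` and apply the TD update to every transition in it.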
