What is the Bellman equation update rule for average-reward reinforcement learning? I searched a few articles but could not find a practical answer.
1 Answer
In general, the average-reward setting replaces the discounted setting in continuing tasks. It relies on there being a long-term stable distribution of states under any particular policy (this property is called ergodicity), which will usually be true for continuing MDPs that do not have absorbing states.
If you see an update rule in the discounted setting that looks like this (for Q learning):
$$Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma\max_{a'}Q(s',a') - Q(s,a))$$
Then you replace the discounted TD error with the differential TD error:
$$Q(s,a) \leftarrow Q(s,a) + \alpha(r -\bar{r} + \max_{a'}Q(s',a') - Q(s,a))$$
where $\bar{r}$ is the mean reward per time step under the current policy. You can estimate this simply from the rewards seen so far; Sutton & Barto's differential Q-learning instead updates it from the same TD error, $\bar{r} \leftarrow \bar{r} + \beta\delta$, with its own step size $\beta$.
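As a minimal sketch, here is differential Q-learning on a hypothetical two-state continuing MDP (the environment, step sizes, and exploration rate are illustrative choices, not from any specific source): in each state, action 1 switches to the other state and action 0 stays, with a reward of 1 only when switching out of state 1. The optimal policy alternates between the states, so the optimal average reward per step is 0.5, and the learned estimate $\bar{r}$ should approach that value.

```python
import random

def step(state, action):
    """Hypothetical two-state continuing MDP: action 1 switches state,
    action 0 stays. Reward 1 only when switching out of state 1."""
    if action == 1:
        reward = 1.0 if state == 1 else 0.0
        return 1 - state, reward
    return state, 0.0

alpha = 0.1    # step size for Q values
beta = 0.01    # step size for the average-reward estimate r_bar
epsilon = 0.1  # exploration rate
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
r_bar = 0.0    # running estimate of average reward per time step

random.seed(0)
state = 0
for _ in range(20000):
    # epsilon-greedy action selection
    if random.random() < epsilon:
        action = random.choice((0, 1))
    else:
        action = max((0, 1), key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    # differential TD error: r - r_bar + max_a' Q(s',a') - Q(s,a)
    delta = (reward - r_bar
             + max(Q[(next_state, a)] for a in (0, 1))
             - Q[(state, action)])
    Q[(state, action)] += alpha * delta
    r_bar += beta * delta  # update the average-reward estimate from the TD error
    state = next_state

print(round(r_bar, 2))  # should be near 0.5, the optimal average reward
```

Note there is no discount factor anywhere: subtracting $\bar{r}$ from each reward is what keeps the differential values bounded in a continuing task.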
See Reinforcement Learning: An Introduction (Second Edition) chapter 10, sections 3 and 4 for a more thorough description and more examples.
- $\begingroup$ Thank you again, Neil! But I could not understand the intuition behind the differential TD error: how does this equation carry the average reward across thousands of states? Can you give me a simple example? $\endgroup$ – uğur yıldırım, Jul 7, 2019 at 18:22
- $\begingroup$ @uğuryıldırım: The average reward is not related to individual states. It is an average over all time steps. For more details and intuition, please read chapter 10 of Sutton & Barto. $\endgroup$ – Neil Slater, Jul 7, 2019 at 19:03