What is the Bellman equation update rule for average-reward reinforcement learning? I searched a few articles but could not find a practical answer.
1 Answer
In general, the average-reward setting replaces the discounted setting in continuing tasks. It relies on there being a long-term stable distribution of states under any particular policy (this property is called ergodicity), which will usually be true for continuing MDPs that do not have absorbing states.
If you see an update rule in the discounted setting that looks like this (for Q learning):
$$Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma\max_{a'}Q(s',a') - Q(s,a))$$
Then you replace the discounted TD error with the differential TD error:
$$Q(s,a) \leftarrow Q(s,a) + \alpha(r -\bar{r} + \max_{a'}Q(s',a') - Q(s,a))$$
where $\bar{r}$ is the mean reward per time step under the current policy. You can estimate this simply from the rewards seen so far; Sutton & Barto's differential Q-learning instead updates it from the same TD error, $\bar{r} \leftarrow \bar{r} + \beta\delta$, with its own step size $\beta$.
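As a minimal sketch, here is differential Q-learning on a hypothetical two-state continuing MDP (the environment, step sizes, and exploration rate are illustrative choices, not from any specific source): in each state, action 1 switches to the other state and action 0 stays, with a reward of 1 only when switching out of state 1. The optimal policy alternates between the states, so the optimal average reward per step is 0.5, and the learned estimate $\bar{r}$ should approach that value.

```python
import random

def step(state, action):
    """Hypothetical two-state continuing MDP: action 1 switches state,
    action 0 stays. Reward 1 only when switching out of state 1."""
    if action == 1:
        reward = 1.0 if state == 1 else 0.0
        return 1 - state, reward
    return state, 0.0

alpha = 0.1    # step size for Q values
beta = 0.01    # step size for the average-reward estimate r_bar
epsilon = 0.1  # exploration rate
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
r_bar = 0.0    # running estimate of average reward per time step

random.seed(0)
state = 0
for _ in range(20000):
    # epsilon-greedy action selection
    if random.random() < epsilon:
        action = random.choice((0, 1))
    else:
        action = max((0, 1), key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    # differential TD error: r - r_bar + max_a' Q(s',a') - Q(s,a)
    delta = (reward - r_bar
             + max(Q[(next_state, a)] for a in (0, 1))
             - Q[(state, action)])
    Q[(state, action)] += alpha * delta
    r_bar += beta * delta  # update the average-reward estimate from the TD error
    state = next_state

print(round(r_bar, 2))  # should be near 0.5, the optimal average reward
```

Note there is no discount factor anywhere: subtracting $\bar{r}$ from each reward is what keeps the differential values bounded in a continuing task.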
See Reinforcement Learning: An Introduction (Second Edition) chapter 10, sections 3 and 4 for a more thorough description and more examples.
- $\begingroup$ Thank you again, Neil! But I could not understand the intuition behind the differential TD error: how does this equation carry the average reward across thousands of states? Can you give me a simple example? $\endgroup$ – uğur yıldırım, Jul 7, 2019 at 18:22
- $\begingroup$ @uğuryıldırım: The average reward is not related to individual states. It is an average over all time steps. For more details and intuition, please read chapter 10 of Sutton & Barto. $\endgroup$ – Neil Slater, Jul 7, 2019 at 19:03