What is the bellman equation update rule for the average reward reinforcement learning? I searched a few articles, but could not find any practical answer.


1 Answer

In general, the average reward setting replaces the discounted setting in continuing tasks. It relies on there being a long-term stable distribution of states under any particular policy (this property is called ergodicity), which will usually be true for continuing MDPs that don't have absorbing states.
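For reference, the quantity a policy $\pi$ is evaluated by in this setting is the long-run average reward per time step (this is the standard definition from Sutton & Barto, chapter 10):

$$\bar{r}(\pi) \doteq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ R_t \mid A_{0:t-1} \sim \pi \right]$$

Under ergodicity this limit exists and does not depend on the starting state, which is why a single scalar $\bar{r}$ can stand in for the discount factor.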

If you see an update rule in the discounted setting that looks like this (for Q learning):

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma\max_{a'}Q(s',a') - Q(s,a)\right)$$

Then you replace the discounted TD error by the differential TD error:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(r - \bar{r} + \max_{a'}Q(s',a') - Q(s,a)\right)$$

where $\bar{r}$ is the mean reward per time step under the current policy. You can estimate it simply as a running average of the rewards seen so far.
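As a concrete sketch, here is a minimal tabular version of this update rule. The `env` interface (`reset`, `step`, `actions`) is an assumption for illustration, not a real library API; `beta` is a separate step size for the running average of $\bar{r}$:

```python
import random
from collections import defaultdict

def differential_q_learning(env, alpha=0.1, beta=0.01, epsilon=0.1, steps=10000):
    """Tabular Q-learning with the differential TD error (average reward setting).

    Assumes `env` exposes: `reset() -> state`, `step(action) -> (next_state, reward)`
    (a continuing task, so there are no terminal states), and `actions`, a list
    of available actions.
    """
    Q = defaultdict(float)   # Q[(state, action)], initialised to 0
    r_bar = 0.0              # running estimate of the average reward per step
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(env.actions)
        else:
            a = max(env.actions, key=lambda act: Q[(s, act)])
        s_next, r = env.step(a)
        # differential TD error: no discount factor, reward measured
        # relative to the average reward estimate r_bar
        delta = r - r_bar + max(Q[(s_next, a2)] for a2 in env.actions) - Q[(s, a)]
        Q[(s, a)] += alpha * delta
        # simple running average of the rewards seen so far
        r_bar += beta * (r - r_bar)
        s = s_next
    return Q, r_bar
```

Note that Sutton & Barto's differential Q-learning instead updates the estimate with `r_bar += beta * delta`; the plain running average above matches the "estimate from the rewards seen so far" suggestion and is easier to reason about, but is slightly biased while the policy is still changing.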

> I searched a few articles, but could not find any practical answer.

See *Reinforcement Learning: An Introduction* (Second Edition) by Sutton and Barto, chapter 10, sections 3 and 4, for a more thorough description and more examples.

  • Thank you again, Neil! But I could not understand the intuition behind the differential TD error. How does this equation carry the average reward across thousands of states? Can you give me a simple example? Commented Jul 7, 2019 at 18:22
  • @uğuryıldırım: The average reward is not related to individual states. It is an average over all time steps. For more details and intuition, please read chapter 10 of Sutton & Barto. Commented Jul 7, 2019 at 19:03
