Neil Slater
In general, the average reward setting replaces the discounted setting in continuing tasks. It relies on there being a long-term stable distribution of states under any particular policy (this is called ergodicity), which will usually be true for continuing MDPs that don't have absorbing states.

If you see an update rule in the discounted setting that looks like this (for Q learning):

$$Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma\max_{a'}Q(s',a') - Q(s,a))$$

Then you replace the discounted TD error with the differential TD error:

$$Q(s,a) \leftarrow Q(s,a) + \alpha(r - \bar{r} + \max_{a'}Q(s',a') - Q(s,a))$$

where $\bar{r}$ is the mean reward per time step under the current policy. You can estimate this simply from the rewards seen so far.
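The update above can be sketched in code. This is a minimal tabular version, assuming a simple environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward)`) and using a running average of observed rewards as the estimate of $\bar{r}$, as described above; the function name and parameters are illustrative, not from any particular library.

```python
import random

def differential_q_learning(env, n_states, n_actions,
                            steps=10_000, alpha=0.1, beta=0.01,
                            epsilon=0.1, seed=0):
    """Tabular Q-learning with the differential (average-reward) TD error."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    r_bar = 0.0                      # running estimate of mean reward per step
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy behaviour policy
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s_next, r = env.step(a)
        # differential TD error: no discount factor gamma,
        # the reward is offset by the average-reward estimate r_bar
        delta = r - r_bar + max(Q[s_next]) - Q[s][a]
        Q[s][a] += alpha * delta
        # estimate r_bar simply from the rewards seen so far
        r_bar += beta * (r - r_bar)
        s = s_next
    return Q, r_bar
```

Note that without discounting the individual Q values can drift by a common offset; it is the *differences* between action values that stabilise and determine the greedy policy.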

> I searched a few articles, but could not find any practical answer.

See *Reinforcement Learning: An Introduction* (Second Edition), chapter 10, sections 3 and 4, for a more thorough description and more examples.
