Neil Slater
In general, the average reward setting replaces the discounted setting in continuing tasks. It relies on there being a long-term stable distribution of states under any particular policy (this is called ergodicity), which will usually be true for continuing MDPs that don't have absorbing states.

If you see an update rule in the discounted setting that looks like this (for Q learning):

$$Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma\max_{a'}Q(s',a') - Q(s,a))$$

Then you replace the discounted TD error with the differential TD error:

$$Q(s,a) \leftarrow Q(s,a) + \alpha(r - \bar{r} + \max_{a'}Q(s',a') - Q(s,a))$$

where $\bar{r}$ is the mean reward per time step under the current policy. You can estimate this simply from the rewards seen so far.
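The update above can be sketched in code. This is a minimal tabular version, assuming a simple environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward)`) and using a running average of observed rewards as the estimate of $\bar{r}$, as described above; the function name and parameters are illustrative, not from any particular library.

```python
import random

def differential_q_learning(env, n_states, n_actions,
                            steps=10_000, alpha=0.1, beta=0.01,
                            epsilon=0.1, seed=0):
    """Tabular Q-learning with the differential (average-reward) TD error."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    r_bar = 0.0                      # running estimate of mean reward per step
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy behaviour policy
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s_next, r = env.step(a)
        # differential TD error: no discount factor gamma,
        # the reward is offset by the average-reward estimate r_bar
        delta = r - r_bar + max(Q[s_next]) - Q[s][a]
        Q[s][a] += alpha * delta
        # estimate r_bar simply from the rewards seen so far
        r_bar += beta * (r - r_bar)
        s = s_next
    return Q, r_bar
```

Note that without discounting the individual Q values can drift by a common offset; it is the *differences* between action values that stabilise and determine the greedy policy.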

> I searched a few articles, but could not find any practical answer.

See *Reinforcement Learning: An Introduction* (Second Edition), chapter 10, sections 3 and 4, for a more thorough description and more examples.
