maurock
Setting

I am trying to learn a specific physical quantity (radiance) inside a 3D scene with Deep Q-Learning. To give a quick overview: my agent shoots rays inside the scene, and the reward is the irradiance at the points hit. This means a reward is given only when a light source is hit, which happens only 1% of the time. This leads to a very sparse reward function.

My state is the tuple of 3D spatial coordinates inside the scene, and my actions are the discrete directions the agent can use to scatter rays. The q-values represent this physical quantity for each specific action (direction).
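As a minimal illustration of this setup (the network sizes and the number of directions are my own assumptions, not from the question), the state/action structure corresponds to a Q-network that maps a 3D coordinate to one q-value per discrete scattering direction:

```python
import numpy as np

N_ACTIONS = 16  # assumed number of discrete scattering directions

rng = np.random.default_rng(0)

# Tiny two-layer MLP: (x, y, z) -> one q-value per direction.
W1 = rng.normal(0, 0.1, size=(3, 32))
b1 = np.zeros(32)
W2 = rng.normal(0, 0.1, size=(32, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    """state: array of shape (3,) holding the 3D hit-point coordinates."""
    h = np.maximum(state @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2                    # one q-value per direction

q = q_values(np.array([0.2, -0.5, 1.0]))
print(q.shape)  # (16,)
```

The real network would of course be trained, but the shape of the problem is the same: a continuous 3D input and a fixed-size vector of q-values over directions.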

Problem

I expect the q-values to be higher for the first 10 actions and then to decrease slightly, which would reflect the physics of my system. When training starts, this is indeed the case:

[figure: q-values per action early in training]

After some episodes, the q-value of one action starts to spike, as seen in the figure below. This does not reflect the physics of the environment, in which the incoming radiance should be distributed over all the actions.

[figure: q-values per action after some episodes, with one action spiking]

EDIT:
The trend lines for the q-values are shown below; each line is the q-value of one action. During the last iterations, the q-values of two specific actions explode.

[figure: q-value trend lines per action over training iterations]
I know about target networks and experience replay (both uniform and prioritized, PER), but these did not solve my issue.

Are there any methods I can try to deal with this problem? Is this a case of vanishing/exploding gradients? Thanks
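For reference, one common way run-away q-values are tamed (this is a generic sketch of techniques I am suggesting, not the asker's code: a Double-DQN style target plus a Huber loss, with assumed constants) is to let the online network pick the next action while the frozen target network evaluates it, and to bound the gradient of large TD errors:

```python
import numpy as np

GAMMA = 0.9        # assumed discount factor
HUBER_DELTA = 1.0  # errors beyond this are penalised linearly, not quadratically

def td_target(reward, next_q_online, next_q_target, done):
    """Double-DQN target: the online net picks the action, the frozen
    target net evaluates it. Decoupling selection from evaluation
    reduces the over-estimation that can make one action's q-value explode."""
    best = int(np.argmax(next_q_online))
    return reward + (0.0 if done else GAMMA * next_q_target[best])

def huber(td_error, delta=HUBER_DELTA):
    """Huber loss: quadratic near zero, linear for large errors, so the
    rare, large irradiance rewards produce bounded gradients."""
    a = abs(td_error)
    return 0.5 * a**2 if a <= delta else delta * (a - 0.5 * delta)

# Example: a sparse light-source reward, online and target nets disagreeing.
y = td_target(reward=5.0,
              next_q_online=np.array([0.2, 0.9, 0.1]),
              next_q_target=np.array([0.3, 0.4, 0.2]),
              done=False)
print(round(y, 2))  # online argmax -> action 1; target evaluates it: 5 + 0.9*0.4 = 5.36
```

Gradient-norm clipping on the network update is another frequently paired mitigation when individual q-values blow up late in training.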
