maurock
Setting

I am trying to learn a specific physical quantity (radiance) inside a 3D scene with Deep Q-Learning. To give a quick overview: my agent shoots rays inside the scene, and the reward is the irradiance at the points hit. This means a reward is given only when a light source is hit, which happens only 1% of the time. This leads to a very sparse reward function.

My state is the tuple of 3D spatial coordinates inside the scene, and my actions are the discrete directions the agent can use to scatter rays. The q-values represent this physical quantity for each specific action (direction).
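As a minimal illustration of this setup (the network sizes and the number of directions are my own assumptions, not from the question), the state/action structure corresponds to a Q-network that maps a 3D coordinate to one q-value per discrete scattering direction:

```python
import numpy as np

N_ACTIONS = 16  # assumed number of discrete scattering directions

rng = np.random.default_rng(0)

# Tiny two-layer MLP: (x, y, z) -> one q-value per direction.
W1 = rng.normal(0, 0.1, size=(3, 32))
b1 = np.zeros(32)
W2 = rng.normal(0, 0.1, size=(32, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    """state: array of shape (3,) holding the 3D hit-point coordinates."""
    h = np.maximum(state @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2                    # one q-value per direction

q = q_values(np.array([0.2, -0.5, 1.0]))
print(q.shape)  # (16,)
```

The real network would of course be trained, but the shape of the problem is the same: a continuous 3D input and a fixed-size vector of q-values over directions.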

Problem

I expect the q-values to be higher for the first 10 actions and then to decrease slightly, which would reflect the physics of my system. When training starts, this is indeed the case:

[figure: q-values per action early in training]

After some episodes, the q-value of one action starts to spike, as seen in the figure below. This does not reflect the physics of the environment, in which the incoming radiance should be distributed over all the actions.

[figure: q-values per action after some episodes, with one action spiking]

EDIT:
The trend lines for the q-values are shown below; each line is the q-value of one action. During the last iterations, the q-values of two specific actions explode.

[figure: q-value trend lines per action over training iterations]
I know about target networks and experience replay (both uniform and prioritized, PER), but these did not solve my issue.

Are there any methods I can try to deal with this problem? Is this a case of vanishing/exploding gradients? Thanks
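For reference, one common way run-away q-values are tamed (this is a generic sketch of techniques I am suggesting, not the asker's code: a Double-DQN style target plus a Huber loss, with assumed constants) is to let the online network pick the next action while the frozen target network evaluates it, and to bound the gradient of large TD errors:

```python
import numpy as np

GAMMA = 0.9        # assumed discount factor
HUBER_DELTA = 1.0  # errors beyond this are penalised linearly, not quadratically

def td_target(reward, next_q_online, next_q_target, done):
    """Double-DQN target: the online net picks the action, the frozen
    target net evaluates it. Decoupling selection from evaluation
    reduces the over-estimation that can make one action's q-value explode."""
    best = int(np.argmax(next_q_online))
    return reward + (0.0 if done else GAMMA * next_q_target[best])

def huber(td_error, delta=HUBER_DELTA):
    """Huber loss: quadratic near zero, linear for large errors, so the
    rare, large irradiance rewards produce bounded gradients."""
    a = abs(td_error)
    return 0.5 * a**2 if a <= delta else delta * (a - 0.5 * delta)

# Example: a sparse light-source reward, online and target nets disagreeing.
y = td_target(reward=5.0,
              next_q_online=np.array([0.2, 0.9, 0.1]),
              next_q_target=np.array([0.3, 0.4, 0.2]),
              done=False)
print(round(y, 2))  # online argmax -> action 1; target evaluates it: 5 + 0.9*0.4 = 5.36
```

Gradient-norm clipping on the network update is another frequently paired mitigation when individual q-values blow up late in training.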
