My Q-learning algorithm's value estimates keep diverging to infinity, which means the network weights diverge as well. I use a neural network as the function approximator for the Q-values.
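For concreteness, here is a minimal sketch of the kind of update I mean. PyTorch is used here purely for illustration; the layer sizes, names, and hyperparameters are placeholders rather than my actual setup:

```python
import torch
import torch.nn as nn

# Illustrative network: maps a state vector to one Q-value per action.
# (Sizes and names are placeholders, not my real architecture.)
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-5)  # the low learning rate mentioned below
gamma = 0.99

def td_update(state, action, reward, next_state, done):
    """One semi-gradient Q-learning step on a single transition."""
    q_pred = q_net(state)[action]                     # Q(s, a)
    with torch.no_grad():                             # target is held fixed during backprop
        q_next = 0.0 if done else q_net(next_state).max().item()
        target = reward + gamma * q_next              # reward + discount * max_a' Q(s', a')
    loss = (q_pred - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with a random 4-dimensional state:
# td_update(torch.randn(4), action=0, reward=1.0, next_state=torch.randn(4), done=False)
```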
I've tried:
- Clipping the TD target (reward + discount * max Q-value of the next state's actions) to the range [-50, 50]; see the sketch after this list
- Setting a low learning rate (0.00001; I use classic backpropagation to update the weights)
- Decreasing the values of the rewards
- Increasing the exploration rate
- Rescaling the inputs to the range 1~100 (previously they were in 0~1)
- Changing the discount rate
- Reducing the number of layers in the neural network (just as a sanity check)
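For the first item in the list, the clipping I mean is essentially the following (an illustrative helper; the ±50 bounds are the ones given above):

```python
# Clipped TD target from the first bullet: reward + discount * max_a' Q(s', a')
# is clamped to [-50, 50] before being used as the regression target.
TARGET_MIN, TARGET_MAX = -50.0, 50.0

def clipped_td_target(reward, gamma, max_next_q, done):
    raw = reward if done else reward + gamma * max_next_q
    return min(TARGET_MAX, max(TARGET_MIN, raw))
```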
I've heard that Q-learning is known to diverge with non-linear function approximators, but is there anything else I can try to stop the weights from diverging?