Why does Q Learning diverge?

Question

My Q-Learning algorithm's state values keep on diverging to infinity, which means my weights are diverging too. I use a neural network for my value-mapping.

I've tried:

Clipping the "reward + discount * maximum value of action" (max/min set to 50/-50)
Setting a low learning rate (0.00001 and I use the classic Backpropagation for updating the weights)
Decreasing the values of the rewards
Increasing the exploration rate
Normalizing the inputs to between 1~100 (previously it was 0~1)
Change the discount rate
Decrease the layers of the neural network (just for validation)

I've heard that Q Learning is known to diverge on non-linear input, but are there anything else that I can try to stop the divergence of the weights?

Update #1 on August 14th, 2017:

I've decided to add some specific details on what I'm doing right now due to a request to.

I'm currently trying to make an agent learn how to fight in a top-down view of a shooting game. The opponent is a simple bot which moves stochastically.

Each character has 9 actions to choose from on each turn:

move up
move down
move left
move right
shoot a bullet upwards
shoot a bullet downwards
shoot a bullet to the left
shoot a bullet to the right
do nothing

The rewards are:

if agent hits the bot with a bullet, +100 (I've tried many different values)
if agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)
if the agent tries to fire a bullet while bullets can't be fired(ex. when the agent just fired a bullet, etc. ), -25(Not necessary but I wanted the agent to be more efficient)
if the bot tries to go out of the arena, -20(Not necessary too but I wanted the agent to be more efficient)

The inputs for the neural network are:

Distance between the agent and the bot on the X axis normalized to 0~100
Distance between the agent and the bot on the Y axis normalized to 0~100
Agent's x and y positions
Bot's x and y positions
Bot's bullet position. If the bot didn't fire a bullet, the parameters are set to the x and y positions of the bot.

I've also fiddled with the inputs too; I tried adding new features like the x value of the agent's position(not the distance but the actual position)and the position of the bot's bullet. None of them worked.

Here's the code:

from pygame import * from pygame.locals import * import sys from time import sleep import numpy as np import random import tensorflow as tf from pylab import savefig from tqdm import tqdm #Screen Setup disp_x, disp_y = 1000, 800 arena_x, arena_y = 1000, 800 border = 4; border_2 = 1 #Color Setup white = (255, 255, 255); aqua= (0, 200, 200) red = (255, 0, 0); green = (0, 255, 0) blue = (0, 0, 255); black = (0, 0, 0) green_yellow = (173, 255, 47); energy_blue = (125, 249, 255) #Initialize character positions init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50] init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50] #Setup character dimentions character_size = 50 character_move_speed = 25 #Initialize character stats character_init_health = 100 #initialize bullet stats beam_damage = 10 beam_width = 10 beam_ob = -100 #The Neural Network input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32) weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1)) #weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1)) #The calculations, loss function and the update model Q = tf.matmul(input_layer, weight_1) predict = tf.argmax(Q, 1) next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32) loss = tf.reduce_sum(tf.square(next_Q - Q)) trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001) updateModel = trainer.minimize(loss) initialize = tf.global_variables_initializer() jList = [] rList = [] init() font.init() myfont = font.SysFont('Comic Sans MS', 15) myfont2 = font.SysFont('Comic Sans MS', 150) myfont3 = font.SysFont('Gothic', 30) disp = display.set_mode((disp_x, disp_y), 0, 32) #CHARACTER/BULLET PARAMETERS agent_x = agent_y = int() bot_x = bot_y = int() agent_hp = bot_hp = int() bot_beam_dir = int() agent_beam_fire = bot_beam_fire = bool() agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int() agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int() bot_current_action = agent_current_action = int() def param_init(): """Initializes parameters""" global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1] bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1] agent_hp = bot_hp = character_init_health agent_beam_fire = bot_beam_fire = False agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0 def screen_blit(): global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x, \ agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue, \ agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width disp.fill(aqua) draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y / 2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2)) draw.rect(disp, green, (disp_x / 2 - arena_x / 2, disp_y / 2 - arena_y / 2, arena_x, arena_y)) if bot_beam_fire == True: draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y)) bot_beam_fire = False if agent_beam_fire == True: draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y)) agent_beam_fire = False draw.rect(disp, red, (agent_x, agent_y, character_size, character_size)) draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size)) draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 + border + 1, float(agent_hp) / float(character_init_health) * 100, 14)) draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 + border + 1, float(bot_hp) / float(character_init_health) * 100, 14)) def bot_take_action(): return random.randint(1, 9) def beam_hit_detector(player): global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x, \ bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y, \ bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size if player == "bot": if bot_current_action == 1: if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size): return True else: return False elif bot_current_action == 2: if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size): return True else: return False elif bot_current_action == 3: if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size): return True else: return False elif bot_current_action == 4: if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size): return True else: return False else: if agent_current_action == 1: if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size): return True else: return False elif agent_current_action == 2: if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size): return True else: return False elif agent_current_action == 3: if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size): return True else: return False elif bot_current_action == 4: if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size): return True else: return False def mapping(maximum, number): return number#int(number * maximum) def action(agent_action, bot_action): global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, \ bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, \ agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size agent_current_action = agent_action; bot_current_action = bot_action reward = 0; cont = True; successful = False; winner = "" if 1 <= bot_action <= 4: bot_beam_fire = True if bot_action == 1: bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2 bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2 elif bot_action == 2: bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2 bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width elif bot_action == 3: bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size elif bot_action == 4: bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2 bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width elif 5 <= bot_action <= 8: if bot_action == 5: bot_y -= character_move_speed if bot_y <= disp_y/2 - arena_y/2: bot_y = disp_y/2 - arena_y/2 elif agent_y <= bot_y <= agent_y + character_size: bot_y = agent_y + character_size elif bot_action == 6: bot_x += character_move_speed if bot_x >= disp_x/2 + arena_x/2 - character_size: bot_x = disp_x/2 + arena_x/2 - character_size elif agent_x <= bot_x + character_size <= agent_x + character_size: bot_x = agent_x - character_size elif bot_action == 7: bot_y += character_move_speed if bot_y + character_size >= disp_y/2 + arena_y/2: bot_y = disp_y/2 + arena_y/2 - character_size elif agent_y <= bot_y + character_size <= agent_y + character_size: bot_y = agent_y - character_size elif bot_action == 8: bot_x -= character_move_speed if bot_x <= disp_x/2 - arena_x/2: bot_x = disp_x/2 - arena_x/2 elif agent_x <= bot_x <= agent_x + character_size: bot_x = agent_x + character_size if bot_beam_fire == True: if beam_hit_detector("bot"): #print "Agent Got Hit!" agent_hp -= beam_damage reward += -50 bot_beam_size_x = bot_beam_size_y = 0 bot_beam_x = bot_beam_y = beam_ob if agent_hp <= 0: cont = False winner = "Bot" if 1 <= agent_action <= 4: agent_beam_fire = True if agent_action == 1: if agent_y > disp_y/2 - arena_y/2: agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2 agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2 else: reward += -25 elif agent_action == 2: if agent_x + character_size < disp_x/2 + arena_x/2: agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2 agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width else: reward += -25 elif agent_action == 3: if agent_y + character_size < disp_y/2 + arena_y/2: agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size else: reward += -25 elif agent_action == 4: if agent_x > disp_x/2 - arena_x/2: agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2 agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width else: reward += -25 elif 5 <= agent_action <= 8: if agent_action == 5: agent_y -= character_move_speed if agent_y <= disp_y/2 - arena_y/2: agent_y = disp_y/2 - arena_y/2 reward += -5 elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size: agent_y = bot_y + character_size reward += -2 elif agent_action == 6: agent_x += character_move_speed if agent_x + character_size >= disp_x/2 + arena_x/2: agent_x = disp_x/2 + arena_x/2 - character_size reward += -5 elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size: agent_x = bot_x - character_size reward += -2 elif agent_action == 7: agent_y += character_move_speed if agent_y + character_size >= disp_y/2 + arena_y/2: agent_y = disp_y/2 + arena_y/2 - character_size reward += -5 elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size: agent_y = bot_y - character_size reward += -2 elif agent_action == 8: agent_x -= character_move_speed if agent_x <= disp_x/2 - arena_x/2: agent_x = disp_x/2 - arena_x/2 reward += -5 elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size: agent_x = bot_x + character_size reward += -2 if agent_beam_fire == True: if beam_hit_detector("agent"): #print "Bot Got Hit!" bot_hp -= beam_damage reward += 50 agent_beam_size_x = agent_beam_size_y = 0 agent_beam_x = agent_beam_y = beam_ob if bot_hp <= 0: successful = True cont = False winner = "Agent" return reward, cont, successful, winner def bot_beam_dir_detector(): global bot_current_action if bot_current_action == 1: bot_beam_dir = 2 elif bot_current_action == 2: bot_beam_dir = 4 elif bot_current_action == 3: bot_beam_dir = 3 elif bot_current_action == 4: bot_beam_dir = 1 else: bot_beam_dir = 0 return bot_beam_dir #Parameters y = 0.75 e = 0.3 num_episodes = 10000 batch_size = 10 complexity = 100 with tf.Session() as sess: sess.run(initialize) success = 0 for i in tqdm(range(1, num_episodes)): #print "Episode #", i rAll = 0; d = False; c = True; j = 0 param_init() samples = [] while c == True: j += 1 current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)), mapping(complexity, float(agent_y) / float(arena_y)), mapping(complexity, float(bot_x) / float(arena_x)), mapping(complexity, float(bot_y) / float(arena_y)), #mapping(complexity, float(agent_hp) / float(character_init_health)), #mapping(complexity, float(bot_hp) / float(character_init_health)), mapping(complexity, float(agent_x - bot_x) / float(arena_x)), mapping(complexity, float(agent_y - bot_y) / float(arena_y)), bot_beam_dir ]]) b = bot_take_action() if np.random.rand(1) < e or i <= 5: a = random.randint(0, 8) else: a, _ = sess.run([predict, Q],feed_dict={input_layer : current_state}) r, c, d, winner = action(a + 1, b) bot_beam_dir = bot_beam_dir_detector() next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)), mapping(complexity, float(agent_y) / float(arena_y)), mapping(complexity, float(bot_x) / float(arena_x)), mapping(complexity, float(bot_y) / float(arena_y)), #mapping(complexity, float(agent_hp) / float(character_init_health)), #mapping(complexity, float(bot_hp) / float(character_init_health)), mapping(complexity, float(agent_x - bot_x) / float(arena_x)), mapping(complexity, float(agent_y - bot_y) / float(arena_y)), bot_beam_dir ]]) samples.append([current_state, a, r, next_state]) if len(samples) > 10: for count in xrange(batch_size): [batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)] batch_allQ = sess.run(Q, feed_dict={input_layer : batch_current_state}) batch_Q1 = sess.run(Q, feed_dict = {input_layer : batch_next_state}) batch_maxQ1 = np.max(batch_Q1) batch_targetQ = batch_allQ batch_targetQ[0][a] = reward + y * batch_maxQ1 sess.run([updateModel], feed_dict={input_layer : batch_current_state, next_Q : batch_targetQ}) rAll += r screen_blit() if d == True: e = 1. / ((i / 50) + 10) success += 1 break #print agent_hp, bot_hp display.update() jList.append(j) rList.append(rAll) print winner

I'm pretty sure that if you have pygame and Tensorflow and matplotlib installed in a python environment you should be able to see the animations of the bot and the agent "fighting".

I digressed in the update, but it would be awesome if somebody could also address my specific problem along with the original general problem.

Thanks!

Update #2 on August 18, 2017:

Based on the advice of @NeilSlater, I've implemented experience replay into my model. The algorithm has improved, but I'm going to look for more better improvement options that offer convergence.

Update #3 on August 22, 2017:

I've noticed that if the agent hits the bot with a bullet on a turn and the action the bot taken on that turn was not "fire a bullet", then the wrong actions would be given credit. Thus, I've turned the bullets into beams so the bot/agent takes damage on the turn the beam's fired.

Are you using experience replay and bootstrapping values from a "frozen" copy of recent network? These are approaches used in DQN - they are not guaranteed though they may be necessary for stability. Are you using a Q($\lambda$) algorithm, or just single-step Q-learning? Can you give some indication of what your environment and reward scheme is like? Single-step Q-learning will do poorly when rewards are sparse e.g. final +1 or -1 reward at end of long episode. — Neil Slater
– Neil Slater, Commented Aug 11, 2017 at 7:27
OK, from your update, I immediately suggest you need experience replay and probably also alternating networks for bootstrapping, because these are stabilising influences on reinforcement learning with non-linear approximators. I'm happy to talk through that in detail, and take a look at your project code to show an example, but might take a day or two to get back to you with that level of detail,. — Neil Slater
– Neil Slater, Commented Aug 11, 2017 at 15:41
I have got the code running and if I am correct in understanding it, the bullets can be "steered" by the agent selection from actions 1-4 each turn, i.e. the bullet can be moved around in any direction whilst the agent stays still. Is that intentional? The bot doesn't do this because it only fires when aligned on the grid to the agent, and always picks the same direction if it does so. — Neil Slater
– Neil Slater, Commented Aug 11, 2017 at 20:05
Almost right, but you don't store the bootstrapped value, instead re-calculate it when the step is sampled later. For each action taken, you store the four things: State, Action, Next State, Reward. Then you take a small mini-batch (1 per step is fine, but more e.g. 10 is typical) from this list and for Q-learning calculate the new max action and its value to create the supervised learning mini-batch (also called the TD target). — Neil Slater
– Neil Slater, Commented Aug 12, 2017 at 16:42
That should be "frozen copy of the approximator (i.e. the neural network" (if the quote is from one of my comments or answers, please point me at it and I will correct it. It's very simple - just keep two copies of the weight params $\mathbf{w}$, the "live" one that you update, and a "recent old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target e.g. $R + \gamma \text{max}_{a'} \hat{q}(S',a',\mathbf{w})$ then use the "old" copy to calculate $\hat{q}$, but then train the "live" one with those values. — Neil Slater
– Neil Slater, Commented Aug 15, 2017 at 7:13

Stephen Rauch · Accepted Answer · 2018-03-30 20:40:24Z

If your weights are diverging then your optimizer or your gradients aren't behaving well. A common reason for diverging weights is exploding gradients, which can result from:

too many layers, or
too many recurrent cycles if you're using an RNN.

You can verify if you have exploding gradients as follows:

grad_magnitude = tf.reduce_sum([tf.reduce_sum(g**2) for g in tf.gradients(loss, weights_list)])**0.5

Some approaches to solving the problem of exploding gradients are:

Use RELU or ELU activations
Use Xavier initialization
Use a Deep Residual architecture. This will keep the gradients from being squished by subsequent layers.

Can this happen even with two hidden dense layer in a simple feedforward net? — Alexbrini
– Alexbrini, Commented Jul 25, 2020 at 0:23
Yes, your weights will diverge if the learning rate is too high — Default picture
– Default picture, Commented Jul 25, 2020 at 13:17

safetyduck · Accepted Answer · 2019-04-10 11:33:32Z

If you are using a fixed point iteration to solve Bellman, it might not only be degenerate but also might have attractors at infinity or orbits. Dig into the problem you are solving and understand it deeply. Have a look at control theory. RL folks tend not to write about this as much.

Stack Exchange Network

Why does Q Learning diverge?

2 Answers 2

Hot Network Questions

Why does Q Learning diverge?

2 Answers 2

Related

Hot Network Questions