nedward

Why does Q Learning diverge?

My Q-learning algorithm's state-action values keep diverging to infinity, which means my weights are diverging too. I use a neural network as the function approximator for the Q-values.

I've tried:

  • Clipping the TD target ("reward + discount * maximum next-state Q value") to the range -50~50
  • Setting a low learning rate (0.00001, with classic backpropagation for the weight updates)
  • Decreasing the magnitudes of the rewards
  • Increasing the exploration rate
  • Normalizing the inputs to 1~100 (previously 0~1)
  • Changing the discount rate
  • Reducing the number of layers in the neural network (just as a sanity check)
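For reference, the target-clipping in the first bullet looks roughly like this (a minimal numpy sketch; the function name and arguments are illustrative, not lifted from my actual code):

```python
import numpy as np

def clipped_td_target(reward, discount, next_q_values, clip_min=-50.0, clip_max=50.0):
    """Compute the TD target reward + discount * max_a' Q(s', a'),
    then clip it into [clip_min, clip_max]."""
    target = reward + discount * np.max(next_q_values)
    return float(np.clip(target, clip_min, clip_max))
```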

I've heard that Q-learning is known to diverge with non-linear function approximators, but is there anything else I can try to stop the weights from diverging?

Update:

As requested, here are some specific details on what I'm doing right now.

I'm currently trying to make an agent learn to fight in a top-down shooting game. The opponent is a simple bot that moves stochastically.

Each character has 9 actions to choose from on each turn:

  • move up
  • move down
  • move left
  • move right
  • shoot a bullet upwards
  • shoot a bullet downwards
  • shoot a bullet to the left
  • shoot a bullet to the right
  • do nothing

The rewards are:

  • If the agent hits the bot with a bullet: +100 (I've tried many different values)

  • If the agent gets hit by a bullet shot by the bot: -50 (again, I've tried many different values)

  • If the agent tries to fire a bullet while bullets can't be fired (e.g. right after it just fired one): -25 (not strictly necessary, but I wanted the agent to be more efficient)

  • If the agent tries to move out of the arena: -20 (also not strictly necessary, but I wanted the agent to be more efficient)
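As a self-contained sketch, the reward scheme above boils down to the following (the function and flag names are illustrative, not from my actual code, which computes these inline):

```python
def step_reward(hit_bot, got_hit, fired_while_blocked, left_arena):
    """Return the per-step reward for the events listed above,
    checked in priority order (a hit outweighs everything else)."""
    if hit_bot:
        return 100
    if got_hit:
        return -50
    if fired_while_blocked:
        return -25
    if left_arena:
        return -20
    return 0
```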

The inputs for the neural network are:

  • Distance between the agent and the bot on the x axis, normalized to 0~100

  • Distance between the agent and the bot on the y axis, normalized to 0~100

  • Agent's x and y positions

  • Bot's x and y positions

  • Bot's bullet's x and y positions (if the bot hasn't fired a bullet, these are set to the bot's own x and y positions)
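Assembling that input vector looks roughly like this (a sketch; the helper name is illustrative, and the 800-pixel arena size comes from my code below):

```python
import numpy as np

def build_state(agent_xy, bot_xy, bullet_xy, arena=800, scale=100.0):
    """Build the 1x8 network input described above, normalized to 0~100.
    If the bot has no live bullet, pass the bot's own position as bullet_xy."""
    ax, ay = agent_xy
    bx, by = bot_xy
    ux, uy = bullet_xy
    feats = [ax, ay, bx, by, ux, uy, abs(ax - bx), abs(ay - by)]
    return np.array([[f / arena * scale for f in feats]], dtype=np.float32)
```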

I've also fiddled with the inputs; I tried adding new features such as the agent's actual x position (not just the distance) and the position of the bot's bullet. None of them helped.

Here's the code:

from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt  # needed for the plot at the end

#Screen setup
disp_x, disp_y = 1500, 1000
arena_x, arena_y = 800, 800
border = 4; border_2 = 1

#Color setup
white = (255, 255, 255); aqua = (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Set up character dimensions
character_radius = 30
character_move_speed = 20

#Initialize character stats
character_init_health = 100

#Initialize bullet stats
bullet_speed = 50
bullet_damage = 10
bullet_radius = 7
bullet_a_pos = list(init_character_a_state); bullet_b_pos = list(init_character_b_state)
bullet_a_fire = False; bullet_b_fire = False

#The neural network
input_layer = tf.placeholder(shape=[1, 8], dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([8, 9], 0, 0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1, 9], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.0001)
updateModel = trainer.minimize(loss)
initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)
bullet_ob = -100  #off-screen "parking" position for inactive bullets

#Character/bullet parameters
bot_bullet_x = bot_bullet_y = bullet_ob
agent_bullet_x = agent_bullet_y = bullet_ob
last_bot_bullet_x = last_bot_bullet_y = bullet_ob
last_agent_bullet_x = last_agent_bullet_y = bullet_ob
agent_bullet_fire = bot_bullet_fire = bool()
agent_bullet_direction_x = agent_bullet_direction_y = int()
bot_bullet_direction_x = bot_bullet_direction_y = int()
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()


def param_init():
    """Initializes parameters at the start of each episode"""
    global bot_bullet_x, bot_bullet_y, agent_bullet_x, agent_bullet_y, agent_bullet_fire, \
        bot_bullet_fire, agent_bullet_direction_x, agent_bullet_direction_y, bot_bullet_direction_x, \
        bot_bullet_direction_y, agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp
    agent_bullet_x = agent_bullet_y = bullet_ob
    bot_bullet_x = bot_bullet_y = bullet_ob
    last_agent_bullet_x = last_agent_bullet_y = bullet_ob
    last_bot_bullet_x = last_bot_bullet_y = bullet_ob
    agent_bullet_fire = bot_bullet_fire = False
    agent_bullet_direction_x = 0; agent_bullet_direction_y = 0
    bot_bullet_direction_x = 0; bot_bullet_direction_y = 0
    agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
    bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
    agent_hp = bot_hp = character_init_health


def screen_blit():
    """Draws the arena, the characters, the bullets and the health bars"""
    global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, agent_bullet_x, \
        agent_bullet_y, bullet_radius, bot_bullet_x, bot_bullet_y, character_radius, agent_x, \
        agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black
    disp.fill(aqua)
    draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y / 2 - arena_y / 2 - border,
                            arena_x + border * 2, arena_y + border * 2))
    draw.rect(disp, green, (disp_x / 2 - arena_x / 2, disp_y / 2 - arena_y / 2, arena_x, arena_y))
    draw.circle(disp, black, [agent_bullet_x, agent_bullet_y], bullet_radius)
    draw.circle(disp, black, [bot_bullet_x, bot_bullet_y], bullet_radius)
    draw.circle(disp, black, (agent_x, agent_y), character_radius + border_2)
    draw.circle(disp, red, (agent_x, agent_y), character_radius)
    draw.circle(disp, black, (bot_x, bot_y), character_radius + border_2)
    draw.circle(disp, blue, (bot_x, bot_y), character_radius)
    draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 + border + 1,
                          float(agent_hp) / float(character_init_health) * 100, 14))
    draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 + border + 1,
                           float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
    """Picks the bot's action: mostly shoot/chase the agent, occasionally act randomly"""
    global agent_x, agent_y, bot_x, bot_y, character_radius, bot_action, border
    if agent_x - character_radius - border <= bot_x <= agent_x + character_radius + border:
        if random.randint(0, 100) > 5:
            if agent_y <= bot_y:
                bot_action = 1
            else:
                bot_action = 3
        else:
            bot_action = 9
    elif agent_y - character_radius <= bot_y <= agent_y + character_radius:
        if random.randint(0, 100) > 5:
            if agent_x <= bot_x:
                bot_action = 4
            else:
                bot_action = 2
        else:
            bot_action = 9
    else:
        if random.randint(0, 100) > 5:
            x_dist = abs(bot_x - agent_x); y_dist = abs(bot_y - agent_y)
            if x_dist >= y_dist:
                if bot_x - agent_x <= 0:
                    bot_action = 6
                else:
                    bot_action = 8
            else:
                if bot_y - agent_y <= 0:
                    bot_action = 7
                else:
                    bot_action = 5
        else:
            bot_action = random.randint(1, 9)


def bullet_hit_detector(player):
    """Returns True if player's bullet hit the other character this turn
    (including bullets that 'passed through' the character since last turn)"""
    global bot_bullet_x, bot_bullet_y, last_bot_bullet_x, last_bot_bullet_y, agent_x, agent_y, \
        last_agent_bullet_x, last_agent_bullet_y, character_radius, border, bullet_radius
    if player == "bot":
        if bot_bullet_x == last_bot_bullet_x:
            if agent_x - character_radius - border < bot_bullet_x + bullet_radius < agent_x + character_radius + border or \
                    agent_x - character_radius - border < bot_bullet_x - bullet_radius < agent_x + character_radius + border:
                #If the current state of the bullet is touching/inside the agent:
                if agent_y - character_radius - border < bot_bullet_y + bullet_radius < agent_y + character_radius or \
                        agent_y - character_radius < bot_bullet_y - bullet_radius < agent_y + character_radius + border:
                    return True
                #If the bullet "passed through" the character from the last turn:
                elif (last_bot_bullet_y - bullet_radius > agent_y + character_radius + border and
                      agent_y - character_radius - border > bot_bullet_y + bullet_radius) \
                        or (bot_bullet_y - bullet_radius > agent_y + character_radius + border and
                            agent_y - character_radius - border > last_bot_bullet_y + bullet_radius):
                    return True
                else:
                    return False
        elif bot_bullet_y == last_bot_bullet_y:
            if agent_y - character_radius - border < bot_bullet_y - bullet_radius < agent_y + character_radius + border or \
                    agent_y - character_radius - border < bot_bullet_y + bullet_radius < agent_y + character_radius + border:
                #If the current state of the bullet is touching/inside the agent:
                if agent_x - character_radius - border < bot_bullet_x + bullet_radius < agent_x + character_radius or \
                        agent_x - character_radius < bot_bullet_x - bullet_radius < agent_x + character_radius + border:
                    return True
                #If the bullet "passed through" the character from the last turn:
                elif (last_bot_bullet_x - bullet_radius > agent_x + character_radius + border and
                      agent_x - character_radius - border > bot_bullet_x + bullet_radius) \
                        or (bot_bullet_x - bullet_radius > agent_x + character_radius + border and
                            agent_x - character_radius - border > last_bot_bullet_x + bullet_radius):
                    return True
                else:
                    return False
    else:
        if agent_bullet_x == last_agent_bullet_x:
            if bot_x - character_radius - border < agent_bullet_x + bullet_radius < bot_x + character_radius + border or \
                    bot_x - character_radius - border < agent_bullet_x - bullet_radius < bot_x + character_radius + border:
                #If the current state of the bullet is touching/inside the bot:
                if bot_y - character_radius - border < agent_bullet_y + bullet_radius < bot_y + character_radius or \
                        bot_y - character_radius < agent_bullet_y - bullet_radius < bot_y + character_radius + border:
                    return True
                #If the bullet "passed through" the character from the last turn:
                elif (last_agent_bullet_y - bullet_radius > bot_y + character_radius + border and
                      bot_y - character_radius - border > agent_bullet_y + bullet_radius) \
                        or (agent_bullet_y - bullet_radius > bot_y + character_radius + border and
                            bot_y - character_radius - border > last_agent_bullet_y + bullet_radius):
                    return True
                else:
                    return False
        elif agent_bullet_y == last_agent_bullet_y:
            if bot_y - character_radius - border < agent_bullet_y - bullet_radius < bot_y + character_radius + border or \
                    bot_y - character_radius - border < agent_bullet_y + bullet_radius < bot_y + character_radius + border:
                #If the current state of the bullet is touching/inside the bot:
                if bot_x - character_radius - border < agent_bullet_x + bullet_radius < bot_x + character_radius or \
                        bot_x - character_radius < agent_bullet_x - bullet_radius < bot_x + character_radius + border:
                    return True
                #If the bullet "passed through" the character from the last turn:
                elif (last_agent_bullet_x - bullet_radius > bot_x + character_radius + border and
                      bot_x - character_radius - border > agent_bullet_x + bullet_radius) \
                        or (agent_bullet_x - bullet_radius > bot_x + character_radius + border and
                            bot_x - character_radius - border > last_agent_bullet_x + bullet_radius):
                    return True
                else:
                    return False


def mapping(maximum, number):
    """Discretizes a normalized value into 10 buckets"""
    return int(abs(number * maximum) / (maximum/10))


def action(agent_action, bot_action):
    """Applies both characters' actions, moves the bullets, resolves hits,
    and returns (reward, continue_episode, success, winner)"""
    global bot_bullet_x, bot_bullet_y, agent_bullet_x, agent_bullet_y, agent_bullet_fire, \
        bot_bullet_fire, agent_bullet_direction_x, agent_bullet_direction_y, bot_bullet_direction_x, \
        bot_bullet_direction_y, agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, \
        last_agent_bullet_x, last_agent_bullet_y, last_bot_bullet_x, last_bot_bullet_y
    reward = 0; cont = True; successful = False; winner = ""
    #Bot's turn: actions 1-4 fire, 5-8 move, 9 does nothing
    if 1 <= bot_action <= 4 and bot_bullet_fire == False:
        bot_bullet_fire = True
        if bot_action == 1:
            bot_bullet_direction_x = 0; bot_bullet_direction_y = -bullet_speed
        elif bot_action == 2:
            bot_bullet_direction_x = bullet_speed; bot_bullet_direction_y = 0
        elif bot_action == 3:
            bot_bullet_direction_x = 0; bot_bullet_direction_y = bullet_speed
        elif bot_action == 4:
            bot_bullet_direction_x = -bullet_speed; bot_bullet_direction_y = 0
        bot_bullet_x = bot_x + bot_bullet_direction_x; bot_bullet_y = bot_y + bot_bullet_direction_y
    elif 5 <= bot_action <= 8:
        if bot_action == 5:
            bot_y -= character_move_speed
            if bot_y <= disp_y/2 - arena_y/2 + character_radius + 1:
                bot_y = disp_y/2 - arena_y/2 + character_radius + 1
        elif bot_action == 6:
            bot_x += character_move_speed
            if bot_x >= disp_x/2 + arena_x/2 - character_radius - 1:
                bot_x = disp_x/2 + arena_x/2 - character_radius - 1
        elif bot_action == 7:
            bot_y += character_move_speed
            if bot_y >= disp_y/2 + arena_y/2 - character_radius - 1:
                bot_y = disp_y/2 + arena_y/2 - character_radius - 1
        elif bot_action == 8:
            bot_x -= character_move_speed
            if bot_x <= disp_x/2 - arena_x/2 + character_radius + 1:
                bot_x = disp_x/2 - arena_x/2 + character_radius + 1
    if bot_bullet_fire == True:
        last_bot_bullet_x = bot_bullet_x; last_bot_bullet_y = bot_bullet_y
        bot_bullet_x += bot_bullet_direction_x; bot_bullet_y += bot_bullet_direction_y
        if bullet_hit_detector("bot"):
            print "Agent Got Hit!"
            agent_hp -= bullet_damage
            reward = -50
            bot_bullet_fire = False
            bot_bullet_direction_x = 0; bot_bullet_direction_y = 0
            bot_bullet_x = bot_bullet_y = bullet_ob; last_bot_bullet_x = last_bot_bullet_y = bullet_ob
            if agent_hp <= 0:
                cont = False
                winner = "Bot"
        elif bot_bullet_x + bullet_radius >= disp_x/2 + arena_x/2 or bot_bullet_x - bullet_radius <= disp_x/2 - arena_x/2 or \
                bot_bullet_y + bullet_radius >= disp_y/2 + arena_y/2 or bot_bullet_y - bullet_radius <= disp_y/2 - arena_y/2:
            bot_bullet_fire = False
            bot_bullet_direction_x = 0; bot_bullet_direction_y = 0
            bot_bullet_x = bot_bullet_y = bullet_ob; last_bot_bullet_x = last_bot_bullet_y = bullet_ob
    #Agent's turn: actions 1-4 fire, 5-8 move, 9 does nothing
    if 1 <= agent_action <= 4:
        if agent_bullet_fire == False:
            agent_bullet_fire = True
            if agent_action == 1:
                if agent_y - character_radius - border > disp_y/2 - arena_y/2:
                    agent_bullet_direction_x = 0; agent_bullet_direction_y = -bullet_speed
                    reward = 10
                else:
                    reward = -25
                    agent_bullet_x = agent_bullet_y = bullet_ob
                    agent_bullet_fire = False
            elif agent_action == 2:
                if agent_x + character_radius + border < disp_x/2 + arena_x/2:
                    agent_bullet_direction_x = bullet_speed; agent_bullet_direction_y = 0
                    reward = 10
                else:
                    reward = -25
                    agent_bullet_x = agent_bullet_y = bullet_ob
                    agent_bullet_fire = False
            elif agent_action == 3:
                if agent_y + character_radius + border < disp_y/2 + arena_y/2:
                    agent_bullet_direction_x = 0; agent_bullet_direction_y = bullet_speed
                    reward = 10
                else:
                    reward = -25
                    agent_bullet_x = agent_bullet_y = bullet_ob
                    agent_bullet_fire = False
            elif agent_action == 4:
                if agent_x - character_radius - border > disp_x/2 - arena_x/2:
                    agent_bullet_direction_x = -bullet_speed; agent_bullet_direction_y = 0
                    reward = 10
                else:
                    reward = -25
                    agent_bullet_x = agent_bullet_y = bullet_ob
                    agent_bullet_fire = False
            if agent_bullet_fire == True:
                agent_bullet_x = agent_x + agent_bullet_direction_x; agent_bullet_y = agent_y + agent_bullet_direction_y
                last_agent_bullet_x = agent_bullet_x; last_agent_bullet_y = agent_bullet_y
        else:
            reward = -20
    elif 5 <= agent_action <= 8:
        if agent_action == 5:
            agent_y -= character_move_speed
            if agent_y - character_radius - border <= disp_y/2 - arena_y/2:
                agent_y = disp_y/2 - arena_y/2 + character_radius + border
                reward = -5
            else:
                reward = 5
        elif agent_action == 6:
            agent_x += character_move_speed
            if agent_x + character_radius + border >= disp_x/2 + arena_x/2:
                agent_x = disp_x/2 + arena_x/2 - character_radius - border
                reward = -5
            else:
                reward = 5
        elif agent_action == 7:
            agent_y += character_move_speed
            if agent_y + character_radius + border >= disp_y/2 + arena_y/2:
                agent_y = disp_y/2 + arena_y/2 - character_radius - border
                reward = -5
            else:
                reward = 5
        elif agent_action == 8:
            agent_x -= character_move_speed
            if agent_x - character_radius - border <= disp_x/2 - arena_x/2:
                agent_x = disp_x/2 - arena_x/2 + character_radius + border
                reward = -5
            else:
                reward = 5
    if agent_bullet_fire == True:
        last_agent_bullet_x = agent_bullet_x; last_agent_bullet_y = agent_bullet_y
        agent_bullet_x += agent_bullet_direction_x; agent_bullet_y += agent_bullet_direction_y
        if bullet_hit_detector("agent"):
            print "Bot Got Hit!"
            bot_hp -= bullet_damage
            reward = 100
            agent_bullet_fire = False
            agent_bullet_direction_x = 0; agent_bullet_direction_y = 0
            agent_bullet_x = agent_bullet_y = bullet_ob; last_agent_bullet_x = last_agent_bullet_y = bullet_ob
            if bot_hp <= 0:
                successful = True
                cont = False
                winner = "Agent"
        elif agent_bullet_x + bullet_radius >= disp_x/2 + arena_x/2 or agent_bullet_x - bullet_radius <= disp_x/2 - arena_x/2 or \
                agent_bullet_y + bullet_radius >= disp_y/2 + arena_y/2 or agent_bullet_y - bullet_radius <= disp_y/2 - arena_y/2:
            agent_bullet_fire = False
            agent_bullet_direction_x = 0; agent_bullet_direction_y = 0
            agent_bullet_x = agent_bullet_y = bullet_ob; last_agent_bullet_x = last_agent_bullet_y = bullet_ob
    return reward, cont, successful, winner


#Parameters
y = 0.75            #discount rate
e = 0.3             #exploration rate
num_episodes = 10000
batch_size = 10
complexity = 10

with tf.Session() as sess:
    sess.run(initialize)
    success = 0
    for i in range(1, num_episodes):
        rAll = 0; d = False; c = True; j = 0
        param_init()
        samples = []
        while c == True:
            j += 1
            screen_blit()
            current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                       mapping(complexity, float(agent_y) / float(arena_y)),
                                       mapping(complexity, float(bot_x) / float(arena_x)),
                                       mapping(complexity, float(bot_y) / float(arena_y)),
                                       mapping(complexity, float(bot_bullet_x) / float(arena_x)),
                                       mapping(complexity, float(bot_bullet_y) / float(arena_y)),
                                       mapping(complexity, abs(float(agent_x - bot_x)) / float(arena_x)),
                                       mapping(complexity, abs(float(agent_y - bot_y)) / float(arena_y))]])
            bot_take_action()
            #Epsilon-greedy action selection (always random for the first 5 episodes)
            if np.random.rand(1) < e or i <= 5:
                a = random.randint(0, 8)
            else:
                a, _ = sess.run([predict, Q], feed_dict={input_layer: current_state})
            r, c, d, winner = action(a + 1, bot_action)
            next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                    mapping(complexity, float(agent_y) / float(arena_y)),
                                    mapping(complexity, float(bot_x) / float(arena_x)),
                                    mapping(complexity, float(bot_y) / float(arena_y)),
                                    mapping(complexity, float(bot_bullet_x) / float(arena_x)),
                                    mapping(complexity, float(bot_bullet_y) / float(arena_y)),
                                    mapping(complexity, abs(float(agent_x - bot_x)) / float(arena_x)),
                                    mapping(complexity, abs(float(agent_y - bot_y)) / float(arena_y))]])
            samples.append([current_state, a, r, next_state])
            #Experience replay: train on random samples once enough are collected
            if len(samples) > 10:
                for count in xrange(batch_size):
                    [batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
                    batch_allQ = sess.run(Q, feed_dict={input_layer: batch_current_state})
                    batch_Q1 = sess.run(Q, feed_dict={input_layer: batch_next_state})
                    batch_maxQ1 = np.max(batch_Q1)
                    batch_targetQ = batch_allQ
                    batch_targetQ[0][a] = reward + y * batch_maxQ1
                    sess.run([updateModel], feed_dict={input_layer: batch_current_state, next_Q: batch_targetQ})
            rAll += r
            if d == True:
                e = 1. / ((i / 50) + 10)
                success += 1
                break
            display.update()
        rList.append(rAll)
        print winner
    print "Successful episodes: %d out of %d. Success Rate = %f" % (success, num_episodes, float(success)/float(num_episodes))
    plt.plot(rList)
    plt.show()

If you have pygame, TensorFlow, and matplotlib installed in a Python environment, you should be able to run the code and watch the animation of the bot and the agent "fighting".

The update digressed a bit, but it would be great if somebody could address my specific setup along with the original, general question.

Thanks!
