
Why does Q Learning diverge?

My Q-learning algorithm's state values keep diverging to infinity, which means my weights are diverging too. I use a neural network as my value mapping.

I've tried:

  • Clipping the target value "reward + discount * maximum action value" (max/min set to 50/-50)
  • Setting a low learning rate (0.00001; I use classic backpropagation to update the weights)
  • Decreasing the values of the rewards
  • Increasing the exploration rate
  • Normalizing the inputs to the range 1~100 (previously it was 0~1)
  • Changing the discount rate
  • Decreasing the number of layers in the neural network (just for validation)

I've heard that Q-learning is known to diverge with non-linear function approximation, but is there anything else I can try to stop the weights from diverging?
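For context, each training step boils down to roughly this (a condensed sketch of the update from the full code posted in the update below; y is the discount rate and clip() bounds the target to -50~50):

    a, allQ = sess.run([predict, Q], feed_dict={input_layer: current_state})  # Q values for the current state
    Q1 = sess.run(Q, feed_dict={input_layer: next_state})                     # Q values for the next state
    targetQ = allQ
    targetQ[0, a[0]] = clip(r + y * np.max(Q1))                               # clipped TD target for the taken action
    sess.run([updateModel], feed_dict={input_layer: current_state, next_Q: targetQ})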

Update:

I've decided to add some specific details about what I'm doing, since it was requested.

I'm currently trying to make an agent learn how to fight in a top-down shooting game. The opponent is a simple bot that moves stochastically.

Each character has 9 actions to choose from on each turn:

  • move up
  • move down
  • move left
  • move right
  • shoot a bullet upwards
  • shoot a bullet downwards
  • shoot a bullet to the left
  • shoot a bullet to the right
  • do nothing
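For reference, in the code at the bottom these actions are encoded as integer indices (the network has 9 outputs and I pass argmax + 1 into the environment). The names below are just descriptions; the code uses the raw indices:

    ACTION_MEANINGS = {  # illustrative labels only
        1: "shoot up", 2: "shoot right", 3: "shoot down", 4: "shoot left",
        5: "move up", 6: "move right", 7: "move down", 8: "move left",
        9: "do nothing",
    }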

The rewards are:

  • if the agent hits the bot with a bullet, +100 (I've tried many different values)
  • if the agent gets hit by a bullet shot by the bot, -10 (again, I've tried many different values)
  • otherwise the reward is 0
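Written as a function, the reward scheme is roughly this (a minimal sketch; the two flags are illustrative names for the collision checks done in the code below):

    def reward_for(agent_hit_bot, agent_was_hit):
        # +100 when the agent's bullet hits the bot, -10 when the agent gets hit, 0 otherwise
        if agent_hit_bot:
            return 100
        if agent_was_hit:
            return -10
        return 0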

The inputs for the neural network are:

  • Distance between the agent and the bot on the X axis, normalized to 0~100
  • Distance between the agent and the bot on the Y axis, normalized to 0~100
  • Agent health
  • Bot health

I've also fiddled with the inputs; I tried adding new features like the x value of the agent's position (not the distance, but the actual position) and the position of the bot's bullet. None of them worked.
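Concretely, the state vector fed into the network is built roughly like this (condensed from the code below; mapping() scales the normalized distances onto 0~complexity, with complexity = 100):

    complexity = 100
    current_state = np.array([[
        mapping(complexity, float(agent_x - bot_x) / float(arena_x)),   # x distance, scaled to 0~100
        mapping(complexity, float(agent_y - bot_y) / float(arena_y)),   # y distance, scaled to 0~100
        float(agent_hp) / float(character_init_health),                 # agent health
        float(bot_hp) / float(character_init_health),                   # bot health
    ]])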

Here's the code:

from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt  # for the reward plot at the end

#Screen Setup
disp_x, disp_y = 1500, 1000
arena_x, arena_y = 800, 800
border = 4; border_2 = 1

#Color Setup
white = (255, 255, 255); aqua = (0, 200, 200); red = (255, 0, 0); green = (0, 255, 0); blue = (0, 0, 255); black = (0, 0, 0)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Setup character dimensions
character_radius = 30
character_move_speed = 20

#Initialize character stats
character_init_health = 100

#Initialize bullet stats
bullet_speed = 30
bullet_damage = 10
bullet_radius = 7
bullet_a_pos = list(init_character_a_state); bullet_b_pos = list(init_character_b_state)
bullet_a_fire = False; bullet_b_fire = False

#The Neural Network
input_layer = tf.placeholder(shape=[1, 4], dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([4, 9], 0, 0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1, 9], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.0001)
updateModel = trainer.minimize(loss)
initialize = tf.global_variables_initializer()

#Parameters
y = 0.75              # discount rate
e = 0.4               # exploration rate
num_episodes = 10000
jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

def mapping(maximum, number):
    return int(abs(number * maximum))

def clip(value):
    if value > 50.0:
        value = 50.0
    elif value < -50.0:
        value = -50.0
    return value

#Environment (Training) Parameters:
agent_bullet_fire = bot_bullet_fire = False

#The environment:
def action(agent_x, agent_y, bot_x, bot_y, agent_action, bot_action, agent_hp, bot_hp, agent_bullet, bot_bullet):
    #Bullet Management
    reward = 0
    cont = True
    successful = False
    winner = ""
    if 1 <= bot_action <= 4:
        #If bullet's fired by bot:
        if bot_action == 1: bot_bullet[1] -= bullet_speed
        elif bot_action == 2: bot_bullet[0] += bullet_speed
        elif bot_action == 3: bot_bullet[1] += bullet_speed
        elif bot_action == 4: bot_bullet[0] -= bullet_speed
        if bot_bullet[0] > disp_x/2 + arena_x/2 + bullet_radius or bot_bullet[0] < disp_x/2 - arena_x/2 - bullet_radius or bot_bullet[1] > disp_y/2 + arena_y/2 + bullet_radius or bot_bullet[1] < disp_y/2 - arena_y/2 - bullet_radius:
            bot_bullet_fire = False
            bot_bullet = [bot_x, bot_y]
        if agent_x - character_radius - border <= bot_bullet[0] <= agent_x + character_radius + border and agent_y - character_radius - border < bot_bullet[1] < agent_y + character_radius + border:
            agent_hp -= bullet_damage
            reward = -10
            if agent_hp <= 0:
                cont = False
                winner = "Bot"
    if 5 <= bot_action <= 8:
        bot_bullet = [bot_x, bot_y]
        if bot_action == 5:
            bot_y -= character_move_speed
            if bot_y <= disp_y/2 - arena_y/2 + character_radius + 1: bot_y = disp_y/2 - arena_y/2 + character_radius + 1
        elif bot_action == 6:
            bot_x += character_move_speed
            if bot_x >= disp_x/2 + arena_x/2 - character_radius - 1: bot_x = disp_x/2 + arena_x/2 - character_radius - 1
        elif bot_action == 7:
            bot_y += character_move_speed
            if bot_y >= disp_y/2 + arena_y/2 - character_radius - 1: bot_y = disp_y/2 + arena_y/2 - character_radius - 1
        elif bot_action == 8:
            bot_x -= character_move_speed
            if bot_x <= disp_x/2 - arena_x/2 + character_radius + 1: bot_x = disp_x/2 - arena_x/2 + character_radius + 1
    if 1 <= agent_action <= 4:
        if agent_action == 1: agent_bullet[1] -= bullet_speed
        elif agent_action == 2: agent_bullet[0] += bullet_speed
        elif agent_action == 3: agent_bullet[1] += bullet_speed
        elif agent_action == 4: agent_bullet[0] -= bullet_speed
        if agent_bullet[0] > disp_x/2 + arena_x/2 + bullet_radius or agent_bullet[0] < disp_x/2 - arena_x/2 - bullet_radius or agent_bullet[1] > disp_y/2 + arena_y/2 + bullet_radius or agent_bullet[1] < disp_y/2 - arena_y/2 - bullet_radius:
            agent_bullet_fire = False
            agent_bullet = [agent_x, agent_y]
        if bot_x - character_radius <= agent_bullet[0] <= bot_x + character_radius and bot_y - character_radius < agent_bullet[1] < bot_y + character_radius:
            bot_hp -= bullet_damage
            reward = 100
            if bot_hp <= 0:
                successful = True
                cont = False
                winner = "Agent"
    if 5 <= agent_action <= 8:
        agent_bullet = [agent_x, agent_y]
        if agent_action == 5:
            agent_y -= character_move_speed
            if agent_y <= disp_y/2 - arena_y/2 + character_radius + 1: agent_y = disp_y/2 - arena_y/2 + character_radius + 1
        elif agent_action == 6:
            agent_x += character_move_speed
            if agent_x >= disp_x/2 + arena_x/2 - character_radius - 1: agent_x = disp_x/2 + arena_x/2 - character_radius - 1
        elif agent_action == 7:
            agent_y += character_move_speed
            if agent_y >= disp_y/2 + arena_y/2 - character_radius - 1: agent_y = disp_y/2 + arena_y/2 - character_radius - 1
        elif agent_action == 8:
            agent_x -= character_move_speed
            if agent_x <= disp_x/2 - arena_x/2 + character_radius + 1: agent_x = disp_x/2 - arena_x/2 + character_radius + 1
    return reward, cont, successful, agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_bullet, bot_bullet, winner

with tf.Session() as sess:
    sess.run(initialize)
    success = 0
    for i in range(num_episodes):
        rAll = 0
        s = 0
        d = False
        agent_x = int(list(init_character_a_state)[0]); agent_y = int(list(init_character_a_state)[1])
        bot_x = int(list(init_character_b_state)[0]); bot_y = int(list(init_character_b_state)[0])
        agent_hp = bot_hp = int(character_init_health)
        bot_bullet = list(init_character_b_state); agent_bullet = list(init_character_a_state)
        j = 0
        c = True
        while c == True:
            #Draw the arena, characters, bullets and health bars
            disp.fill(aqua)
            draw.rect(disp, black, (disp_x/2 - arena_x/2 - border, disp_y/2 - arena_y/2 - border, arena_x + border * 2, arena_y + border * 2))
            draw.rect(disp, green, (disp_x/2 - arena_x/2, disp_y/2 - arena_y/2, arena_x, arena_y))
            draw.circle(disp, black, agent_bullet, bullet_radius)
            draw.circle(disp, black, bot_bullet, bullet_radius)
            draw.circle(disp, black, (agent_x, agent_y), character_radius + border_2)
            draw.circle(disp, red, (agent_x, agent_y), character_radius)
            draw.circle(disp, black, (bot_x, bot_y), character_radius + border_2)
            draw.circle(disp, blue, (bot_x, bot_y), character_radius)
            draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 + border + 1, float(agent_hp)/float(character_init_health) * 100, 14))
            draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 + border + 1, float(bot_hp)/float(character_init_health) * 100, 14))
            j += 1

            """
            ---CURRENT STATE---
            Everything will be on a scale of 0 to "complexity".
            (0 = 0, "complexity" = max)
            """
            complexity = 100
            current_state = np.array([[mapping(complexity, float(agent_x - bot_x)/float(arena_x)),
                                       mapping(complexity, float(agent_y - bot_y)/float(arena_y)),
                                       float(agent_hp)/float(character_init_health),
                                       float(bot_hp)/float(character_init_health)]])
            #current_state is the array of parameters for feeding the neural network
            #print current_state
            a, allQ = sess.run([predict, Q], feed_dict={input_layer: current_state})

            #bot move
            #1~4 are shooting a bullet. 5~8 are movement. 9 is doing nothing.
            if agent_x - character_radius <= bot_x <= agent_x + character_radius:
                if agent_y <= bot_y:
                    if random.randint(0, 100) > 20: bot_action = 1
                else:
                    if random.randint(0, 100) > 20: bot_action = 3
            elif agent_y - character_radius <= bot_y <= agent_y + character_radius:
                if agent_x <= bot_x:
                    if random.randint(0, 100) > 20: bot_action = 4
                else:
                    if random.randint(0, 100) > 20: bot_action = 2
            else:
                if random.randint(0, 100) > 20:
                    #Find opponent, calculate x and y distance and go the shortest way
                    x_dist = abs(bot_x - agent_x); y_dist = abs(bot_y - agent_y)
                    if x_dist >= y_dist:
                        if bot_x - agent_x <= 0: bot_action = 6
                        else: bot_action = 8
                    else:
                        if bot_y - agent_y <= 0: bot_action = 7
                        else: bot_action = 5
                else:
                    bot_action = random.randint(1, 9)

            #Epsilon-greedy exploration
            if np.random.rand(1) < e:
                a[0] = random.randint(0, 8)

            #Action: Takes positions and actions.
            r, c, d, new_agent_x, new_agent_y, new_bot_x, new_bot_y, new_agent_hp, new_bot_hp, new_agent_bullet, new_bot_bullet, winner = action(agent_x, agent_y, bot_x, bot_y, int(a[0]+1), bot_action, agent_hp, bot_hp, agent_bullet, bot_bullet)

            next_state = np.array([[mapping(complexity, float(new_agent_x - new_bot_x)/float(disp_x)),
                                    mapping(complexity, float(new_agent_y - new_bot_y)/float(disp_y)),
                                    new_agent_hp,
                                    new_bot_hp]])
            Q1 = sess.run(Q, feed_dict={input_layer: next_state})
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0, a[0]] = clip(r + y * maxQ1)
            print targetQ #For Debugging
            sess.run([updateModel], feed_dict={input_layer: current_state, next_Q: targetQ})
            rAll += r

            bot_x = new_bot_x; bot_y = new_bot_y; agent_x = new_agent_x; agent_y = new_agent_y
            agent_hp = new_agent_hp; bot_hp = new_bot_hp
            agent_bullet = new_agent_bullet; bot_bullet = new_bot_bullet

            if d == True:
                e = 1./((i/50) + 10)
                success += 1
                break
            display.update()
        jList.append(j)
        rList.append(rAll)
        print winner

plt.plot(rList)
#plt.plot(jList)
plt.show()

I'm pretty sure that if you have pygame, TensorFlow, and matplotlib installed in a Python environment, you should be able to see the animation of the bot and the agent "fighting".

I digressed a bit in the update, but it would be awesome if somebody could also address my specific problem along with the original, more general one.

Thanks!
