My Q-learning algorithm's Q-values keep diverging to infinity, which means my weights are diverging too. I use a neural network as the value-function approximator.
I've tried:
- Clipping the TD target (reward + discount * maximum Q-value of the next state) to the range -50~50
- Setting a low learning rate (0.00001; I use classic backpropagation to update the weights)
- Decreasing the values of the rewards
- Increasing the exploration rate
- Normalizing the inputs to 1~100 (previously they were 0~1)
- Changing the discount rate
- Reducing the number of layers in the neural network (just for validation)
I've heard that Q-learning is known to diverge with non-linear function approximation, but is there anything else I can try to stop the weights from diverging?
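For reference, two common stabilizers I've read about (but haven't tried above) are a separate target network and clipping gradients rather than rewards or targets. Here's a minimal NumPy sketch with a linear Q-function; all names, sizes, and constants are illustrative, not from my actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 7, 9
w_online = rng.uniform(0.0, 0.1, size=(n_features, n_actions))
w_target = w_online.copy()  # frozen copy used only for TD targets

def q_values(w, state):
    # Linear Q-function: one Q-value per action
    return state @ w

def td_update(state, action, reward, next_state, gamma=0.75, lr=1e-3, clip=1.0):
    global w_online
    # Bootstrap from the *target* weights, not the online weights
    target = reward + gamma * np.max(q_values(w_target, next_state))
    td_error = target - q_values(w_online, state)[action]
    # Gradient of the squared TD error w.r.t. the taken action's weights
    grad = np.zeros_like(w_online)
    grad[:, action] = -td_error * state
    grad = np.clip(grad, -clip, clip)  # clip gradients, not rewards
    w_online -= lr * grad

# Sync the target network only every N updates
for step in range(1000):
    s, s2 = rng.random(n_features), rng.random(n_features)
    td_update(s, int(rng.integers(n_actions)), float(rng.random()), s2)
    if step % 100 == 0:
        w_target = w_online.copy()
```

The point of the frozen copy is that the bootstrap target stops chasing the very weights it is training, which is one of the usual sources of runaway Q-values.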
Update #1 on August 14th, 2017:
I've decided to add some specific details about what I'm doing, since they were requested.
I'm currently trying to make an agent learn how to fight in a top-down view of a shooting game. The opponent is a simple bot which moves stochastically.
Each character has 9 actions to choose from on each turn:
- move up
- move down
- move left
- move right
- shoot a bullet upwards
- shoot a bullet downwards
- shoot a bullet to the left
- shoot a bullet to the right
- do nothing
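For clarity, the nine-action discrete space above could be encoded like this (the enum names are mine; the actual code just uses bare integers 1~9):

```python
from enum import IntEnum

# Hypothetical encoding of the nine discrete actions; illustrative only.
class Action(IntEnum):
    MOVE_UP = 1
    MOVE_DOWN = 2
    MOVE_LEFT = 3
    MOVE_RIGHT = 4
    SHOOT_UP = 5
    SHOOT_DOWN = 6
    SHOOT_LEFT = 7
    SHOOT_RIGHT = 8
    DO_NOTHING = 9
```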
The rewards are:
- if the agent hits the bot with a bullet, +100 (I've tried many different values)
- if the agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)
- if the agent tries to fire a bullet while bullets can't be fired (e.g. right after the agent has just fired one), -25 (not necessary, but I wanted the agent to be more efficient)
- if the agent tries to move out of the arena, -20 (also not necessary, but I wanted the agent to be more efficient)
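In pseudocode terms, the reward scheme above amounts to something like this (the event flags are illustrative names, not from my actual game loop):

```python
def compute_reward(hit_bot, hit_by_bot, invalid_shot, left_arena):
    """Per-turn reward for the agent; events may co-occur."""
    reward = 0
    if hit_bot:
        reward += 100   # agent's bullet hits the bot
    if hit_by_bot:
        reward -= 50    # agent is hit by the bot's bullet
    if invalid_shot:
        reward -= 25    # fired while firing is not allowed
    if left_arena:
        reward -= 20    # tried to move out of the arena
    return reward
```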
The inputs for the neural network are:
- Distance between the agent and the bot on the X axis, normalized to 0~100
- Distance between the agent and the bot on the Y axis, normalized to 0~100
- The agent's x and y positions
- The bot's x and y positions
- The bot's bullet position (if the bot didn't fire a bullet, these parameters are set to the bot's x and y positions)
I've also fiddled with the inputs; for example, I tried adding features like the agent's actual x position (not the distance, the position itself) and the position of the bot's bullet. None of them worked.
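To make the input layout concrete, here's a sketch of a state-vector builder for the features listed above, normalized to 0~100; the field order, arena-size handling, and names are my assumptions, not exactly what the code below does:

```python
def build_state(agent_xy, bot_xy, bullet_xy=None, arena=(1000, 800)):
    """Build a normalized feature vector (each entry in 0~100)."""
    ax, ay = agent_xy
    bx, by = bot_xy
    # When no bullet is in flight, fall back to the bot's own position
    ux, uy = bullet_xy if bullet_xy is not None else bot_xy
    w, h = float(arena[0]), float(arena[1])
    return [
        abs(ax - bx) / w * 100,       # X distance, 0~100
        abs(ay - by) / h * 100,       # Y distance, 0~100
        ax / w * 100, ay / h * 100,   # agent position
        bx / w * 100, by / h * 100,   # bot position
        ux / w * 100, uy / h * 100,   # bullet position (or bot position)
    ]
```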
Here's the code:
```python
from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm

# Screen setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1

# Color setup
white = (255, 255, 255); aqua = (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)

# Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

# Set up character dimensions
character_size = 50
character_move_speed = 25

# Initialize character stats
character_init_health = 100

# Initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100

# The neural network
input_layer = tf.placeholder(shape=[1, 7], dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7, 9], 0, 0.1))
# weight_2 = tf.Variable(tf.random_uniform([6, 9], 0, 0.1))

# The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1, 9], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)
initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

# Character/bullet parameters
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()


def param_init():
    """Initializes parameters"""
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, \
        agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, \
        agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y
    agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
    bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
    agent_hp = bot_hp = character_init_health
    agent_beam_fire = bot_beam_fire = False
    agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
    agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0


def screen_blit():
    global agent_beam_fire, bot_beam_fire
    disp.fill(aqua)
    draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y / 2 - arena_y / 2 - border,
                            arena_x + border * 2, arena_y + border * 2))
    draw.rect(disp, green, (disp_x / 2 - arena_x / 2, disp_y / 2 - arena_y / 2, arena_x, arena_y))
    if bot_beam_fire:
        draw.rect(disp, green_yellow, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
        bot_beam_fire = False
    if agent_beam_fire:
        draw.rect(disp, energy_blue, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
        agent_beam_fire = False
    draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
    draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))
    draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 + border + 1,
                          float(agent_hp) / float(character_init_health) * 100, 14))
    draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 + border + 1,
                           float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
    return random.randint(1, 9)


def beam_hit_detector(player):
    if player == "bot":
        if bot_current_action == 1:
            return (disp_y/2 - arena_y/2 <= agent_y <= bot_y and
                    (agent_x < bot_beam_x + beam_width < agent_x + character_size or
                     agent_x < bot_beam_x < agent_x + character_size))
        elif bot_current_action == 2:
            return (bot_x <= agent_x <= disp_x/2 + arena_x/2 and
                    (agent_y < bot_beam_y + beam_width < agent_y + character_size or
                     agent_y < bot_beam_y < agent_y + character_size))
        elif bot_current_action == 3:
            return (bot_y <= agent_y <= disp_y/2 + arena_y/2 and
                    (agent_x < bot_beam_x + beam_width < agent_x + character_size or
                     agent_x < bot_beam_x < agent_x + character_size))
        elif bot_current_action == 4:
            return (disp_x/2 - arena_x/2 <= agent_x <= bot_x and
                    (agent_y < bot_beam_y + beam_width < agent_y + character_size or
                     agent_y < bot_beam_y < agent_y + character_size))
    else:
        if agent_current_action == 1:
            return (disp_y/2 - arena_y/2 <= bot_y <= agent_y and
                    (bot_x < agent_beam_x + beam_width < bot_x + character_size or
                     bot_x < agent_beam_x < bot_x + character_size))
        elif agent_current_action == 2:
            return (agent_x <= bot_x <= disp_x/2 + arena_x/2 and
                    (bot_y < agent_beam_y + beam_width < bot_y + character_size or
                     bot_y < agent_beam_y < bot_y + character_size))
        elif agent_current_action == 3:
            return (agent_y <= bot_y <= disp_y/2 + arena_y/2 and
                    (bot_x < agent_beam_x + beam_width < bot_x + character_size or
                     bot_x < agent_beam_x < bot_x + character_size))
        elif agent_current_action == 4:  # fixed: previously checked bot_current_action
            return (disp_x/2 - arena_x/2 <= bot_x <= agent_x and
                    (bot_y < agent_beam_y + beam_width < bot_y + character_size or
                     bot_y < agent_beam_y < bot_y + character_size))
    return False


def mapping(maximum, number):
    return number  # int(number * maximum)


def action(agent_action, bot_action):
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, \
        bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, \
        agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, \
        agent_current_action, bot_current_action
    agent_current_action = agent_action; bot_current_action = bot_action
    reward = 0; cont = True; successful = False; winner = ""

    # Bot's turn: fire a beam (1-4) or move (5-8)
    if 1 <= bot_action <= 4:
        bot_beam_fire = True
        if bot_action == 1:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
            bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
        elif bot_action == 2:
            bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
        elif bot_action == 3:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
            bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
        elif bot_action == 4:
            bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width
    elif 5 <= bot_action <= 8:
        if bot_action == 5:
            bot_y -= character_move_speed
            if bot_y <= disp_y/2 - arena_y/2:
                bot_y = disp_y/2 - arena_y/2
            elif agent_y <= bot_y <= agent_y + character_size:
                bot_y = agent_y + character_size
        elif bot_action == 6:
            bot_x += character_move_speed
            if bot_x >= disp_x/2 + arena_x/2 - character_size:
                bot_x = disp_x/2 + arena_x/2 - character_size
            elif agent_x <= bot_x + character_size <= agent_x + character_size:
                bot_x = agent_x - character_size
        elif bot_action == 7:
            bot_y += character_move_speed
            if bot_y + character_size >= disp_y/2 + arena_y/2:
                bot_y = disp_y/2 + arena_y/2 - character_size
            elif agent_y <= bot_y + character_size <= agent_y + character_size:
                bot_y = agent_y - character_size
        elif bot_action == 8:
            bot_x -= character_move_speed
            if bot_x <= disp_x/2 - arena_x/2:
                bot_x = disp_x/2 - arena_x/2
            elif agent_x <= bot_x <= agent_x + character_size:
                bot_x = agent_x + character_size

    if bot_beam_fire:
        if beam_hit_detector("bot"):
            # print("Agent Got Hit!")
            agent_hp -= beam_damage
            reward += -50
            bot_beam_size_x = bot_beam_size_y = 0
            bot_beam_x = bot_beam_y = beam_ob
            if agent_hp <= 0:
                cont = False
                winner = "Bot"

    # Agent's turn: fire a beam (1-4) or move (5-8)
    if 1 <= agent_action <= 4:
        agent_beam_fire = True
        if agent_action == 1:
            if agent_y > disp_y/2 - arena_y/2:
                agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
                agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
            else:
                reward += -25
        elif agent_action == 2:
            if agent_x + character_size < disp_x/2 + arena_x/2:
                agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
            else:
                reward += -25
        elif agent_action == 3:
            if agent_y + character_size < disp_y/2 + arena_y/2:
                agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
                agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
            else:
                reward += -25
        elif agent_action == 4:
            if agent_x > disp_x/2 - arena_x/2:
                agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
            else:
                reward += -25
    elif 5 <= agent_action <= 8:
        if agent_action == 5:
            agent_y -= character_move_speed
            if agent_y <= disp_y/2 - arena_y/2:
                agent_y = disp_y/2 - arena_y/2
                reward += -5
            elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y + character_size
                reward += -2
        elif agent_action == 6:
            agent_x += character_move_speed
            if agent_x + character_size >= disp_x/2 + arena_x/2:
                agent_x = disp_x/2 + arena_x/2 - character_size
                reward += -5
            elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x - character_size
                reward += -2
        elif agent_action == 7:
            agent_y += character_move_speed
            if agent_y + character_size >= disp_y/2 + arena_y/2:
                agent_y = disp_y/2 + arena_y/2 - character_size
                reward += -5
            elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y - character_size
                reward += -2
        elif agent_action == 8:
            agent_x -= character_move_speed
            if agent_x <= disp_x/2 - arena_x/2:
                agent_x = disp_x/2 - arena_x/2
                reward += -5
            elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x + character_size
                reward += -2

    if agent_beam_fire:
        if beam_hit_detector("agent"):
            # print("Bot Got Hit!")
            bot_hp -= beam_damage
            reward += 50
            agent_beam_size_x = agent_beam_size_y = 0
            agent_beam_x = agent_beam_y = beam_ob
            if bot_hp <= 0:
                successful = True
                cont = False
                winner = "Agent"

    return reward, cont, successful, winner


def bot_beam_dir_detector():
    if bot_current_action == 1:
        bot_beam_dir = 2
    elif bot_current_action == 2:
        bot_beam_dir = 4
    elif bot_current_action == 3:
        bot_beam_dir = 3
    elif bot_current_action == 4:
        bot_beam_dir = 1
    else:
        bot_beam_dir = 0
    return bot_beam_dir


# Parameters
y = 0.75           # discount rate
e = 0.3            # exploration rate
num_episodes = 10000
batch_size = 10
complexity = 100

with tf.Session() as sess:
    sess.run(initialize)
    success = 0
    for i in tqdm(range(1, num_episodes)):
        # print("Episode #", i)
        rAll = 0; d = False; c = True; j = 0
        param_init()
        samples = []
        while c:
            j += 1
            current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                       mapping(complexity, float(agent_y) / float(arena_y)),
                                       mapping(complexity, float(bot_x) / float(arena_x)),
                                       mapping(complexity, float(bot_y) / float(arena_y)),
                                       # mapping(complexity, float(agent_hp) / float(character_init_health)),
                                       # mapping(complexity, float(bot_hp) / float(character_init_health)),
                                       mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                       mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                       bot_beam_dir]])
            b = bot_take_action()
            # Epsilon-greedy action selection
            if np.random.rand(1) < e or i <= 5:
                a = random.randint(0, 8)
            else:
                a, _ = sess.run([predict, Q], feed_dict={input_layer: current_state})
            r, c, d, winner = action(a + 1, b)
            bot_beam_dir = bot_beam_dir_detector()
            next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                    mapping(complexity, float(agent_y) / float(arena_y)),
                                    mapping(complexity, float(bot_x) / float(arena_x)),
                                    mapping(complexity, float(bot_y) / float(arena_y)),
                                    # mapping(complexity, float(agent_hp) / float(character_init_health)),
                                    # mapping(complexity, float(bot_hp) / float(character_init_health)),
                                    mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                    mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                    bot_beam_dir]])
            samples.append([current_state, a, r, next_state])
            if len(samples) > 10:
                for count in range(batch_size):
                    [batch_current_state, action_taken, reward, batch_next_state] = \
                        samples[random.randint(0, len(samples) - 1)]
                    batch_allQ = sess.run(Q, feed_dict={input_layer: batch_current_state})
                    batch_Q1 = sess.run(Q, feed_dict={input_layer: batch_next_state})
                    batch_maxQ1 = np.max(batch_Q1)
                    batch_targetQ = batch_allQ
                    # fixed: target must be set at the sampled action, not the current one
                    batch_targetQ[0][action_taken] = reward + y * batch_maxQ1
                    sess.run([updateModel], feed_dict={input_layer: batch_current_state,
                                                       next_Q: batch_targetQ})
            rAll += r
            screen_blit()
            if d:
                e = 1. / ((i / 50) + 10)
                success += 1
                break
            # print(agent_hp, bot_hp)
            display.update()
        jList.append(j)
        rList.append(rAll)
        print(winner)
```

If you have pygame, TensorFlow, and matplotlib installed in a Python environment, you should be able to watch the animations of the bot and the agent "fighting".
I digressed in the update, but it would be awesome if somebody could also address my specific problem along with the original general problem.
Thanks!
Update #2 on August 18, 2017:
Based on the advice of @NeilSlater, I've implemented experience replay in my model. The algorithm has improved, but I'm still looking for further improvements that lead to convergence.
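For anyone curious what the experience replay looks like structurally, here's a minimal buffer of the kind I mean (capacity and field layout are illustrative; my actual code uses a plain list of samples):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions, sampled uniformly at random."""

    def __init__(self, capacity=10000):
        # deque with maxlen silently discards the oldest transitions
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample without replacement, capped at the current buffer size
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

Training on random past transitions instead of only the latest one breaks the correlation between consecutive updates, which is the main reason replay helps stability.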
Update #3 on August 22, 2017:
I've noticed that if the agent hits the bot with a bullet on a given turn, but the action the agent took on that turn was not "fire a bullet" (since the bullet was fired on an earlier turn and traveled in between), then the wrong action would be given credit. To fix this, I've turned the bullets into beams, so the bot/agent takes damage on the same turn the beam is fired.