I'm attempting to implement the policy gradient algorithm from the "Hands-On Machine Learning" book by Geron, which can be found here. The notebook uses TensorFlow and I'm trying to port it to PyTorch.
My model looks as follows:
```python
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ELU(),
    nn.Linear(128, 2),
)
```
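For reference, this is how I'm interpreting the two outputs (the `dummy_obs` below is just an illustrative placeholder, the real observations come from the environment):

```python
# Quick sanity check of the forward pass on a dummy 4-dimensional observation
dummy_obs = torch.zeros(4)
logit = model(dummy_obs)              # shape (2,): one logit per action
probs = F.softmax(logit, dim=0)       # action probabilities, sum to 1
action = Categorical(probs).sample()  # 0 = push cart left, 1 = push cart right
```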
Criterion and optimiser:

```python
criterion = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=0.01)
```

Training:
```python
env = gym.make("CartPole-v0")

n_games_per_update = 10
n_max_steps = 1000
n_iterations = 250
save_iterations = 10
discount_rate = 0.95

for iteration in range(n_iterations):  # Train for 250 iterations
    all_rewards = []
    all_gradients = []
    n_steps = []
    optim.zero_grad()
    for game in range(n_games_per_update):  # Run the game 10 times to accumulate gradients
        current_rewards = []
        current_gradients = []
        obs = env.reset()
        for step in range(n_max_steps):  # Run a single game for a maximum of 1000 steps
            logit = model(torch.tensor(obs, dtype=torch.float))
            output = F.softmax(logit, dim=0)
            c = Categorical(output)
            action = c.sample()
            y = torch.tensor([1.0 - action, action], dtype=torch.float)
            loss = criterion(logit, y)
            loss.backward()
            obs, reward, done, info = env.step(int(action))
            current_rewards.append(reward)
            current_gradients.append([p.grad for p in model.parameters()])
            if done:
                break
        n_steps.append(step)
        all_rewards.append(current_rewards)
        all_gradients.append(current_gradients)

    # Discount and normalise the rewards
    all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate=discount_rate)

    # For each batch of 10 games, multiply the discounted rewards by the gradients of the
    # network, then take the mean for each parameter
    new_gradients = []
    for var_index, _ in enumerate(model.parameters()):
        means = []
        for game_index, rewards in enumerate(all_rewards):
            for step, reward in enumerate(rewards):
                means.append(reward * all_gradients[game_index][step][var_index])
        new_gradients.append(torch.mean(torch.stack(means), 0, True).squeeze(0))

    # Apply the new gradients to the network
    for p, g in zip(model.parameters(), new_gradients):
        p.grad = g.clone()
    optim.step()
```
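For completeness, `discount_and_normalize_rewards` is my port of the notebook's helper. Roughly, it discounts each game's rewards and then normalises them using the mean and standard deviation over all games in the batch; the sketch below captures the logic (exact variable names don't matter):

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    # Work backwards through the episode, accumulating discounted returns
    discounted = np.empty(len(rewards))
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    # Normalise with the mean/std over every step of every game in the batch
    return [(discounted - flat.mean()) / flat.std() for discounted in all_discounted]
```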
When I run the code for 250 iterations and print the average game length, I get:

```
Iteration: 50, Average Length: 18.2
Iteration: 100, Average Length: 23.4
Iteration: 150, Average Length: 29.9
Iteration: 200, Average Length: 11.2
Iteration: 250, Average Length: 38.6
```

The network isn't really improving, and training for longer doesn't help. My two questions are:

1. Is there anything obviously wrong with what I'm doing?
2. I've noticed that the log of the probability is used in the TensorFlow implementation, but I'm not sure how to integrate it here.
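For question 2, is the per-step loss supposed to look something like the sketch below, scoring the sampled action by its negative log-probability instead of using `BCEWithLogitsLoss` against a one-hot target? I'm not sure whether this is equivalent to what the notebook computes:

```python
# Hypothetical replacement for the per-step loss in the training loop above
logit = model(torch.tensor(obs, dtype=torch.float))
c = Categorical(logits=logit)   # softmax is applied internally
action = c.sample()
loss = -c.log_prob(action)      # gradient of -log pi(action | obs)
loss.backward()
```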