I'm attempting to implement the policy gradient algorithm from the "Hands-On Machine Learning" book by Geron, which can be found here. The notebook uses TensorFlow and I'm trying to port it to PyTorch.
My model looks as follows:
```python
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ELU(),
    nn.Linear(128, 2),
)
```
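For reference, this is how I'm interpreting the two outputs (the `dummy_obs` below is just an illustrative placeholder, the real observations come from the environment):

```python
# Quick sanity check of the forward pass on a dummy 4-dimensional observation
dummy_obs = torch.zeros(4)
logit = model(dummy_obs)              # shape (2,): one logit per action
probs = F.softmax(logit, dim=0)       # action probabilities, sum to 1
action = Categorical(probs).sample()  # 0 = push cart left, 1 = push cart right
```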
Criterion and optimiser:

```python
criterion = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=0.01)
```

Training:
```python
env = gym.make("CartPole-v0")

n_games_per_update = 10
n_max_steps = 1000
n_iterations = 250
save_iterations = 10
discount_rate = 0.95

for iteration in range(n_iterations):  # Train for 250 iterations
    all_rewards = []
    all_gradients = []
    n_steps = []
    optim.zero_grad()
    for game in range(n_games_per_update):  # Run the game 10 times to accumulate gradients
        current_rewards = []
        current_gradients = []
        obs = env.reset()
        for step in range(n_max_steps):  # Run a single game for a maximum of 1000 steps
            logit = model(torch.tensor(obs, dtype=torch.float))
            output = F.softmax(logit, dim=0)
            c = Categorical(output)
            action = c.sample()
            y = torch.tensor([1.0 - action, action], dtype=torch.float)
            loss = criterion(logit, y)
            loss.backward()
            obs, reward, done, info = env.step(int(action))
            current_rewards.append(reward)
            current_gradients.append([p.grad for p in model.parameters()])
            if done:
                break
        n_steps.append(step)
        all_rewards.append(current_rewards)
        all_gradients.append(current_gradients)

    # Discount and normalise the rewards
    all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate=discount_rate)

    # For each batch of 10 games, multiply the discounted rewards by the gradients of the
    # network, then take the mean for each parameter
    new_gradients = []
    for var_index, _ in enumerate(model.parameters()):
        means = []
        for game_index, rewards in enumerate(all_rewards):
            for step, reward in enumerate(rewards):
                means.append(reward * all_gradients[game_index][step][var_index])
        new_gradients.append(torch.mean(torch.stack(means), 0, True).squeeze(0))

    # Apply the new gradients to the network
    for p, g in zip(model.parameters(), new_gradients):
        p.grad = g.clone()
    optim.step()
```
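For completeness, `discount_and_normalize_rewards` is my port of the notebook's helper. Roughly, it discounts each game's rewards and then normalises them using the mean and standard deviation over all games in the batch; the sketch below captures the logic (exact variable names don't matter):

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    # Work backwards through the episode, accumulating discounted returns
    discounted = np.empty(len(rewards))
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    # Normalise with the mean/std over every step of every game in the batch
    return [(discounted - flat.mean()) / flat.std() for discounted in all_discounted]
```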
When I run the code for 250 iterations and print the average game length, I get:

```
Iteration: 50, Average Length: 18.2
Iteration: 100, Average Length: 23.4
Iteration: 150, Average Length: 29.9
Iteration: 200, Average Length: 11.2
Iteration: 250, Average Length: 38.6
```

The network isn't really improving, and training for longer doesn't help. My two questions are:

1. Is there anything obviously wrong with what I'm doing?
2. I've noticed that the log of the probability is used in the TensorFlow implementation, but I'm not sure how to integrate it here.
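For question 2, is the per-step loss supposed to look something like the sketch below, scoring the sampled action by its negative log-probability instead of using `BCEWithLogitsLoss` against a one-hot target? I'm not sure whether this is equivalent to what the notebook computes:

```python
# Hypothetical replacement for the per-step loss in the training loop above
logit = model(torch.tensor(obs, dtype=torch.float))
c = Categorical(logits=logit)   # softmax is applied internally
action = c.sample()
loss = -c.log_prob(action)      # gradient of -log pi(action | obs)
loss.backward()
```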