
I'm building an actor-critic reinforcement learning algorithm to solve environments, and I want to use a single encoder to learn a representation of the environment's state.

When I share the encoder with the actor and the critic, my network isn't learning anything:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    def __init__(self, state_dim):
        super(Encoder, self).__init__()
        self.l1 = nn.Linear(state_dim, 512)

    def forward(self, state):
        a = F.relu(self.l1(state))
        return a


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim, 128)
        self.l3 = nn.Linear(128, action_dim)
        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        # a = F.relu(self.l2(a))
        a = torch.tanh(self.l3(a)) * self.max_action
        return a


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 128)
        self.l3 = nn.Linear(128, 1)

    def forward(self, state, action):
        state_action = torch.cat([state, action], 1)
        q = F.relu(self.l1(state_action))
        # q = F.relu(self.l2(q))
        q = self.l3(q)
        return q
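For clarity, the shared encoder is wired in like this (a minimal sketch based on the call pattern I describe in the comments below; the dimensions and variable names are only illustrative):

# Illustrative dimensions, not from the original setup
obs_dim, act_dim, max_action = 24, 4, 1.0

encoder = Encoder(obs_dim)
actor = Actor(512, act_dim, max_action)   # 512 = encoder output size
critic = Critic(512, act_dim)

state = torch.randn(32, obs_dim)          # a batch of raw observations
enc = encoder(state)                      # shared representation
action = actor(enc)                       # actor head consumes the encoding
q = critic(enc, action)                   # critic head consumes the same encoding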

However, when I give the actor and the critic each their own encoder (no shared layers), it learns properly:

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)
        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        a = torch.tanh(self.l3(a)) * self.max_action
        return a


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

    def forward(self, state, action):
        state_action = torch.cat([state, action], 1)
        q = F.relu(self.l1(state_action))
        q = F.relu(self.l2(q))
        q = self.l3(q)
        return q

I'm pretty sure it's because of the optimizers. In the shared-encoder version, I define them as follows:

self.actor_optimizer = optim.Adam(list(self.actor.parameters()) + list(self.encoder.parameters()))
self.critic_optimizer = optim.Adam(list(self.critic.parameters()) + list(self.encoder.parameters()))

In the separate-encoder version, it's just:

self.actor_optimizer = optim.Adam(self.actor.parameters())
self.critic_optimizer = optim.Adam(self.critic.parameters())

I need two optimizers because of the actor-critic algorithm, in which the actor's loss is the value estimated by the critic.
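Concretely, the update looks roughly like this (a simplified, DDPG-style sketch; target networks are omitted and target_q stands in for the TD target):

# Inside the agent's training step (simplified; names follow the setup above).

# Critic update: backprop reaches the encoder through critic_optimizer.
enc = self.encoder(state)
critic_loss = F.mse_loss(self.critic(enc, action), target_q)
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()

# Actor update: backprop reaches the encoder again, this time through actor_optimizer.
enc = self.encoder(state)
actor_loss = -self.critic(enc, self.actor(enc)).mean()
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()

# The encoder therefore receives gradients from two different losses via two
# different Adam optimizers, each keeping its own moment estimates for the
# shared encoder parameters.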

How can I combine the two optimizers so that the encoder is optimized correctly?

  • I'm not sure I understand yet; are you using the encoder to transform the state representation into a new representation, and then feeding this new representation to the actor and critic networks? Also, where are you sharing the encoder in the code for the actor/critic? I didn't see where exactly. Commented Apr 28, 2019 at 21:40
  • @Hanzy Yes, I use the encoder to create a shared representation between the actor and the critic. I just call enc = encoder(x) -> actor(enc) / critic(enc). Commented Apr 30, 2019 at 15:43
  • I don't understand why you want to train the encoder this way. Why not train an autoencoder separately and then use the trained autoencoder to produce a representation that you send to the actor and critic? Also, why not just send them the raw representation? Just curious; maybe I misunderstand the motivation here. Commented Apr 30, 2019 at 16:20
  • Shouldn't the encoder be evaluated (and updated) based on how accurately it encodes the information it's given, rather than on the $Q$ values of different state-action pairs? Why update the representation when updating a Q value? Commented Apr 30, 2019 at 17:30
  • How do I train an autoencoder separately? What would the training target be? I do send the raw representation to the encoder, and I want it to be updated by both the actor and the critic so that it gets better at representing the data. Commented Apr 30, 2019 at 17:36

2 Answers


Just use one class inheriting from nn.Module called e.g. ActorCriticModel.

Then have two members, self.actor and self.critic, and define them with the desired architectures. In the forward() method, return two values: the actor output (a vector) and the critic value (a scalar).

This way you can use only one optimizer.
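A minimal sketch of what this could look like (the layer sizes and the class name below are illustrative, not prescriptive):

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class ActorCriticModel(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(ActorCriticModel, self).__init__()
        self.encoder = nn.Linear(state_dim, 512)              # shared encoder
        self.actor = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),
        )
        self.critic = nn.Sequential(
            nn.Linear(512 + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )
        self.max_action = max_action

    def forward(self, state, action):
        enc = F.relu(self.encoder(state))
        pi = self.actor(enc) * self.max_action                # action vector
        q = self.critic(torch.cat([enc, action], 1))          # scalar Q value
        return pi, q


model = ActorCriticModel(state_dim=24, action_dim=4, max_action=1.0)
optimizer = optim.Adam(model.parameters())                    # one optimizer for everything
# One backward pass per update, e.g. loss = critic_loss + actor_loss,
# then optimizer.zero_grad(); loss.backward(); optimizer.step().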


A common practice is to use a shared encoder that is updated by the critic loss only, as implemented in DrQ-v2.
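The essential pattern is to detach the encoding in the actor step, so only the critic loss shapes the representation. A minimal sketch, reusing the Encoder/Actor/Critic from the question and assuming state, action, and target_q batches already exist (the loss terms are placeholders, not DrQ-v2's exact objectives):

import torch.nn.functional as F
import torch.optim as optim

# One optimizer for critic + encoder, one for the actor only (names are illustrative).
encoder_critic_optimizer = optim.Adam(list(critic.parameters()) + list(encoder.parameters()))
actor_optimizer = optim.Adam(actor.parameters())

# Critic update: gradients flow through the encoder.
enc = encoder(state)
critic_loss = F.mse_loss(critic(enc, action), target_q)
encoder_critic_optimizer.zero_grad()
critic_loss.backward()
encoder_critic_optimizer.step()

# Actor update: detach the encoding so the actor loss never touches the encoder.
enc = encoder(state).detach()
actor_loss = -critic(enc, actor(enc)).mean()
actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()

DrQ-v2 adds other ingredients (image augmentation, n-step TD targets), which are omitted here; the critic-only encoder update is the part relevant to this question.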
