
I'm doing research on actor-critic methods and I want to make sure that I understand them correctly.

First of all, I understand that since actor-critic is a combination of value-based and policy-based methods, it uses both a value function and a parameterized policy to estimate the optimal policy. The critic is trained by minimizing a mean squared error (MSE) loss, namely the squared TD error, to learn an accurate estimate of the value function, which is then used as feedback for the actor:
$$L(\phi) = \dfrac{1}{2}\left(V_\phi(s_t) - \left(r_t + \gamma\,V_\phi(s_{t+1})\right)\right)^{2}$$
with gradient
$$\nabla_\phi L(\phi) = \underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\tfrac{1}{2}\nabla_\phi\left(V_\phi(s_t) - \left(r_t + \gamma\,V_\phi(s_{t+1})\right)\right)^2\right]$$
For the actor we would use the policy-gradient objective, whose gradient is:
$$\nabla_\theta\,J(\theta) = \underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\nabla_\theta\log\pi_\theta(a\,|\,s)\,R(\tau)\right]$$
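To make the two updates concrete, here is a minimal tabular sketch of one actor-critic step in numpy. All names and numbers (`V`, `theta`, the learning rates, the single sampled transition) are made up for illustration; in practice the critic update is a semi-gradient step, i.e. the bootstrap target $r_t + \gamma V_\phi(s_{t+1})$ is treated as a constant.

```python
import numpy as np

# Hypothetical tiny setup: tabular critic V and softmax actor over 2 actions.
n_states, n_actions = 3, 2
V = np.zeros(n_states)                   # critic: state-value estimates
theta = np.zeros((n_states, n_actions))  # actor: action preferences
gamma, alpha_v, alpha_pi = 0.99, 0.1, 0.01

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# One transition (s, a, r, s') sampled from the environment (illustrative).
s, a, r, s_next = 0, 1, 1.0, 2

# TD error: delta = r + gamma * V(s') - V(s)
delta = r + gamma * V[s_next] - V[s]

# Critic: semi-gradient step on the squared TD error
# (the target r + gamma * V(s') is held fixed during differentiation).
V[s] += alpha_v * delta

# Actor: policy-gradient step; for a softmax policy,
# grad log pi(a|s) = one_hot(a) - pi(.|s).
pi = softmax(theta[s])
grad_log_pi = -pi
grad_log_pi[a] += 1.0
theta[s] += alpha_pi * delta * grad_log_pi
```

Here the TD error `delta` plays the role of the learning signal for both networks, which is exactly the coupling that makes the critic "feedback for the actor".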

Finally, for Advantage Actor-Critic (A2C) methods, we would use the advantage function in both the objective and the loss gradients:
$$\nabla_\theta J(\theta) = \underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\nabla_\theta\log\pi_\theta(a_t\,|\,s_t)\,A(s_t,\,a_t)\right]$$
$$\nabla_\phi L(\phi) = \underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\nabla_\phi\left(A(s_t,\,a_t)\right)^{2}\right]$$
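A common one-step estimator for the advantage is the TD error itself, $A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)$. The sketch below computes these estimates for a short trajectory; `V_est` and the rewards are made-up illustrative numbers, not from any real environment.

```python
import numpy as np

# One-step advantage estimates over a 3-step trajectory:
# A(s_t, a_t) ~ r_t + gamma * V(s_{t+1}) - V(s_t)
gamma   = 0.9
rewards = np.array([1.0, 0.0, 2.0])
V_est   = np.array([0.5, 0.4, 0.3, 0.0])  # V(s_0)..V(s_3); V(s_3)=0 (terminal)

advantages = rewards + gamma * V_est[1:] - V_est[:-1]
# The actor loss to minimize would be -mean(log pi(a_t|s_t) * advantages),
# with `advantages` treated as constants (no gradient flows into the critic).
```

This is why, in practice, the critic's MSE loss and the advantage-weighted actor loss are computed from the same TD errors.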

Where $\tau$ is a trajectory sampled from $\pi_\theta$. Am I missing something, or is this a good summary of AC methods? Feel free to make any suggestions or improvements.


1 Answer


Indeed, it's a good summary of AC methods; a few clarifications might help your further research.

You didn't mention why the advantage function is beneficial for the A2C actor network: it significantly reduces the variance of the policy-gradient estimate compared to using the raw return $R(\tau)$, and thus stabilizes training. Also, it is rare for the critic's loss to be written in terms of the advantage function, except in some custom or dual-critic models. In addition, many modern AC methods such as A3C and PPO add an entropy regularization term $H(\pi_{\theta})$ to the actor's objective to encourage exploration.
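The entropy bonus mentioned above can be sketched as follows: compute the entropy of the current policy at a state and add it, scaled by a small coefficient, to the objective being maximized. The logits and the coefficient `beta` below are illustrative choices, not values prescribed by any specific paper.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

logits = np.array([2.0, 0.0, 0.0])      # illustrative actor outputs for one state
pi = softmax(logits)

# Policy entropy H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s);
# maximal (log 3 here) for a uniform policy, 0 for a deterministic one.
entropy = -np.sum(pi * np.log(pi))

beta = 0.01
# The actor would maximize: E[log pi(a|s) * A(s, a)] + beta * entropy,
# so a near-deterministic policy pays a small penalty, encouraging exploration.
```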

Finally, your summary assumes an on-policy setup. For popular off-policy methods such as DDPG and SAC, the update equations for the main networks would also involve an experience replay buffer and target network(s). The replay buffer improves sample efficiency, since experiences can be reused multiple times during training, and slow soft target updates improve learning dynamics by avoiding the non-convergence caused by a moving target, while keeping the data in the replay buffer relevant for a longer period.
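The "slow soft target update" used by DDPG and SAC is usually a Polyak average of the online parameters into the target parameters. A minimal sketch, with made-up parameter vectors and the commonly used coefficient `tau = 0.005`:

```python
import numpy as np

# Soft (Polyak) target-network update:
# target <- tau * online + (1 - tau) * target, with small tau.
tau = 0.005
online_params = np.array([1.0, 2.0])   # illustrative online-network weights
target_params = np.array([0.0, 0.0])   # illustrative target-network weights

target_params = tau * online_params + (1 - tau) * target_params
```

Because `tau` is small, the target network trails the online network slowly, which is what keeps the bootstrap targets stable.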

  • Thank you for your feedback. I wanted to make an introduction to AC methods, in which I would mention the introduction of the loss function for the critic (having previously defined value-based and policy-based methods and the objective function). Then I was thinking about defining the methods, the first being A2C, and their functions, so I would then continue with TRPO, PPO and SAC among others. However, right now I wanted to focus on A2C. You said that I did not mention the advantage for the actor network, but I included it in the fourth equation. — Commented Dec 7, 2024 at 23:29
  • Your plan to introduce A2C in the context of actor-critic methods sounds well-structured and pedagogically sound. In addition to the clarifications in my answer above, you may also want to ponder why off-policy SAC only needs target network(s) for the critic, while DDPG needs target networks for both the critic and the actor. Hope this clarifies and helps with your question. — Commented Dec 8, 2024 at 3:48
