I'm doing research on actor-critic methods and I want to make sure that I understand them correctly.
First of all, I understand that, since actor-critic is a combination of value-based and policy-based methods, it uses both kinds of function to estimate the optimal policy. The critic's loss is the Mean Squared Error (MSE), i.e. the square of the TD error, which trains an accurate estimate of the value function; that estimate is then used as feedback for the actor. The loss is: $$L(\phi) = \dfrac{1}{2}\left(V_\phi(s_t) - \left(r_t + \gamma\,V_\phi(s_{t+1})\right)\right)^{2}$$ Treating the TD target $r_t + \gamma\,V_\phi(s_{t+1})$ as a constant (the usual semi-gradient convention), its gradient is: $$\nabla_\phi L(\phi) = \underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\left(V_\phi(s_t) - \left(r_t + \gamma\,V_\phi(s_{t+1})\right)\right)\nabla_\phi V_\phi(s_t)\right]$$ Now for the actor we would use the policy-gradient objective: $$\nabla_\theta\,J(\theta) = \underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\nabla_\theta\log\pi_\theta(a_t\,|\,s_t)\,R(\tau)\right]$$
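To check my understanding of the critic update concretely, here is a tiny NumPy sketch of one gradient step on the TD loss for a linear value function $V_\phi(s) = \phi^\top x(s)$. The features, reward, and learning rate are just numbers I made up for illustration:

```python
import numpy as np

# Illustrative setup (my own assumptions): linear critic V_phi(s) = phi @ x(s),
# one observed transition (s_t, r_t, s_{t+1}), discount gamma.
phi = np.array([0.5, -0.2])                                # critic parameters
x_t, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # state features
r_t, gamma, alpha = 1.0, 0.99, 0.1                         # reward, discount, step size

v_t = phi @ x_t                     # V_phi(s_t)
v_next = phi @ x_next               # V_phi(s_{t+1})
td_target = r_t + gamma * v_next    # treated as a constant (semi-gradient)
td_error = v_t - td_target          # delta_t
loss = 0.5 * td_error ** 2          # L(phi) = 1/2 * delta_t^2

# nabla_phi L = delta_t * nabla_phi V_phi(s_t); for a linear critic
# nabla_phi V_phi(s_t) is just the feature vector x_t.
grad_phi = td_error * x_t
phi = phi - alpha * grad_phi        # one gradient-descent step on the critic
```

Since the TD error here is negative (the critic underestimated the return), the update raises $V_\phi(s_t)$, which is the behavior I'd expect.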
Finally, for Advantage Actor-Critic methods, we would use the advantage function in both the actor's objective and the critic's loss: $$\nabla_\theta J(\theta) = \underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\nabla_\theta\log\pi_\theta(a_t\,|\,s_t)\,A(s_t,\,a_t)\right]$$ $$\nabla_\phi L(\phi) = \underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\nabla_\phi(A(s_t,\,a_t))^{2}\right]$$
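And to make the actor update concrete, here is a small NumPy sketch of one policy-gradient step for a softmax policy with linear logits $\theta\,x(s)$, where the advantage is estimated by the critic's TD error, $A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)$ (the two-action setup and all numbers are my own illustrative assumptions):

```python
import numpy as np

# Illustrative setup: softmax policy over 2 actions, logits = theta @ x(s).
theta = np.array([[0.1, -0.1],      # weights: one row of logit weights per action
                  [0.0,  0.2]])
x_t = np.array([1.0, 0.5])          # state features
a_t = 0                             # sampled action
advantage = -0.302                  # TD-error estimate of A(s_t, a_t) from the critic
beta = 0.01                         # actor learning rate

logits = theta @ x_t
probs = np.exp(logits - logits.max())
probs /= probs.sum()                # pi_theta(. | s_t)

# For a softmax, the gradient of log pi(a_t|s_t) w.r.t. the logits is
# (one_hot(a_t) - probs); chaining through the linear logits gives an
# outer product with the features x_t.
one_hot = np.eye(2)[a_t]
grad_log_pi = np.outer(one_hot - probs, x_t)

# Gradient-ascent step on J(theta): theta += beta * grad_log_pi * A
theta = theta + beta * grad_log_pi * advantage
```

Because the advantage is negative here, the step lowers the logit of the taken action, so the policy becomes less likely to repeat it, which matches the intuition behind the objective.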
Where $\tau$ is a trajectory sampled from $\pi_\theta$. Am I missing something, or is this a good summary of AC methods? Feel free to make any suggestions or improvements.