
(Related question is here: Deriving 'State-Action marginal' in Reinforcement Learning)

The CS 285 lecture (UC Berkeley) https://www.youtube.com/watch?v=GKoKNYaBvM0&list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps&index=15 (at 3:31) states that the following two objectives are equivalent:

$$ J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t}r(s_t,a_t)\Big] $$ and $$ J(\theta) = \sum_t E_{(s_t,a_t) \sim p_\theta(s_t,a_t)}\big[r(s_t, a_t)\big], $$ both to be maximized over $\theta$.

If we start from Eq. 1, $$ J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t}r(s_t,a_t)\Big], $$ then some derivation plus the likelihood-ratio trick leads to the REINFORCE algorithm with the following update rule, as shown in the lecture notes: $$ \nabla_\theta J \propto \frac{1}{N}\sum_{i=1}^N \Big(\sum_{t=1}^T\nabla_\theta \pi(a_{i,t} |s_{i,t}) \Big) \Big(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Big), $$ where $N$ is the number of sampled trajectories. What confuses me is that the second factor of this update requires us to sum up all the rewards from $t = 1$ to $T$ for each sampled trajectory. But if we instead derive from Eq. 2, $$ J(\theta) = \sum_t E_{(s_t,a_t) \sim p_\theta(s_t,a_t)}\big[r(s_t, a_t)\big], $$

it seems to me that the resulting update rule will be something like $$ \nabla_\theta J \propto \frac{1}{N}\sum_{i=1}^N \Big(\sum_{t=1}^T\nabla_\theta \pi(a_{i,t} |s_{i,t}) \, r(s_{i,t}, a_{i,t}) \Big), $$ because the expectation and the reward are with respect to a specific time step. But if the two objectives are equivalent, as the lecture shows, we should arrive at the same update rule.
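As a sanity check on the likelihood-ratio (score-function) trick used in the derivation, here is a minimal numerical sketch. The one-parameter Bernoulli "policy" and the reward table are my own illustrative choices, not from the lecture:

```python
import numpy as np

# Toy check of the likelihood-ratio (score-function) trick:
#   d/dtheta E_{x ~ p_theta}[f(x)] = E_{x ~ p_theta}[f(x) * d/dtheta log p_theta(x)].
# Here p_theta is a Bernoulli(theta) "policy" over a single action x in {0, 1},
# and f is an arbitrary reward table (both are illustrative assumptions).

rng = np.random.default_rng(0)
theta = 0.3
f = np.array([1.0, 4.0])          # reward for x = 0 and x = 1

# Analytic gradient: E[f] = (1-theta)*f(0) + theta*f(1), so dE/dtheta = f(1) - f(0).
analytic = f[1] - f[0]

# Score-function Monte Carlo estimate of the same gradient.
n = 200_000
x = rng.binomial(1, theta, size=n)
# d/dtheta log p_theta(x): 1/theta if x = 1, else -1/(1 - theta).
score = np.where(x == 1, 1.0 / theta, -1.0 / (1.0 - theta))
estimate = np.mean(f[x] * score)

print(analytic, estimate)         # both approximately 3.0
```

With enough samples the Monte Carlo estimate matches the analytic gradient, which is the identity both derivations in the question rely on.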

Can anyone advise what I am missing?


1 Answer


You're right that while the second, step-based objective conceptually works on state-action pairs, its sampling still happens over trajectories, the natural unit of sampling in RL, to ensure that we generate valid sequences of states and actions. However, both of your update rules obtained via the score-function (likelihood-ratio) trick are slightly off, since the gradient should act on the log-policy, and should be corrected as below:

$$ \nabla_\theta J \propto \frac{1}{N}\sum_{i=1}^N \Big(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t} |s_{i,t}) \Big) \Big(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Big) $$ $$ \nabla_\theta J \propto \frac{1}{N}\sum_{i=1}^N \Big(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t} |s_{i,t}) r(s_{i,t}, a_{i,t}) \Big) $$

Also notice that these two equivalent policy-gradient update rules differ not only in the reward weighting you mentioned, but also in how the log-policy gradients and rewards are multiplied within any trajectory $i$. In the trajectory-based formulation, you first sum the log-policy gradients across all steps of a trajectory and then weight that sum by the trajectory's total reward; in the step-based formulation, you first weight each step's log-policy gradient by that same step's reward and then sum across all steps of the trajectory.
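This structural difference can be sketched in code. The setup below is purely illustrative, not the lecture's implementation: a tabular softmax policy with its closed-form log-gradient, and randomly generated "trajectories". It just shows how the two nesting orders produce different estimates on a finite sample:

```python
import numpy as np

# Assumed setup (illustrative): tabular softmax policy pi(a|s) = softmax(theta[s]),
# whose gradient of log pi(a|s) w.r.t. theta is onehot(s, a) - softmax(theta[s]) on row s.

rng = np.random.default_rng(1)
n_states, n_actions, T, N = 4, 3, 5, 8
theta = rng.normal(size=(n_states, n_actions))

def grad_log_pi(s, a):
    p = np.exp(theta[s] - theta[s].max())
    p /= p.sum()
    g = np.zeros_like(theta)
    g[s] = -p
    g[s, a] += 1.0
    return g

# Fake sampled trajectories (states, actions, rewards); a real agent would
# roll these out from the environment.
S = rng.integers(0, n_states, size=(N, T))
A = rng.integers(0, n_actions, size=(N, T))
R = rng.normal(size=(N, T))

g_traj = np.zeros_like(theta)   # trajectory-based: (sum of grads) * (sum of rewards)
g_step = np.zeros_like(theta)   # step-based: each grad weighted by its own step reward
for i in range(N):
    grads = [grad_log_pi(S[i, t], A[i, t]) for t in range(T)]
    g_traj += sum(grads) * R[i].sum()
    g_step += sum(g * R[i, t] for t, g in enumerate(grads))
g_traj /= N
g_step /= N

print(np.allclose(g_traj, g_step))
```

On the same batch the two estimators generally disagree, because the trajectory-based one contains cross terms $\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, r(s_{i,t'}, a_{i,t'})$ with $t \ne t'$.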

Therefore the two update rules are indeed equivalent as gradients of the same objective. The trajectory-based formulation yields the REINFORCE Monte Carlo method you referenced, while the step-based formulation leads toward actor-critic-like temporal-difference (TD) methods, which in practice are more stable than the unbiased REINFORCE method because each log-policy gradient is weighted only by its stepwise immediate reward.

  • Thanks for the explanation. I am not sure how you arrive at the conclusion that "the two update schemes are indeed equivalent", though. Would you mind elaborating? On the surface, the two update rules clearly look different. Commented Nov 26, 2024 at 4:00
  • $$ \nabla_\theta J \propto \frac{1}{N}\sum_{i=1}^N \Big(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t} |s_{i,t}) \Big) \Big(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Big) \\ = \frac{1}{N}\sum_{i=1}^N \Big(\sum_{t=1}^T \Big[\nabla_\theta \log \pi_\theta(a_{i,t} |s_{i,t}) \Big(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Big) \Big]\Big) \\ \neq \frac{1}{N}\sum_{i=1}^N \Big(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t} |s_{i,t}) r(s_{i,t}, a_{i,t}) \Big) $$ Commented Nov 26, 2024 at 4:08
  • Indeed they're different, as my "not only... but also..." noted: beyond the reward-weighting difference you mentioned. "Equivalent" is meant in the sense of optimizing the same objective, as gradients of the same $J$, despite the different expressions (your first reference also derives this equivalence in detail, and you acknowledged it in the OP). Though theoretically equivalent in expectation, in practical numerical computation they behave quite differently, since you only sample finitely: REINFORCE is known for its high variance, while actor-critic is known for more stable and efficient learning. Hope this clarifies. Commented Nov 26, 2024 at 4:08
  • I can relate actor-critic to the first formulation (by replacing the sampled cumulative reward with an estimated value function) but not to the second, state-action-marginal formulation. The key confusion, still, is that two equivalent objectives lead to very different update rules: one updates each step's log-probability of the action using the total reward, whereas the other updates it using ONLY the reward at that particular time step. Commented Nov 26, 2024 at 4:18
  • @Jing Focus on this part of the answer: "For the trajectory-based formulation you need to first sum the log-policy gradients across all steps of a trajectory then weigh the sum with the sum of reward of the same trajectory." The first update rule's practical variant yields the REINFORCE Monte Carlo method (not actor-critic, as you claimed!), since the weight is the trajectory's total reward multiplied by the sum of the gradients of the log-policies. The second step in your above comment's formula is wrong, which suggests you're not aware of this critical difference, and that may be the root cause of your confusion. Commented Nov 26, 2024 at 4:55
