(Related question is here: Deriving 'State-Action marginal' in Reinforcement Learning)
The CS 285 lecture (UC Berkeley) https://www.youtube.com/watch?v=GKoKNYaBvM0&list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps&index=15 (at 3:31) states that the following two objectives are equivalent:
$$ \theta^\star = \arg\max_{\theta} J(\theta), \qquad J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t}r(s_t,a_t)\Big] $$ and $$ J(\theta) = \sum_{t} E_{(s_t,a_t) \sim p_\theta(s_t,a_t)}\big[r(s_t, a_t)\big] $$
If we start from Eq. 1, $$ J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t}r(s_t,a_t)\Big], $$ then with some derivation plus the likelihood-ratio (log-derivative) trick, we arrive at the REINFORCE algorithm with the following update rule, as shown in the lecture notes: $$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \Big(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t} |s_{i,t}) \Big) \Big(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Big), $$ where $N$ is the number of sampled trajectories. What confuses me is that the second factor of this update requires us to sum up all the rewards from $t = 1$ to $T$ for each sampled trajectory. But if we instead derive from the state-action-marginal form, $$ J(\theta) = \sum_{t} E_{(s_t,a_t) \sim p_\theta(s_t,a_t)}\big[r(s_t, a_t)\big], $$
it seems to me that the resulting update rule will be something like: $$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t} |s_{i,t})\, r(s_{i,t}, a_{i,t}), $$ because each expectation and its reward are taken with respect to a specific time step. But if the two objectives are equivalent, as the lecture shows, we should arrive at the same update rule.
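To make the difference concrete, here is a minimal numerical sketch of the two candidate estimators. All arrays are hypothetical placeholders standing in for sampled rollout quantities, and I treat $\theta$ as a scalar so each per-step score $\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})$ is a single number:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 5  # number of trajectories, horizon

# Placeholder data: in a real rollout these would come from the environment
# and the policy. grad_log_pi[i, t] stands in for the scalar score
# grad_theta log pi_theta(a_{i,t} | s_{i,t}); rewards[i, t] for r(s_{i,t}, a_{i,t}).
grad_log_pi = rng.normal(size=(N, T))
rewards = rng.normal(size=(N, T))

# Rule 1 (REINFORCE from the lecture): pair the per-trajectory *summed*
# score with the per-trajectory *summed* return, then average over i.
grad_full_return = np.mean(grad_log_pi.sum(axis=1) * rewards.sum(axis=1))

# Rule 2 (what I get from the marginal form): pair each step's score with
# that step's reward only, sum over t, then average over i.
grad_per_step = np.mean((grad_log_pi * rewards).sum(axis=1))

# Rule 1 contains all the cross terms grad_log_pi[i, t] * rewards[i, t'] with
# t != t'; Rule 2 keeps only the diagonal t == t' terms, so on a given batch
# of samples the two estimates generally disagree.
print(grad_full_return, grad_per_step)
```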
Can anyone advise what I am missing?