The following objective is taken from the paper 'Training language models to follow instructions with human feedback' (InstructGPT):

$$\text{objective}(\phi)=E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[r_\theta(x, y)-\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right]+\gamma E_{x \sim D_{\text {pretrain}}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right] \tag{2}$$

It is used to fine-tune the pre-trained language model with Proximal Policy Optimization (PPO). In the original PPO paper, the KL-penalized objective is as follows:

$$L^{KLPEN}(\theta)=\hat{E}_t\left[\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text{old}}}\left(a_t \mid s_t\right)} \hat{A}_t-\beta \, \mathrm{KL}\left[\pi_{\theta_{\text{old}}}\left(\cdot \mid s_t\right), \pi_\theta\left(\cdot \mid s_t\right)\right]\right] \tag{5}$$
Comparing the two objectives, the term with $\beta$ in equation 2 must correspond to the KL term in equation 5.
There was a previous question on this, whose answer pointed out that no ML/RL background is really needed and gave the correspondence
$$\pi_\phi^{\mathrm{RL}}= \pi_\theta\left(a_t \mid s_t\right) \quad \text{and} \quad \pi_\phi^{\mathrm{SFT}}= \pi_{\theta_{\text{old}}}\left(a_t \mid s_t\right).$$
My question is different: I don't quite understand how the two KL terms are equivalent. If we take the InstructGPT objective and isolate the KL part, we have
$$E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[-\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right] = -\beta \, E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right] = -\beta \, \mathrm{KL}\left(\pi_\phi^{\mathrm{RL}} \,\|\, \pi^{\mathrm{SFT}}\right)$$
while the term in the PPO equation is effectively $-\beta \, \mathrm{KL}\left(\pi^{\mathrm{SFT}} \,\|\, \pi_\phi^{\mathrm{RL}}\right)$, i.e. with the arguments in the opposite order?
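To make the direction issue concrete, here is a minimal numerical sketch (the two distributions are hypothetical toy values, not from either paper) showing that KL divergence is not symmetric, so the two orderings above are genuinely different quantities:

```python
import numpy as np

# Two toy next-token distributions over a 3-token vocabulary
# (hypothetical values chosen only to illustrate the asymmetry).
p = np.array([0.7, 0.2, 0.1])   # stands in for pi^RL
q = np.array([0.5, 0.3, 0.2])   # stands in for pi^SFT

def kl(a, b):
    """KL(a || b) = sum_i a_i * log(a_i / b_i)."""
    return float(np.sum(a * np.log(a / b)))

print("KL(p || q) =", kl(p, q))
print("KL(q || p) =", kl(q, p))
# The two values differ, so which policy sits in the first slot
# of the KL term changes the penalty being optimized.
```

In general $\mathrm{KL}(P \,\|\, Q) \neq \mathrm{KL}(Q \,\|\, P)$, which is exactly why the ordering of $\pi_\phi^{\mathrm{RL}}$ and $\pi^{\mathrm{SFT}}$ in the two objectives matters.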