The following objective is taken from the paper 'Training language models to follow instructions with human feedback' (InstructGPT):

$$\text{objective}(\phi)=E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[r_\theta(x, y)-\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right]+\gamma E_{x \sim D_{\text {pretrain}}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right] \tag{2}$$

It is used to fine-tune the pre-trained language model with Proximal Policy Optimization (PPO). In the original PPO paper, the KL-penalized objective is as follows:

$$L^{KLPEN}(\theta)=\hat{E}_t\left[\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text{old}}}\left(a_t \mid s_t\right)} \hat{A}_t-\beta \, \mathrm{KL}\left[\pi_{\theta_{\text{old}}}\left(\cdot \mid s_t\right), \pi_\theta\left(\cdot \mid s_t\right)\right]\right] \tag{5}$$
Comparing the two objectives, the term with $\beta$ in equation 2 must correspond to the KL term in equation 5.
There was a previous question on this, whose answer pointed out that no ML/RL background is really needed and gave the correspondence
$$\pi_\phi^{\mathrm{RL}}= \pi_\theta\left(a_t \mid s_t\right) \quad \text{and} \quad \pi_\phi^{\mathrm{SFT}}= \pi_{\theta_{\text{old}}}\left(a_t \mid s_t\right).$$
My question is different: I don't quite understand how the two KL terms are equivalent. If we take the InstructGPT objective and isolate the KL part, we have
$$E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[-\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right] = -\beta \, E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right] = -\beta \, \mathrm{KL}\left(\pi_\phi^{\mathrm{RL}} \,\|\, \pi^{\mathrm{SFT}}\right)$$
while the term in the PPO equation is effectively $-\beta \, \mathrm{KL}\left(\pi^{\mathrm{SFT}} \,\|\, \pi_\phi^{\mathrm{RL}}\right)$, i.e. with the arguments in the opposite order?
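To make the direction issue concrete, here is a minimal numerical sketch (the two distributions are hypothetical toy values, not from either paper) showing that KL divergence is not symmetric, so the two orderings above are genuinely different quantities:

```python
import numpy as np

# Two toy next-token distributions over a 3-token vocabulary
# (hypothetical values chosen only to illustrate the asymmetry).
p = np.array([0.7, 0.2, 0.1])   # stands in for pi^RL
q = np.array([0.5, 0.3, 0.2])   # stands in for pi^SFT

def kl(a, b):
    """KL(a || b) = sum_i a_i * log(a_i / b_i)."""
    return float(np.sum(a * np.log(a / b)))

print("KL(p || q) =", kl(p, q))
print("KL(q || p) =", kl(q, p))
# The two values differ, so which policy sits in the first slot
# of the KL term changes the penalty being optimized.
```

In general $\mathrm{KL}(P \,\|\, Q) \neq \mathrm{KL}(Q \,\|\, P)$, which is exactly why the ordering of $\pi_\phi^{\mathrm{RL}}$ and $\pi^{\mathrm{SFT}}$ in the two objectives matters.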