🔧 Proposed code refactoring
Deprecate RLHF with PPO in favor of DPO.
Motivation
RLHF with PPO involves an extra step to train a reward model (see #175).
For good results, the reward model needs to be of very high quality and tuned on a large amount of data. While OpenAI and Meta have successfully used PPO in their pipelines, the open-source community has struggled to get good results, most likely due to the lack of good training data for the reward model.
DPO (#530) is a technique that trains on human feedback in a more stable way, using pairs of preferred and rejected samples, and without the need for an additional reward model. It is also much quicker to train because the generation steps are skipped.
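For context, a minimal sketch of the DPO loss is shown below. It assumes per-sequence log-probabilities for the chosen and rejected responses have already been computed under both the trained policy and a frozen reference model; the function and argument names are illustrative only, not part of this repo's API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_chosen | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_rejected | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_chosen | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_rejected | x), shape (batch,)
    beta: float = 0.1,                    # temperature controlling deviation from the reference model
) -> torch.Tensor:
    # Log-ratios of policy vs. reference for each response in the pair.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO objective: push the chosen log-ratio above the rejected one.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

Because the loss only needs log-probabilities of fixed preference pairs, no sampling from the policy and no separate reward model are required during training.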