Replies: 5 comments
-
I came across the OIG dataset, a large open-source instruction dataset of approximately 43M instructions. I believe it could be a valuable resource for anyone here working on chatbot technology and related projects. The dataset was released by LAION.ai together with its volunteers, Ontocord, Together, and other members of the open-source community. The purpose of the release is to create equal access to chatbot technology and encourage improvements from contributors.
-
As we work towards training the RLHF LLaMA model, I propose that we stack all the datasets we have found for this task. By combining these datasets, we can increase the variety and size of our training data, which will ultimately improve the accuracy and performance of our model. I suggest that we carefully review each dataset to ensure that it is relevant to our task and meets our quality standards. We should also consider the format and structure of each dataset and determine the best way to combine them into a single training set.
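To make the proposal concrete, here is a minimal sketch of the stacking step, assuming the Hugging Face `datasets` library and two hypothetical local JSON files that each use their own field names; the file names and field mappings are placeholders, not a vetted list.

```python
# Minimal sketch: map each source onto a common schema, then concatenate.
# Assumes the Hugging Face `datasets` library; file/field names are placeholders.
from datasets import load_dataset, concatenate_datasets

def to_common_schema(example, instruction_key, output_key):
    # Normalize one source's fields to a shared (instruction, output) schema.
    return {"instruction": example[instruction_key], "output": example[output_key]}

# Hypothetical files; swap in whichever datasets we settle on after review.
ds_a = load_dataset("json", data_files="dataset_a.json", split="train")
ds_b = load_dataset("json", data_files="dataset_b.json", split="train")

ds_a = ds_a.map(lambda ex: to_common_schema(ex, "instruction", "response"),
                remove_columns=ds_a.column_names)
ds_b = ds_b.map(lambda ex: to_common_schema(ex, "prompt", "completion"),
                remove_columns=ds_b.column_names)

# Single shuffled training set with a uniform schema.
combined = concatenate_datasets([ds_a, ds_b]).shuffle(seed=42)
combined.to_json("combined_instructions.jsonl")
```

A quality-filtering pass (deduplication, length limits, language checks) could be slotted in before the concatenation.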
-
How is any of this related to llama.cpp?
-
Not an expert, but this seems well explained. The question of how to improve the quality of inference results by applying an RLHF algorithm to datasets like the ones described is very interesting. But is that the role of the llama.cpp project? I think llama.cpp exists to let us convert, quantize, and above all run (inference phase) models in the most optimized way. Maybe optimizing current datasets with RLHF should be the goal of another open-source project, and/or this conversation should be moved to the "discussion" tab.
-
A similar approach to RLHF (but lower quality) is used to train Alpaca and Vicuna, among others. Vote to close this issue.
-
Link to ColossalChat
Add RLHF like ColossalChat on a bigger dataset to achieve ChatGPT quality
Although models in the GPT series, such as ChatGPT and GPT-4, are highly powerful, they are unlikely to be fully open-sourced. Fortunately, the open-source community has been working hard to address this.
For example, Meta has open-sourced the LLaMA model, which comes in parameter sizes ranging from 7 billion to 65 billion. The 13 billion parameter variant can outperform the 175 billion parameter GPT-3 on most benchmarks. However, since it has no instruction-tuning stage, its actual generated results are not satisfactory.
Stanford’s Alpaca generates training data in a self-instructed manner by calling OpenAI’s API. With only 7 billion parameters, this lightweight model can be fine-tuned at a fraction of the cost to achieve conversational performance similar to a very large language model like GPT-3.5 with 175 billion parameters.
However, existing open-source solutions can only be considered as supervised fine-tuned models in the first stage of RLHF (Reinforcement Learning from Human Feedback), with subsequent alignment and fine-tuning stages not performed. Additionally, Alpaca’s training dataset is limited to English, which to some extent restricts the model’s performance.
The impressive results of ChatGPT and GPT-4, by contrast, are due to the introduction of RLHF into the training process, which makes the generated content more consistent with human values.
Open-Source Training Dataset
ColossalChat releases a bilingual dataset comprising approximately 100,000 Q&A pairs in English and Chinese. The seed data was collected and cleaned from real-life question scenarios on social media platforms, then expanded using self-instruct techniques; annotation costs were approximately $900. Compared to datasets generated by other self-instruct methods, this dataset contains more realistic and diverse seed data and covers a wider range of topics. It is suitable for both fine-tuning and RLHF training. With this high-quality data, ColossalChat can achieve better dialogue interactions and also support Chinese.
RLHF Algorithm Replication
The RLHF algorithm replication involves three stages (rough sketches of each stage are given at the end of this post):
In RLHF-Stage1, the model is fine-tuned with supervised instruction tuning on the datasets mentioned above.
In RLHF-Stage2, human annotators rank different outputs for the same prompt, and these rankings supervise the training of a reward model that assigns a score to each response.
In RLHF-Stage3, reinforcement learning is applied, which is the most complex part of the training process:
In the PPO part, ColossalChat follows a two-stage process: first, the make-experience stage, which uses the SFT (Supervised Fine-Tuning), Actor, RM (Reward Model), and Critic models to compute generated experience and store it in a buffer; then the parameter-update stage, which computes the policy loss and value loss from that experience.
In the PTX part, ColossalChat calculates the cross-entropy loss between the Actor's output response and the response part of the input corpus. This loss adds pre-training gradients to the PPO gradient to preserve the language model's original capabilities and prevent forgetting. Finally, the policy loss, value loss, and PTX loss are summed for backpropagation and the parameter update.
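For Stage 1, a compact sketch of supervised instruction fine-tuning, assuming Hugging Face `transformers` and a JSON-lines instruction dataset; the checkpoint path, prompt template, and hyperparameters are placeholders, not ColossalChat's actual recipe.

```python
# Stage 1 sketch: supervised instruction fine-tuning with the Hugging Face Trainer.
# Checkpoint path, prompt template, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "path/to/base-llama-checkpoint"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token      # many causal LM tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

data = load_dataset("json", data_files="combined_instructions.jsonl", split="train")

def tokenize(example):
    # Simple instruction/response template; loss is taken over the whole sequence.
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    tokens = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-5),
    train_dataset=data,
)
trainer.train()
```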
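For Stage 2, a minimal PyTorch sketch of the idea: the reward model is trained so the human-preferred response scores higher than the rejected one, using the standard pairwise ranking loss. This follows the description above, not ColossalChat's actual code; the backbone interface is assumed to be a Hugging Face-style transformer.

```python
# Stage 2 sketch: reward model with a scalar value head and a pairwise ranking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone               # assumed HF-style transformer body
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Score each sequence from the last token's hidden state.
        return self.value_head(hidden[:, -1]).squeeze(-1)

def ranking_loss(chosen_scores, rejected_scores):
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch:
    # pushes the preferred response to receive a higher reward.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```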
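For Stage 3, a rough sketch of how the losses described above could be combined: a clipped PPO policy loss, a value loss, and the PTX cross-entropy loss summed into one objective. It mirrors the description in this post rather than ColossalChat's implementation; the loss coefficients are illustrative assumptions.

```python
# Stage 3 sketch: clipped PPO policy loss + value loss + PTX loss, summed.
import torch
import torch.nn.functional as F

def ppo_policy_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Standard clipped surrogate objective on the generated (experience) tokens.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def value_loss(values, returns):
    # Critic regression towards the computed returns.
    return F.mse_loss(values, returns)

def ptx_loss(actor_logits, pretrain_labels):
    # Cross-entropy on the response part of the pre-training/SFT corpus,
    # keeping the actor close to its original language-modeling behaviour.
    return F.cross_entropy(actor_logits.view(-1, actor_logits.size(-1)),
                           pretrain_labels.view(-1), ignore_index=-100)

def total_loss(logprobs, old_logprobs, advantages, values, returns,
               actor_logits, pretrain_labels,
               value_coef=0.5, ptx_coef=0.9):   # coefficients are assumptions
    return (ppo_policy_loss(logprobs, old_logprobs, advantages)
            + value_coef * value_loss(values, returns)
            + ptx_coef * ptx_loss(actor_logits, pretrain_labels))
```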