[examples] Big refactor of examples and documentation #509

Merged: 17 commits, Jul 14, 2023
2 changes: 1 addition & 1 deletion README.md
@@ -163,7 +163,7 @@ train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```

### Advanced example: IMDB sentiment
For a detailed example check out the example python script `examples/sentiment/scripts/gpt2-sentiment.py`, where GPT2 is fine-tuned to generate positive movie reviews. An few examples from the language models before and after optimisation are given below:
For a detailed example, check out the example Python script `examples/scripts/sentiment_tuning.py`, where GPT2 is fine-tuned to generate positive movie reviews. A few examples from the language models before and after optimisation are given below:
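As a rough sketch of what that script does (the reward-classifier checkpoint and generation settings below are illustrative assumptions, not the script's exact code), the loop combines generation, a sentiment classifier as reward, and a PPO step:

```python
import torch
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# policy model with a value head, plus its tokenizer
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config=PPOConfig(batch_size=1), model=model, tokenizer=tokenizer)

# a sentiment classifier serves as the reward signal (illustrative checkpoint)
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

query_tensor = tokenizer.encode("This movie was really", return_tensors="pt")

# sample a continuation and strip the prompt tokens from the output
output = ppo_trainer.generate(query_tensor[0], max_new_tokens=20, do_sample=True)
response_tensor = output[0, query_tensor.shape[1]:]

# score how positive the generated review sounds
# (simplified: the real script uses the positive-class logit as the reward)
response_text = tokenizer.decode(response_tensor)
reward = [torch.tensor(sentiment_pipe(response_text)[0]["score"])]

# one PPO optimisation step on the (query, response, reward) triple
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor], reward)
```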

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/table_imdb_preview.png" width="800">
20 changes: 9 additions & 11 deletions docs/source/_toctree.yml
@@ -6,33 +6,31 @@
- local: installation
title: Installation
- local: customization
title: Customize your training
title: Customize your Training
- local: logging
title: Understanding logs
title: Understanding Logs
title: Get started
- sections:
- local: models
title: Model Classes
- local: trainer
title: Trainer Classes
- local: reward_trainer
title: Training your own reward model
title: Reward Model Training
- local: sft_trainer
title: Supervised fine-tuning
- local: extras
title: Extras - Better model output without reinforcement learning
title: Supervised Fine-Tuning
- local: best_of_n
title: Best of N Sampling
title: API
- sections:
- local: sentiment_tuning
title: Sentiment Tuning
- local: lora_tuning_peft
title: Peft support - Low rank adaption of 8 bit models
- local: summarization_reward_tuning
title: Summarization Reward Tuning
title: Training with PEFT
- local: detoxifying_a_lm
title: Detoxifying a Language Model
- local: using_llama_models
title: Using LLaMA with TRL
title: Training StackLlama
- local: multi_adapter_rl
title: Multi Adapter RL (MARL) - a single base model for everything
title: Multi Adapter RLHF
title: Examples
2 changes: 1 addition & 1 deletion docs/source/extras.mdx → docs/source/best_of_n.mdx
@@ -1,4 +1,4 @@
# Extras: Alternative ways to get better model output without RL based fine-tuning
# Best of N sampling: Alternative ways to get better model output without RL based fine-tuning

Within the extras module is the `best-of-n` sampler class that serves as an alternative method of generating better model output.
To see how it fares against RL-based fine-tuning, please look in the `examples` directory for a comparison example.
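A rough sketch of how the sampler can be wired up follows; the scoring callable, checkpoint names, and constructor arguments are assumptions for illustration, so double-check the API docs and the examples directory for the exact usage:

```python
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
from trl.extras import BestOfNSampler

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# any callable mapping a list of generated texts to a list of scores can act as the judge
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

def queries_to_scores(texts):
    return [output["score"] for output in sentiment_pipe(texts)]

best_of_n = BestOfNSampler(
    model,
    tokenizer,
    queries_to_scores,
    length_sampler=LengthSampler(4, 16),  # how many new tokens to sample
    sample_size=4,                        # candidates generated per query
)

queries = ["This movie was", "The acting in this film"]
query_tensors = [tokenizer.encode(q, return_tensors="pt").squeeze(0) for q in queries]

# returns, for each query, the highest-scoring candidate(s)
print(best_of_n.generate(query_tensors, device="cpu"))
```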
6 changes: 3 additions & 3 deletions docs/source/detoxifying_a_lm.mdx
@@ -8,8 +8,8 @@ Here's an overview of the notebooks and scripts in the [TRL toxicity repository]

| File | Description | Colab link |
|---|---| --- |
| [`gpt-j-6b-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/gpt-j-6b-toxicity.py) | Detoxify `GPT-J-6B` using PPO | x |
| [`evaluate-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/evaluate-toxicity.py) | Evaluate de-toxified models using `evaluate` | x |
| [`gpt-j-6b-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/toxicity/scripts/gpt-j-6b-toxicity.py) | Detoxify `GPT-J-6B` using PPO | x |
| [`evaluate-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/toxicity/scripts/evaluate-toxicity.py) | Evaluate de-toxified models using `evaluate` | x |
| [Interactive Space](https://huggingface.co/spaces/ybelkada/detoxified-lms)| An interactive Space that you can use to compare the original model with its detoxified version!| x |

## Context
@@ -174,7 +174,7 @@ Below are few generation examples of `gpt-j-6b-detox` model:
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-toxicity-examples.png">
</div>

The evaluation script can be found [here](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/evaluate-toxicity.py).
The evaluation script can be found [here](https://github.com/lvwerra/trl/blob/main/examples/research_projects/toxicity/scripts/evaluate-toxicity.py).
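As a rough illustration of what such an evaluation does (the measurement id, the sample texts, and the aggregation below are assumptions, not the script's exact code), the `evaluate` library can score generations for toxicity directly:

```python
import evaluate

# the `toxicity` measurement scores text with a hate-speech classifier under the hood
toxicity = evaluate.load("toxicity", module_type="measurement")

# in the real script these would be completions sampled from the model under evaluation
generations = [
    "I really enjoyed talking with you today.",
    "Thanks a lot for your help, this was great.",
]

results = toxicity.compute(predictions=generations)
print(results["toxicity"])       # one toxicity score per generation
print(max(results["toxicity"]))  # e.g. also track the worst-case score
```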

### Discussions

26 changes: 24 additions & 2 deletions docs/source/index.mdx
@@ -4,6 +4,28 @@

# TRL - Transformer Reinforcement Learning

With the TRL (Transformer Reinforcement Learning) library you can train transformer language models with reinforcement learning. The library is integrated with 🤗 [transformers](https://github.com/huggingface/transformers).
TRL is a full-stack library providing a set of tools to train transformer language models with Reinforcement Learning, from the Supervised Fine-tuning (SFT) and Reward Modeling (RM) steps to the Proximal Policy Optimization (PPO) step.
The library is integrated with 🤗 [transformers](https://github.com/huggingface/transformers).

TRL supports decoder models such as GPT-2, BLOOM, GPT-Neo which can all be optimized using Proximal Policy Optimization (PPO). You can find installation instructions in the [installation guide](installation) and an introduction to the library in the [Quickstart section](quickstart). There is also a more [in-depth example](sentiment_tuning) to tune GPT-2 to produce positive movie reviews.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png">
</div>

Check the appropriate sections of the documentation depending on your needs:

API documentation:

Comment on lines +16 to +17 (Member):
What do you think about adding 1 sentence here per bullet point for people who don't know e.g. what an SFTTrainer could do or what a StackLlama is :)

Reply (Contributor Author):
Makes sense!

- [Model Classes](models): *A brief overview of what each public model class does.*
- [`SFTTrainer`](sft_trainer): *Fine-tune your model easily on supervised data with `SFTTrainer`.*
- [`RewardTrainer`](reward_trainer): *Easily train your reward model using `RewardTrainer`.*
- [`PPOTrainer`](trainer): *Further fine-tune the supervised fine-tuned model using the PPO algorithm.*
- [Best-of-N Sampling](best_of_n): *Use best-of-n sampling as an alternative way to sample predictions from your active model.*


Examples:

- [Sentiment Tuning](sentiment_tuning): *Fine-tune your model to generate positive movie reviews.*
- [Training with PEFT](lora_tuning_peft): *Memory efficient RLHF training using adapters with PEFT*
- [Detoxifying LLMs](detoxifying_a_lm): *Detoxify your language model through RLHF*
- [StackLlama](using_llama_models): *End-to-end RLHF training of a Llama model on the Stack Exchange dataset.*
- [Multi-Adapter Training](multi_adapter_rl): *Use a single base model and multiple adapters for memory efficient end-to-end training*
13 changes: 3 additions & 10 deletions docs/source/lora_tuning_peft.mdx
@@ -7,14 +7,9 @@ Here's an overview of the `peft`-enabled notebooks and scripts in the [trl repos

| File | Task | Description | Colab link |
|---|---| --- |
| [`gpt2-sentiment_peft.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment_peft.py) | Sentiment | Same as the sentiment analysis example, but learning a low rank adapter on a 8-bit base model | |
| [`cm_finetune_peft_imdb.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neox-20b_peft/cm_finetune_peft_imdb.py) | Sentiment | Fine tuning a low rank adapter on a frozen 8-bit model for text generation on the imdb dataset. | |
| [`merge_peft_adapter.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neox-20b_peft/merge_peft_adapter.py) | 🤗 Hub | Merging of the adapter layers into the base model’s weights and storing these on the hub. | |
| [`gpt-neo-20b_sentiment_peft.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neox-20b_peft/gpt-neo-20b_sentiment_peft.py) | Sentiment | Sentiment fine-tuning of a low rank adapter to create positive reviews. | |
| [`gpt-neo-1b_peft.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neo-1b-multi-gpu/gpt-neo-1b_peft.py) | Sentiment | Sentiment fine-tuning of a low rank adapter to create positive reviews using 2 GPUs. | |
| [`stack_llama/rl_training.py`](https://github.com/lvwerra/trl/blob/main/examples/stack_llama/scripts/rl_training.py) | RLHF | Distributed fine-tuning of the 7b parameter LLaMA models with a learned reward model and `peft`. | |
| [`stack_llama/reward_modeling.py`](https://github.com/lvwerra/trl/blob/main/examples/stack_llama/scripts/reward_modeling.py) | Reward Modeling | Distributed training of the 7b parameter LLaMA reward model with `peft`. | |
| [`stack_llama/supervised_finetuning.py`](https://github.com/lvwerra/trl/blob/main/examples/stack_llama/scripts/supervised_finetuning.py) | SFT | Distributed instruction/supervised fine-tuning of the 7b parameter LLaMA model with `peft`. | |
| [`stack_llama/rl_training.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/stack_llama/scripts/rl_training.py) | RLHF | Distributed fine-tuning of the 7b parameter LLaMA models with a learned reward model and `peft`. | |
| [`stack_llama/reward_modeling.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/stack_llama/scripts/reward_modeling.py) | Reward Modeling | Distributed training of the 7b parameter LLaMA reward model with `peft`. | |
| [`stack_llama/supervised_finetuning.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/stack_llama/scripts/supervised_finetuning.py) | SFT | Distributed instruction/supervised fine-tuning of the 7b parameter LLaMA model with `peft`. | |

## Installation
Note: peft is in active development, so we install directly from their Github page.
@@ -132,8 +127,6 @@ Simply load your model with a custom `device_map` argument on the `from_pretrain

Also make sure to have the `lm_head` module on the first GPU device, as it may throw an error if it is not on the first device. At the time of writing, you need to install the `main` branch of `accelerate`: `pip install git+https://github.com/huggingface/accelerate.git@main` and `peft`: `pip install git+https://github.com/huggingface/peft.git@main`.
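As a sketch of what such a custom `device_map` can look like, here is a hypothetical two-GPU layout for `gpt2`; the module names and split are assumptions to illustrate the idea, so adapt them to your own architecture:

```python
from trl import AutoModelForCausalLMWithValueHead

# hypothetical two-GPU split for `gpt2` (12 blocks): embeddings, the first half of the
# blocks and, importantly, `lm_head` stay on GPU 0; the rest goes to GPU 1
device_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.drop": 0,
    **{f"transformer.h.{i}": (0 if i < 6 else 1) for i in range(12)},
    "transformer.ln_f": 1,
    "lm_head": 0,
}

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "gpt2",
    device_map=device_map,
)
```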

That all you need to do to use NPP. Check out the [gpt-neo-1b_peft.py](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neo-1b-multi-gpu/gpt-neo-1b_peft.py) example for a more details usage of NPP.

### Launch scripts

Although the `trl` library is powered by `accelerate`, you should run your training script in a single process. Note that we do not support Data Parallelism together with NPP yet.
2 changes: 1 addition & 1 deletion docs/source/multi_adapter_rl.mdx
@@ -11,7 +11,7 @@ You just need to install `peft` and optionally install `bitsandbytes` as well if
You need to address this approach in three stages that we summarize as follows:

1- Train a base model on the target domain (e.g. `imdb` dataset) - this is the Supervised Fine Tuning stage - it can leverage the `SFTTrainer` from TRL.
2- Train a reward model using `peft`. This is required in order to re-use the adapter during the RL optimisation process (step 3 below). We show an example of leveraging the `RewardTrainer` from TRL in [this example](https://github.com/lvwerra/trl/tree/main/examples/0-abstraction-RL/reward_modeling.py)
2- Train a reward model using `peft`. This is required in order to re-use the adapter during the RL optimisation process (step 3 below). We show an example of leveraging the `RewardTrainer` from TRL in [this example](https://github.com/lvwerra/trl/tree/main/examples/scripts/reward_trainer.py)
3- Fine tune new adapters on the base model using PPO and the reward adapter. ("0 abstraction RL")

Make sure to use the same model (i.e. same architecture and same weights) for stages 2 & 3.
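A condensed sketch of stage 3 is shown below; the base-model and reward-adapter names are placeholders, and the PPO loop itself is only hinted at in comments:

```python
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base_model_name = "huggyllama/llama-7b"            # placeholder base model
rm_adapter_id = "your-org/your-rm-peft-adapter"    # placeholder adapter from stage 2

# the policy adapter that PPO will train
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    base_model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,  # the reward model shares the same base weights
    load_in_8bit=True,             # optional, requires bitsandbytes
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

ppo_trainer = PPOTrainer(config=PPOConfig(batch_size=8), model=model, tokenizer=tokenizer)

# inside the PPO loop, rewards come from the reward adapter on the shared base model:
#   inputs = tokenizer(prompt + response, return_tensors="pt")
#   reward = ppo_trainer.model.compute_reward_score(**inputs)
```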
2 changes: 2 additions & 0 deletions docs/source/reward_trainer.mdx
@@ -2,6 +2,8 @@

TRL supports custom reward modeling for anyone to perform reward modeling on their dataset and model.

Check out a complete and flexible example in [`examples/scripts/reward_trainer.py`](https://github.com/lvwerra/trl/tree/main/examples/scripts/reward_trainer.py).

## Expected dataset format

The reward trainer expects a very specific format for the dataset, since the model will be trained to predict which of two given sentences is the more relevant. We provide an example from the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset below:
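For illustration, a minimal sketch of how such chosen/rejected pairs can be tokenized into the columns `RewardTrainer` expects; the base model, dataset slice, and training arguments here are assumptions, and `examples/scripts/reward_trainer.py` remains the reference:

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer

model_name = "facebook/opt-350m"  # assumption: any sequence-classification-capable model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)

raw_dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1%]")

def preprocess(examples):
    # the trainer expects tokenized "chosen" and "rejected" columns
    chosen = tokenizer(examples["chosen"], truncation=True)
    rejected = tokenizer(examples["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = raw_dataset.map(preprocess, batched=True)

trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="reward_model",
        per_device_train_batch_size=2,
        remove_unused_columns=False,  # keep the chosen/rejected columns
    ),
    train_dataset=dataset,
)
trainer.train()
```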
15 changes: 9 additions & 6 deletions docs/source/sentiment_tuning.mdx
@@ -6,12 +6,11 @@ Here's an overview of the notebooks and scripts in the [trl repository](https://

| File | Description | Colab link |
|---|---| --- |
| [`gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb) | Fine-tune GPT2 to generate positive movie reviews. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb)
| [`gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) | Fine-tune GPT2 to generate positive movie reviews. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb)
|
| [`gpt2-sentiment-control.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb) | Fine-tune GPT2 to generate movie reviews with controlled sentiment. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb)
| [`gpt2-sentiment-control.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment-control.ipynb) | Fine-tune GPT2 to generate movie reviews with controlled sentiment. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb)
|
| [`gpt2-sentiment.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py) | Same as the notebook, but easier to use to use in multi-GPU setup. | x |
| [`t5-sentiment.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/t5-sentiment.py) | Same as GPT2 script, but for a Seq2Seq model (T5). | x |
| [`gpt2-sentiment.py`](https://github.com/lvwerra/trl/blob/main/examples/ppo_trainer/sentiment_tuning.py) | Same as the notebook, but easier to use in a multi-GPU setup with any architecture. | x |


## Installation
@@ -31,5 +30,9 @@ The `trl` library is powered by `accelerate`. As such it is best to configure an

```bash
accelerate config # will prompt you to define the training configuration
accelerate launch scripts/gpt2-sentiment.py # launches training
```
accelerate launch yourscript.py # launches training
```

## A few notes on multi-GPU

To run in a multi-GPU setup with DDP (Distributed Data Parallel), change the `device_map` value to `device_map={"": Accelerator().process_index}` and make sure to run your script with `accelerate launch yourscript.py`. If you want to apply naive pipeline parallelism, you can use `device_map="auto"`.
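A sketch of the loading pattern that note describes (the model name is illustrative):

```python
from accelerate import Accelerator
from trl import AutoModelForCausalLMWithValueHead

# each DDP process pins the full model to its own GPU
current_device = Accelerator().process_index

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "gpt2",
    device_map={"": current_device},
)

# launch with: accelerate launch yourscript.py
# for naive pipeline parallelism, pass device_map="auto" instead
```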
2 changes: 2 additions & 0 deletions docs/source/sft_trainer.mdx
@@ -2,6 +2,8 @@

Supervised fine-tuning (or SFT for short) is a crucial step in RLHF. In TRL we provide an easy-to-use API to create your SFT models and train them with a few lines of code on your dataset.

Check out a complete and flexible example in [`examples/scripts/sft_trainer.py`](https://github.com/lvwerra/trl/tree/main/examples/scripts/sft_trainer.py).

## Quickstart

If you have a dataset hosted on the 🤗 Hub, you can easily fine-tune your SFT model using [`SFTTrainer`] from TRL. Let us assume your dataset is `imdb`, the text you want to predict is inside the `text` field of the dataset, and you want to fine-tune the `facebook/opt-350m` model.
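A minimal sketch of that setup, mirroring the sentence above; treat it as illustrative rather than the exact snippet from the docs:

```python
from datasets import load_dataset
from trl import SFTTrainer

# the text to learn from lives in the `text` column of the imdb dataset
dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",        # a model name or an already-loaded model both work
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
```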
30 changes: 0 additions & 30 deletions docs/source/summarization_reward_tuning.mdx

This file was deleted.

4 changes: 4 additions & 0 deletions docs/source/trainer.mdx
@@ -16,6 +16,10 @@ We also support a `RewardTrainer` that can be used to train a reward model.

[[autodoc]] RewardTrainer

## SFTTrainer

[[autodoc]] SFTTrainer

## set_seed

[[autodoc]] set_seed