[examples] Big refactor of examples and documentation #509

Merged: 17 commits, Jul 14, 2023
2 changes: 1 addition & 1 deletion README.md
@@ -163,7 +163,7 @@ train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```

### Advanced example: IMDB sentiment
For a detailed example check out the example python script `examples/sentiment/scripts/gpt2-sentiment.py`, where GPT2 is fine-tuned to generate positive movie reviews. An few examples from the language models before and after optimisation are given below:
For a detailed example, check out the example Python script `examples/scripts/sentiment_tuning.py`, where GPT2 is fine-tuned to generate positive movie reviews. A few examples from the language models before and after optimisation are given below:
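As a rough sketch of what that script does (the reward-classifier checkpoint and generation settings below are illustrative assumptions, not the script's exact code), the loop combines generation, a sentiment classifier as reward, and a PPO step:

```python
import torch
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# policy model with a value head, plus its tokenizer
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config=PPOConfig(batch_size=1), model=model, tokenizer=tokenizer)

# a sentiment classifier serves as the reward signal (illustrative checkpoint)
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

query_tensor = tokenizer.encode("This movie was really", return_tensors="pt")

# sample a continuation and strip the prompt tokens from the output
output = ppo_trainer.generate(query_tensor[0], max_new_tokens=20, do_sample=True)
response_tensor = output[0, query_tensor.shape[1]:]

# score how positive the generated review sounds
# (simplified: the real script uses the positive-class logit as the reward)
response_text = tokenizer.decode(response_tensor)
reward = [torch.tensor(sentiment_pipe(response_text)[0]["score"])]

# one PPO optimisation step on the (query, response, reward) triple
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor], reward)
```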

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/table_imdb_preview.png" width="800">
20 changes: 9 additions & 11 deletions docs/source/_toctree.yml
@@ -6,33 +6,31 @@
- local: installation
title: Installation
- local: customization
title: Customize your training
title: Customize your Training
- local: logging
title: Understanding logs
title: Understanding Logs
title: Get started
- sections:
- local: models
title: Model Classes
- local: trainer
title: Trainer Classes
- local: reward_trainer
title: Training your own reward model
title: Reward Model Training
- local: sft_trainer
title: Supervised fine-tuning
- local: extras
title: Extras - Better model output without reinforcement learning
title: Supervised Fine-Tuning
- local: best_of_n
title: Best of N Sampling
title: API
- sections:
- local: sentiment_tuning
title: Sentiment Tuning
- local: lora_tuning_peft
title: Peft support - Low rank adaption of 8 bit models
- local: summarization_reward_tuning
title: Summarization Reward Tuning
title: Training with PEFT
- local: detoxifying_a_lm
title: Detoxifying a Language Model
- local: using_llama_models
title: Using LLaMA with TRL
title: Training StackLlama
- local: multi_adapter_rl
title: Multi Adapter RL (MARL) - a single base model for everything
title: Multi Adapter RLHF
title: Examples
2 changes: 1 addition & 1 deletion docs/source/extras.mdx → docs/source/best_of_n.mdx
@@ -1,4 +1,4 @@
# Extras: Alternative ways to get better model output without RL based fine-tuning
# Best of N sampling: Alternative ways to get better model output without RL based fine-tuning

Within the extras module is the `best-of-n` sampler class that serves as an alternative method of generating better model output.
To see how it fares against RL-based fine-tuning, please look in the `examples` directory for a comparison example.
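A rough sketch of how the sampler can be wired up follows; the scoring callable, checkpoint names, and constructor arguments are assumptions for illustration, so double-check the API docs and the examples directory for the exact usage:

```python
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
from trl.extras import BestOfNSampler

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# any callable mapping a list of generated texts to a list of scores can act as the judge
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

def queries_to_scores(texts):
    return [output["score"] for output in sentiment_pipe(texts)]

best_of_n = BestOfNSampler(
    model,
    tokenizer,
    queries_to_scores,
    length_sampler=LengthSampler(4, 16),  # how many new tokens to sample
    sample_size=4,                        # candidates generated per query
)

queries = ["This movie was", "The acting in this film"]
query_tensors = [tokenizer.encode(q, return_tensors="pt").squeeze(0) for q in queries]

# returns, for each query, the highest-scoring candidate(s)
print(best_of_n.generate(query_tensors, device="cpu"))
```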
6 changes: 3 additions & 3 deletions docs/source/detoxifying_a_lm.mdx
@@ -8,8 +8,8 @@ Here's an overview of the notebooks and scripts in the [TRL toxicity repository]

| File | Description | Colab link |
|---|---| --- |
| [`gpt-j-6b-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/gpt-j-6b-toxicity.py) | Detoxify `GPT-J-6B` using PPO | x |
| [`evaluate-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/evaluate-toxicity.py) | Evaluate de-toxified models using `evaluate` | x |
| [`gpt-j-6b-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/toxicity/scripts/gpt-j-6b-toxicity.py) | Detoxify `GPT-J-6B` using PPO | x |
| [`evaluate-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/toxicity/scripts/evaluate-toxicity.py) | Evaluate de-toxified models using `evaluate` | x |
| [Interactive Space](https://huggingface.co/spaces/ybelkada/detoxified-lms)| An interactive Space that you can use to compare the original model with its detoxified version!| x |

## Context
@@ -174,7 +174,7 @@ Below are few generation examples of `gpt-j-6b-detox` model:
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-toxicity-examples.png">
</div>

The evaluation script can be found [here](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/evaluate-toxicity.py).
The evaluation script can be found [here](https://github.com/lvwerra/trl/blob/main/examples/research_projects/toxicity/scripts/evaluate-toxicity.py).
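As a rough illustration of what such an evaluation does (the measurement id, the sample texts, and the aggregation below are assumptions, not the script's exact code), the `evaluate` library can score generations for toxicity directly:

```python
import evaluate

# the `toxicity` measurement scores text with a hate-speech classifier under the hood
toxicity = evaluate.load("toxicity", module_type="measurement")

# in the real script these would be completions sampled from the model under evaluation
generations = [
    "I really enjoyed talking with you today.",
    "Thanks a lot for your help, this was great.",
]

results = toxicity.compute(predictions=generations)
print(results["toxicity"])       # one toxicity score per generation
print(max(results["toxicity"]))  # e.g. also track the worst-case score
```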

### Discussions

26 changes: 24 additions & 2 deletions docs/source/index.mdx
@@ -4,6 +4,28 @@

# TRL - Transformer Reinforcement Learning

With the TRL (Transformer Reinforcement Learning) library you can train transformer language models with reinforcement learning. The library is integrated with 🤗 [transformers](https://github.com/huggingface/transformers).
TRL is a full-stack library providing a set of tools to train transformer language models with Reinforcement Learning, from the Supervised Fine-tuning (SFT) and Reward Modeling (RM) steps to the Proximal Policy Optimization (PPO) step.
The library is integrated with 🤗 [transformers](https://github.com/huggingface/transformers).

TRL supports decoder models such as GPT-2, BLOOM, GPT-Neo which can all be optimized using Proximal Policy Optimization (PPO). You can find installation instructions in the [installation guide](installation) and an introduction to the library in the [Quickstart section](quickstart). There is also a more [in-depth example](sentiment_tuning) to tune GPT-2 to produce positive movie reviews.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png">
</div>

Check the appropriate sections of the documentation depending on your needs:

API documentation:

Comment on lines +16 to +17 (Member):
What do you think about adding 1 sentence here per bullet point for people who don't know e.g. what an SFTTrainer could do or what a StackLlama is :)

Reply (Contributor Author):
Makes sense!

- [Model Classes](models): *A brief overview of what each public model class does.*
- [`SFTTrainer`](sft_trainer): *Fine-tune your model easily on supervised data with `SFTTrainer`.*
- [`RewardTrainer`](reward_trainer): *Easily train your reward model using `RewardTrainer`.*
- [`PPOTrainer`](trainer): *Further fine-tune the supervised fine-tuned model using the PPO algorithm.*
- [Best-of-N Sampling](best_of_n): *Use best-of-n sampling as an alternative way to sample predictions from your active model.*


Examples:

- [Sentiment Tuning](sentiment_tuning): *Fine-tune your model to generate positive movie reviews.*
- [Training with PEFT](lora_tuning_peft): *Memory efficient RLHF training using adapters with PEFT*
- [Detoxifying LLMs](detoxifying_a_lm): *Detoxify your language model through RLHF*
- [StackLlama](using_llama_models): *End-to-end RLHF training of a Llama model on the Stack Exchange dataset.*
- [Multi-Adapter Training](multi_adapter_rl): *Use a single base model and multiple adapters for memory efficient end-to-end training*
13 changes: 3 additions & 10 deletions docs/source/lora_tuning_peft.mdx
@@ -7,14 +7,9 @@ Here's an overview of the `peft`-enabled notebooks and scripts in the [trl repos

| File | Task | Description | Colab link |
|---|---| --- |
| [`gpt2-sentiment_peft.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment_peft.py) | Sentiment | Same as the sentiment analysis example, but learning a low rank adapter on a 8-bit base model | |
| [`cm_finetune_peft_imdb.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neox-20b_peft/cm_finetune_peft_imdb.py) | Sentiment | Fine tuning a low rank adapter on a frozen 8-bit model for text generation on the imdb dataset. | |
| [`merge_peft_adapter.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neox-20b_peft/merge_peft_adapter.py) | 🤗 Hub | Merging of the adapter layers into the base model’s weights and storing these on the hub. | |
| [`gpt-neo-20b_sentiment_peft.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neox-20b_peft/gpt-neo-20b_sentiment_peft.py) | Sentiment | Sentiment fine-tuning of a low rank adapter to create positive reviews. | |
| [`gpt-neo-1b_peft.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neo-1b-multi-gpu/gpt-neo-1b_peft.py) | Sentiment | Sentiment fine-tuning of a low rank adapter to create positive reviews using 2 GPUs. | |
| [`stack_llama/rl_training.py`](https://github.com/lvwerra/trl/blob/main/examples/stack_llama/scripts/rl_training.py) | RLHF | Distributed fine-tuning of the 7b parameter LLaMA models with a learned reward model and `peft`. | |
| [`stack_llama/reward_modeling.py`](https://github.com/lvwerra/trl/blob/main/examples/stack_llama/scripts/reward_modeling.py) | Reward Modeling | Distributed training of the 7b parameter LLaMA reward model with `peft`. | |
| [`stack_llama/supervised_finetuning.py`](https://github.com/lvwerra/trl/blob/main/examples/stack_llama/scripts/supervised_finetuning.py) | SFT | Distributed instruction/supervised fine-tuning of the 7b parameter LLaMA model with `peft`. | |
| [`stack_llama/rl_training.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/stack_llama/scripts/rl_training.py) | RLHF | Distributed fine-tuning of the 7b parameter LLaMA models with a learned reward model and `peft`. | |
| [`stack_llama/reward_modeling.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/stack_llama/scripts/reward_modeling.py) | Reward Modeling | Distributed training of the 7b parameter LLaMA reward model with `peft`. | |
| [`stack_llama/supervised_finetuning.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/stack_llama/scripts/supervised_finetuning.py) | SFT | Distributed instruction/supervised fine-tuning of the 7b parameter LLaMA model with `peft`. | |

## Installation
Note: peft is in active development, so we install directly from their Github page.
@@ -132,8 +127,6 @@ Simply load your model with a custom `device_map` argument on the `from_pretrain

Also make sure to have the `lm_head` module on the first GPU device, as it may throw an error if it is not on the first device. At the time of writing, you need to install the `main` branch of `accelerate`: `pip install git+https://github.com/huggingface/accelerate.git@main` and `peft`: `pip install git+https://github.com/huggingface/peft.git@main`.
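As a sketch of what such a custom `device_map` can look like, here is a hypothetical two-GPU layout for `gpt2`; the module names and split are assumptions to illustrate the idea, so adapt them to your own architecture:

```python
from trl import AutoModelForCausalLMWithValueHead

# hypothetical two-GPU split for `gpt2` (12 blocks): embeddings, the first half of the
# blocks and, importantly, `lm_head` stay on GPU 0; the rest goes to GPU 1
device_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.drop": 0,
    **{f"transformer.h.{i}": (0 if i < 6 else 1) for i in range(12)},
    "transformer.ln_f": 1,
    "lm_head": 0,
}

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "gpt2",
    device_map=device_map,
)
```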

That all you need to do to use NPP. Check out the [gpt-neo-1b_peft.py](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neo-1b-multi-gpu/gpt-neo-1b_peft.py) example for a more details usage of NPP.

### Launch scripts

Although the `trl` library is powered by `accelerate`, you should run your training script in a single process. Note that we do not support Data Parallelism together with NPP yet.
2 changes: 1 addition & 1 deletion docs/source/multi_adapter_rl.mdx
@@ -11,7 +11,7 @@ You just need to install `peft` and optionally install `bitsandbytes` as well if
You need to address this approach in three stages that we summarize as follows:

1- Train a base model on the target domain (e.g. `imdb` dataset) - this is the Supervised Fine Tuning stage - it can leverage the `SFTTrainer` from TRL.
2- Train a reward model using `peft`. This is required in order to re-use the adapter during the RL optimisation process (step 3 below). We show an example of leveraging the `RewardTrainer` from TRL in [this example](https://github.com/lvwerra/trl/tree/main/examples/0-abstraction-RL/reward_modeling.py)
2- Train a reward model using `peft`. This is required in order to re-use the adapter during the RL optimisation process (step 3 below). We show an example of leveraging the `RewardTrainer` from TRL in [this example](https://github.com/lvwerra/trl/tree/main/examples/scripts/reward_trainer.py)
3- Fine tune new adapters on the base model using PPO and the reward adapter. ("0 abstraction RL")

Make sure to use the same model (i.e. same architecture and same weights) for stages 2 & 3.
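A condensed sketch of stage 3 is shown below; the base-model and reward-adapter names are placeholders, and the PPO loop itself is only hinted at in comments:

```python
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base_model_name = "huggyllama/llama-7b"            # placeholder base model
rm_adapter_id = "your-org/your-rm-peft-adapter"    # placeholder adapter from stage 2

# the policy adapter that PPO will train
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    base_model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,  # the reward model shares the same base weights
    load_in_8bit=True,             # optional, requires bitsandbytes
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

ppo_trainer = PPOTrainer(config=PPOConfig(batch_size=8), model=model, tokenizer=tokenizer)

# inside the PPO loop, rewards come from the reward adapter on the shared base model:
#   inputs = tokenizer(prompt + response, return_tensors="pt")
#   reward = ppo_trainer.model.compute_reward_score(**inputs)
```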
2 changes: 2 additions & 0 deletions docs/source/reward_trainer.mdx
@@ -2,6 +2,8 @@

TRL supports custom reward modeling for anyone to perform reward modeling on their dataset and model.

Check out a complete and flexible example in [`examples/scripts/reward_trainer.py`](https://github.com/lvwerra/trl/tree/main/examples/scripts/reward_trainer.py).

## Expected dataset format

The reward trainer expects a very specific format for the dataset, since the model will be trained to predict which of two given sentences is the more relevant. We provide an example from the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset below:
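For illustration, a minimal sketch of how such chosen/rejected pairs can be tokenized into the columns `RewardTrainer` expects; the base model, dataset slice, and training arguments here are assumptions, and `examples/scripts/reward_trainer.py` remains the reference:

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer

model_name = "facebook/opt-350m"  # assumption: any sequence-classification-capable model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)

raw_dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1%]")

def preprocess(examples):
    # the trainer expects tokenized "chosen" and "rejected" columns
    chosen = tokenizer(examples["chosen"], truncation=True)
    rejected = tokenizer(examples["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = raw_dataset.map(preprocess, batched=True)

trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="reward_model",
        per_device_train_batch_size=2,
        remove_unused_columns=False,  # keep the chosen/rejected columns
    ),
    train_dataset=dataset,
)
trainer.train()
```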
15 changes: 9 additions & 6 deletions docs/source/sentiment_tuning.mdx
@@ -6,12 +6,11 @@ Here's an overview of the notebooks and scripts in the [trl repository](https://

| File | Description | Colab link |
|---|---| --- |
| [`gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb) | Fine-tune GPT2 to generate positive movie reviews. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb)
| [`gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) | Fine-tune GPT2 to generate positive movie reviews. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb)
|
| [`gpt2-sentiment-control.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb) | Fine-tune GPT2 to generate movie reviews with controlled sentiment. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb)
| [`gpt2-sentiment-control.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment-control.ipynb) | Fine-tune GPT2 to generate movie reviews with controlled sentiment. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb)
|
| [`gpt2-sentiment.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py) | Same as the notebook, but easier to use to use in multi-GPU setup. | x |
| [`t5-sentiment.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/t5-sentiment.py) | Same as GPT2 script, but for a Seq2Seq model (T5). | x |
| [`gpt2-sentiment.py`](https://github.com/lvwerra/trl/blob/main/examples/ppo_trainer/sentiment_tuning.py) | Same as the notebook, but easier to use in a multi-GPU setup with any architecture. | x |


## Installation
@@ -31,5 +30,9 @@ The `trl` library is powered by `accelerate`. As such it is best to configure an

```bash
accelerate config # will prompt you to define the training configuration
accelerate launch scripts/gpt2-sentiment.py # launches training
```
accelerate launch yourscript.py # launches training
```

## A few notes on multi-GPU

To run in a multi-GPU setup with DDP (Distributed Data Parallel), change the `device_map` value to `device_map={"": Accelerator().process_index}` and make sure to run your script with `accelerate launch yourscript.py`. If you want to apply naive pipeline parallelism, you can use `device_map="auto"`.
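A sketch of the loading pattern that note describes (the model name is illustrative):

```python
from accelerate import Accelerator
from trl import AutoModelForCausalLMWithValueHead

# each DDP process pins the full model to its own GPU
current_device = Accelerator().process_index

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "gpt2",
    device_map={"": current_device},
)

# launch with: accelerate launch yourscript.py
# for naive pipeline parallelism, pass device_map="auto" instead
```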
2 changes: 2 additions & 0 deletions docs/source/sft_trainer.mdx
@@ -2,6 +2,8 @@

Supervised fine-tuning (or SFT for short) is a crucial step in RLHF. In TRL we provide an easy-to-use API to create your SFT models and train them with a few lines of code on your dataset.

Check out a complete and flexible example in [`examples/scripts/sft_trainer.py`](https://github.com/lvwerra/trl/tree/main/examples/scripts/sft_trainer.py).

## Quickstart

If you have a dataset hosted on the 🤗 Hub, you can easily fine-tune your SFT model using [`SFTTrainer`] from TRL. Let us assume your dataset is `imdb`, the text you want to predict is inside the `text` field of the dataset, and you want to fine-tune the `facebook/opt-350m` model.
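A minimal sketch of that setup, mirroring the sentence above; treat it as illustrative rather than the exact snippet from the docs:

```python
from datasets import load_dataset
from trl import SFTTrainer

# the text to learn from lives in the `text` column of the imdb dataset
dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",        # a model name or an already-loaded model both work
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
```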
30 changes: 0 additions & 30 deletions docs/source/summarization_reward_tuning.mdx

This file was deleted.

4 changes: 4 additions & 0 deletions docs/source/trainer.mdx
@@ -16,6 +16,10 @@ We also support a `RewardTrainer` that can be used to train a reward model.

[[autodoc]] RewardTrainer

## SFTTrainer

[[autodoc]] SFTTrainer

## set_seed

[[autodoc]] set_seed