Datasets and Evaluation Metrics

The provided fine-tuning script lets you choose among three datasets by passing the dataset argument to the llama_recipes.finetuning module or the examples/finetuning.py script. The current options are grammar_dataset, alpaca_dataset and samsum_dataset. Additionally, we integrate the OpenAssistant/oasst1 dataset as an example of a custom dataset.

Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).

  • grammar_dataset contains 150K pairs of English sentences and possible corrections.
  • alpaca_dataset provides 52K instruction-response pairs as generated by text-davinci-003.
  • samsum_dataset contains about 16k messenger-like conversations with summaries.
  • OpenAssistant/oasst1 contains about 88k messages from assistant-style conversations.

Batching Strategies

Llama-recipes supports two strategies to batch requests together. The default setting is packing, which concatenates the tokenized samples into long sequences that fill up the context length of the model. This is the most compute-efficient variant because it avoids any padding and all sequences have the same length. Samples at the boundary of the context length are truncated, and the remainder of the cut sequence is used as the start of the next long sequence.
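The packing idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the llama-recipes implementation: the names pack_samples, make_sample and chunk_size are hypothetical, and the sketch simply drops the final partial chunk.

```python
from typing import Dict, Iterable, Iterator, List


def pack_samples(
    samples: Iterable[Dict[str, List[int]]], chunk_size: int
) -> Iterator[Dict[str, List[int]]]:
    """Concatenate tokenized samples and yield fixed-length chunks.

    Hypothetical helper for illustration; chunk_size stands in for the
    model's context length.
    """
    buffer: Dict[str, List[int]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for sample in samples:
        # Append every field of the sample to the running buffer.
        for key in buffer:
            buffer[key].extend(sample[key])
        # Emit full chunks; the remainder stays in the buffer and becomes
        # the start of the next packed sequence.
        while len(buffer["input_ids"]) >= chunk_size:
            yield {key: values[:chunk_size] for key, values in buffer.items()}
            buffer = {key: values[chunk_size:] for key, values in buffer.items()}
    # Assumption of this sketch: a final partial chunk is discarded.


def make_sample(ids: List[int]) -> Dict[str, List[int]]:
    """Build a toy tokenized sample from a list of token ids."""
    return {"input_ids": ids, "attention_mask": [1] * len(ids), "labels": list(ids)}


samples = [make_sample([1, 2, 3]), make_sample([4, 5, 6, 7]), make_sample([8, 9])]
packed = list(pack_samples(samples, chunk_size=4))
```

Here the second sample is cut at the chunk boundary: its first token completes the first chunk and its remaining tokens open the second one, which is the source of the noise discussed next.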

If the amount of training data is small, this procedure might introduce a lot of noise into the training data, which can hurt the prediction performance of the fine-tuned model. Therefore, we also support a padding strategy which does not introduce the additional noise caused by truncated sequences. The strategy tries to minimize the efficiency loss by batching samples of similar length together so that only minimal padding is necessary.
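A minimal sketch of length-grouped padding, assuming already tokenized samples, might look as follows. The helper names length_grouped_batches and pad_batch and the pad_id value are hypothetical; the sampler used by llama-recipes may work differently.

```python
from typing import Dict, List


def length_grouped_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Return batches of sample indices, grouped so lengths are similar."""
    # Sort sample indices by length so neighbours need little padding.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i : i + batch_size] for i in range(0, len(order), batch_size)]


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    return {
        "input_ids": [seq + [pad_id] * (max_len - len(seq)) for seq in batch],
        # The mask is 1 for real tokens and 0 for padding.
        "attention_mask": [
            [1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch
        ],
    }


token_lists = [[5, 6, 7, 8, 9], [1], [2, 3], [4, 5, 6, 7]]
batches = length_grouped_batches([len(t) for t in token_lists], batch_size=2)
padded = [pad_batch([token_lists[i] for i in idx]) for idx in batches]
```

In this toy example, grouping by length needs two padding tokens in total, whereas batching the samples in their original order would need six.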

The batching strategy can be selected through the command-line parameter --batching_strategy, which accepts either packing or padding.

Using custom datasets

The list of available datasets in llama-recipes is meant to give users a quick start on training their Llama model. There are two ways to use a custom dataset. The first is to provide a function that returns the dataset in a .py file, which can be passed to the command-line tool. This does not involve changing the source code of llama-recipes. The second targets contributions that extend llama-recipes, as it involves changing the source code.

Training on custom data

To supply a custom dataset you need to provide a single .py file which contains a function with the following signature:

def get_custom_dataset(dataset_config, tokenizer, split: str):

For an example get_custom_dataset you can look at the provided datasets in llama_recipes.datasets or examples/custom_dataset.py. The dataset_config in the above signature will be an instance of llama_recipes.configs.dataset.custom_dataset with the modifications made through the command line. The split argument signals whether to return the training or validation dataset. The default function name is get_custom_dataset, but this can be changed as described below.
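As a rough sketch of what such a file could contain, the example below builds a tiny dataset from in-memory strings. Several things are assumptions made for illustration: the _RAW data, the TokenizedTextDataset class, the split names "train" and "validation", and a tokenizer that can be called on a string and returns "input_ids" and "attention_mask" (as Hugging Face tokenizers do). dataset_config is accepted but unused here.

```python
from typing import Dict, List

# Hypothetical in-memory data; a real dataset would be loaded from disk
# or from a dataset hub.
_RAW: Dict[str, List[str]] = {
    "train": ["first training text", "second training text"],
    "validation": ["held-out text"],
}


class TokenizedTextDataset:
    """Map-style dataset: supports len() and integer indexing."""

    def __init__(self, samples: List[Dict[str, List[int]]]):
        self.samples = samples

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        return self.samples[index]


def get_custom_dataset(dataset_config, tokenizer, split: str):
    """Return the tokenized dataset for the requested split."""
    samples = []
    for text in _RAW[split]:
        encoded = tokenizer(text)
        input_ids = list(encoded["input_ids"])
        samples.append(
            {
                "input_ids": input_ids,
                "attention_mask": list(encoded["attention_mask"]),
                # For plain language modeling the labels mirror the inputs.
                "labels": list(input_ids),
            }
        )
    return TokenizedTextDataset(samples)
```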

To start training with the custom dataset, set the --dataset parameter as well as the --custom_dataset.file parameter:

python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]

To change the function name that is used in the .py file, append the name after a colon (:) like this:

python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]

This will call the function get_foo instead of get_custom_dataset when retrieving the dataset.
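To make the file:function convention concrete, here is one way such a value could be split. This is a hypothetical helper written for illustration, not the parsing code used by llama-recipes; split_file_spec and DEFAULT_FUNCTION_NAME are assumed names.

```python
from typing import Tuple

DEFAULT_FUNCTION_NAME = "get_custom_dataset"


def split_file_spec(spec: str) -> Tuple[str, str]:
    """Split 'path.py[:function]' into (path, function name)."""
    path, sep, func_name = spec.rpartition(":")
    # Only treat the suffix as a function name when the part before the
    # last colon is a .py path; this keeps drive letters such as "C:" intact.
    if sep and path.endswith(".py") and func_name:
        return path, func_name
    return spec, DEFAULT_FUNCTION_NAME


with_name = split_file_spec("examples/custom_dataset.py:get_foo")
without_name = split_file_spec("examples/custom_dataset.py")
```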

Adding a new dataset

Each dataset has a corresponding configuration (dataclass) in configs/datasets.py which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.

Additionally, there is a preprocessing function for each dataset in the datasets folder. The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling model(**data). For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
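A single sample in this dictionary form could be built as in the sketch below. The build_sample helper and the token ids are hypothetical. Masking the prompt with -100 is not required by the text above; it is a common convention with Hugging Face causal language models, whose loss ignores label positions set to -100.

```python
from typing import Dict, List

# Label value that the loss of Hugging Face causal LM models ignores.
IGNORE_INDEX = -100


def build_sample(prompt_ids: List[int], response_ids: List[int]) -> Dict[str, List[int]]:
    """Build one training sample from tokenized prompt and response."""
    input_ids = prompt_ids + response_ids
    return {
        "input_ids": input_ids,
        # Every position holds a real token, so the mask is all ones.
        "attention_mask": [1] * len(input_ids),
        # Optional choice in this sketch: train only on the response tokens.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }


sample = build_sample(prompt_ids=[11, 12, 13], response_ids=[21, 22])
```

All three fields have the same length, so a batch of such samples can be passed to the model as keyword arguments after collation into tensors.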

To add a custom dataset the following steps need to be performed.

  1. Create a dataset configuration following the schema described above. Examples can be found in configs/datasets.py.
  2. Create a preprocessing routine which loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function needs to be (dataset_config, tokenizer, split_name), where split_name is the string for the train/validation split as defined in the dataclass.
  3. Register the dataset name and preprocessing function by inserting them as key and value into the DATASET_PREPROC dictionary in utils/dataset_utils.py.
  4. Set the dataset field in the training config to the dataset name, or use the --dataset option of the llama_recipes.finetuning module or the examples/finetuning.py training script.
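The steps above can be sketched end to end with local stand-ins. Everything here is illustrative: my_dataset, its field names, get_my_dataset, load_split and the toy texts are assumptions, and DATASET_PREPROC is a local dictionary rather than the one in utils/dataset_utils.py.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Step 1: a dataset configuration (hypothetical name and field names).
@dataclass
class my_dataset:
    dataset: str = "my_dataset"
    train_split: str = "train"
    test_split: str = "validation"


# Step 2: a preprocessing routine with the (dataset_config, tokenizer,
# split_name) signature. A list is used as a simple map-style dataset.
def get_my_dataset(dataset_config, tokenizer, split_name: str) -> List[dict]:
    texts = {"train": ["a b c", "d e"], "validation": ["f"]}[split_name]
    samples = []
    for text in texts:
        ids = list(tokenizer(text)["input_ids"])
        samples.append(
            {"input_ids": ids, "attention_mask": [1] * len(ids), "labels": list(ids)}
        )
    return samples


# Step 3: register the name and function (local stand-in for the registry).
DATASET_PREPROC: Dict[str, Callable] = {"my_dataset": get_my_dataset}


# Step 4: look the dataset up by name, as a training script might do.
def load_split(dataset_config, tokenizer, training: bool) -> List[dict]:
    split_name = dataset_config.train_split if training else dataset_config.test_split
    preprocess = DATASET_PREPROC[dataset_config.dataset]
    return preprocess(dataset_config, tokenizer, split_name)
```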

Application

Below we list other datasets and their main use cases that can be used for fine tuning.

Q&A: these can be used for evaluation as well

Instruction finetuning

  • Alpaca: 52k, instruction tuning
  • Dolly 15k: 15k, instruction tuning

Simple text generation for quick tests

  • English quotes: 2508, multi-label text classification, text generation

Reasoning: used mostly for evaluation of LLMs

Toxicity evaluation

Bias evaluation

Useful Links

More information on evaluation datasets can be found in HELM.