Framework for scalable synthetic data generation (SDG).
We recommend using a Python virtual environment with Python 3.9+. Here is how to set up a virtual environment using Python venv:
python3 -m venv ssdg_venv
source ssdg_venv/bin/activate
pip install .
Note: If you manage Python versions with pyenv, Conda Miniforge, or another tool, use that tool's virtual environment instead. Otherwise, packages may install successfully while their modules cannot be found, because they are linked to your Python version management tool rather than to venv.
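As a quick sanity check of the setup above, the following minimal sketch reports whether the active interpreter meets the 3.9+ recommendation and whether it is running inside a venv-style environment. The helper names are illustrative and not part of this repository. Note that the `sys.prefix` comparison only detects venv/virtualenv-style environments; a Conda environment may report `False` even when it is active.

```python
import sys


def python_is_supported(minimum=(3, 9)) -> bool:
    """Return True when the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def in_virtualenv() -> bool:
    """Return True inside a venv/virtualenv-style environment.

    In such an environment sys.prefix points at the environment directory,
    while sys.base_prefix points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "supported:", python_is_supported())
    print("inside virtual environment:", in_virtualenv())
```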
SDG uses Large Language Models (LLMs) to generate synthetic data. Scalable SDG therefore requires access to an LLM inference service to call the model. The following LLM inference APIs are supported: IBM GenAI, OpenAI, and vLLM.
Scalable SDG uses a .env file to specify the configuration for the IBM GenAI and OpenAI APIs. The .env file needs to be available in the directory from which the generate command is run. There is a template env file here.
The subsections that follow explain how to set up each of the different APIs.
When using the IBM GenAI API, you need to:
- Add configuration to the env file as follows:
GENAI_KEY=<genai key goes here>
GENAI_API=<genai api goes here>
- Install GenAI dependencies as follows:
pip install -e ".[genai]"
When using the OpenAI platform, you need to:
- Add configuration to the env file as follows:
OPENAI_API_KEY=<openai api key goes here>
- Install OpenAI dependencies as follows:
pip install -e ".[openai]"
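Before running a generation job, it can help to confirm that the .env file actually contains the variables listed above. The sketch below is an independent check written with the standard library only; it does not reflect how Scalable SDG itself loads the file, and the function names are hypothetical. It handles plain `KEY=VALUE` lines and ignores blank lines and `#` comments.

```python
from pathlib import Path

# Variable names taken from the instructions above, grouped by API.
REQUIRED = {
    "genai": ["GENAI_KEY", "GENAI_API"],
    "openai": ["OPENAI_API_KEY"],
}


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict, api: str) -> list:
    """Return the required variables that are absent or empty for `api`."""
    return [key for key in REQUIRED[api] if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    values = parse_env_text(env_file.read_text()) if env_file.exists() else {}
    for api in REQUIRED:
        print(api, "missing:", missing_keys(values, api))
```

Run it from the same directory as the generate command, since that is where the .env file is expected to be.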
When using vLLM batched inference, you need to:
- Install vLLM dependencies as follows:
pip install -e ".[vllm]"
Note: vLLM requires Linux OS and CUDA.
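The Linux and CUDA requirement can be roughly pre-checked before installing the vLLM extras. The sketch below is only a heuristic: it treats the presence of the `nvidia-smi` executable on the PATH as a hint that an NVIDIA driver is installed, which does not prove that a working CUDA toolkit is available. The function name is illustrative.

```python
import platform
import shutil


def vllm_prereq_report(system_name: str, has_nvidia_smi: bool) -> dict:
    """Summarize the two prerequisites mentioned in the note above."""
    return {
        "linux": system_name == "Linux",
        # Heuristic only: nvidia-smi on PATH suggests an NVIDIA driver is present.
        "nvidia_smi_found": has_nvidia_smi,
    }


if __name__ == "__main__":
    report = vllm_prereq_report(
        platform.system(),
        shutil.which("nvidia-smi") is not None,
    )
    for name, ok in report.items():
        print(f"{name}: {'ok' if ok else 'not detected'}")
```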
To get started with this example, make sure you have followed the Setup instructions and configured IBM GenAI and/or vLLM.
In this example, we will use the preloaded data files as the seed data to generate the synthetic data.
The default data builder is set to run with the GenAI API unless overridden. We therefore only need to run the following command (from the root of the repository) to execute data generation with GenAI:
python -m fms_dgt.__main__ --data-path ./data/logical_reasoning/causal/qna.yaml
Alternatively, you can use the CLI:
fms_dgt --data-path ./data/logical_reasoning/causal/qna.yaml
For convenience, we have provided an additional configuration file that can be modified to test a local model with vLLM. First, open the config file and update the model field model_id_or_path, replacing the <local-path-to-model> placeholder with the path of a model that has been downloaded locally. Then run:
python -m fms_dgt.__main__ --data-path ./data/logical_reasoning/causal/qna.yaml --include-config-path ./configs/demo.yaml
Note: vLLM requires Linux OS and CUDA.
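When the two invocations above need to be scripted, for example to switch between the default GenAI run and the vLLM config, the command line can be assembled in Python. This is a minimal sketch: `build_command` is a hypothetical helper, and it uses only the `--data-path` and `--include-config-path` flags shown above.

```python
import shlex
import sys


def build_command(data_path, include_config_path=None):
    """Build the module invocation shown above as an argument list."""
    cmd = [sys.executable, "-m", "fms_dgt.__main__", "--data-path", data_path]
    if include_config_path:
        # Optional extra config, e.g. the vLLM demo configuration.
        cmd += ["--include-config-path", include_config_path]
    return cmd


if __name__ == "__main__":
    data = "./data/logical_reasoning/causal/qna.yaml"
    # Print both variants; pass a list to subprocess.run(..., check=True) to execute it.
    print(shlex.join(build_command(data)))
    print(shlex.join(build_command(data, "./configs/demo.yaml")))
```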
The generated data will be output to the following directory: output/causal/data->logical_reasoning->causal/generated_instructions.json
This example uses the SimpleInstructDataBuilder as defined in ./fms_dgt/databuilders/simple/. For more information on data builders and other components of Scalable SDG, take a look at the SDG Design doc.
Check out our contributing guide to learn how to contribute.
This repository is based on the Language Model Evaluation Harness, which uses an MIT license.
@misc{eval-harness,
author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title = {A framework for few-shot language model evaluation},
month = 12,
year = 2023,
publisher = {Zenodo},
version = {v0.4.0},
doi = {10.5281/zenodo.10256836},
url = {https://zenodo.org/records/10256836}
}