Muse is a Python script for creating synthetic textbooks using Google's Gemini Pro API, providing free access at 60 req/min. On average, each request generates ~1000 tokens, resulting in 3.6 million tokens per API key per hour!
- Clone or download the repository to your local machine.
- Enter the directory.
- Run
python3 -m pip install -r requirements.txt
. - Run
python3 main.py
to generate aconfig.ini
file with default values. - Edit
config.ini
and add one or more API keys obtained from Google's Makersuite. Feel free to make other adjustments as well. - On Hugging Face, go to New Dataset and create a new dataset named
muse_textbooks
. - Go to Hugging Face Tokens to get your current access token (API key) or create a new one with write permission.
- Run
python3 main.py
; you'll be prompted to enter your Hugging Face API key.
The script submits 60 requests/minute to Google's Gemini Pro, generating textbooks in the specified (out_dir
) directory and auto-uploading them to a Hugging Face dataset named your_username/muse_textbooks
.
Note: Configuration values can be specified either in the config.ini file or passed as environment variables. It is advisable to avoid setting arbitrary values for temperature and top_p.
Run: nix-shell
inside the muse directory. Then follow instructions from the above-mentioned step 5.
By creating large amounts of open-source synthetic textbook data, we pave the way for open-source models that are more efficient and performant. phi-2 was trained on 250B tokens of mixed synthetic data and webtext; what might we be able to do with a 7B model trained on trillions of synthetic tokens?
The program auto-generates prompts by sampling from two seed datasets defined by the TEXT_DATASET
and CODE_DATASET
variables in data_processing.py
. After a passage is sampled, it is passed to one of three prompt templates defined by TEMPLATES
that instruct the LLM to either:
- Generate a two-person debate on a subject related to the passage, or
- Generate an informative lecture on a subject related to the passage, or
- Generate a computer science textbook section on a subject related to the passage.
This repository is open to any PRs/issues/suggestions :)