Awesome Instruction Datasets

中文 | English

Introduction

"Welcome to 'awesome-prompt-datasets', a comprehensive collection of high-quality open-source instruction tuning datasets to train chat-based LLMs (ChatGPT,LLaMA,Alpaca)。

Instruction Tuning / Reinforcement Learning from Human Feedback (RLHF) Dataset is a key component of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and utilize these resources.

With 'awesome-prompt-dataset', you can accelerate your research and development in NLP and unlock new opportunities for innovation. Let's explore the possibilities together!"

Prompt Datasets

Referring to this (@yaodongC), we labeled each collected dataset according to the following rules:

(Lang)Lingual-Tags:

EN: Instruction datasets in English
CN: Instruction datasets in Chinese
ML: [Multi-lingual] Instruction datasets in multiple languages

(Task)Task-Tags:

MT: [Multi-task] Datasets containing multiple tasks
TS: [Task-specific] Datasets tailored for specific tasks

(Gen)Generation-method:

HG: [Human Generated Dataset] Datasets created by humans
SI: [Self-Instruct] Datasets generated using self-instruct methods
MIX: [Mixed Dataset] Dataset contains both human and machine generated data
COL: [Collection of Dataset] Dataset made from a collection of other datasets

Statistics

Project	Datasets	Org	Nums	Lang	Task	Gen	Type	Src	Url
Chain of Thought	cot_data \|few_shot_data	Google	74771	EN/CN	MT	HG	instruct with cot reasoning	annotating CoT on existing data	download
GPT4all	nomic-ai/gpt4all-j-prompt-generations	nomic-ai	806199	EN	MT	COL	code, storys and dialogs	distillation from GPT-3.5-turbo	download
GPTeacher	GPT-4 General-Instruct \|Roleplay-Instruct \|Code-Instruct \| Toolformer	teknium1	29013	EN	MT	SI	general, roleplay, toolformer	GPT-4 & toolformer	download
Guanaco	JosephusCheung/GuanacoDataset	JosephusCheung	534610	ML	MT	SI	various linguistic tasks	text-davinci-003	download
HC3	Hello-SimpleAI/HC3	Hello-SimpleAI \| 万得资讯	37175	EN/CN	TS	MIX	dialogue evaluation	human or ChatGPT	download
HC3-Chinese	Hello-SimpleAI/HC3-Chinese	Hello-SimpleAI\|万得资讯	13k	CN	TS	MIX	dialogue evaluation	human or ChatGPT
alpaca	tatsu-lab/alpaca	tatsu-lab	52002	EN	MT	SI	general instruct	text-davinci-003	download
AlpacaDataCleaned	yahma/alpaca-cleaned	yahma	52k	EN	MT	SI	general instruct	text-davinci-003	download
Chinese-LLaMA-Alpaca	alpaca_data_zh_51k	ymcui(讯飞)	51k	CN	MT	SI	general instruct	text-davinci-003
Luotuo-Chinese-LLM 骆驼	trans_chinese_alpaca_data	LC1332(商汤)	52k	CN	MT	SI	general instruct	text-davinci-003
Natural Instructions	Allen AI 61 task\|1.5k task	Allen AI	5040134	ML	MT	COL	diverse nlp tasks	human annotated datasets collection	download
belle_cn	BelleGroup/train_1M_CN \|BelleGroup/train_0.5M_CN	BelleGroup(链家)	1079517	CN	TS/MT	SI	general, mathematical reasoning, dialogue	text-davinci-003	download
instinwild	instinwild_ch \| instinwild_en		52191	EN/CN	MT	SI	generation, open-qa, mind-storm	text-davinci-003	download
华驼(HuaTuo)	中文医学知识 \|肝癌	SCIR-HI(哈工大)	8K	CN	TS	SI	公开和自建的中文医学知识库	GPT3.5
prosocial dialog	allenai/prosocial-dialog	allenai	165681	EN	TS	MIX	dialogue	GPT-3 rewrites questions + humans feedback manually	download
finance_en	gbharti/finance-alpaca		68912	EN	TS	COL	financial related qa	GPT3.5	download
xP3	bigscience/xP3	bigscience	78883588	ML	MT	COL	a collection of prompts & datasets across 46 of languages & 16 NLP tasks	human annotated datasets collection	download
firefly	YeungNLP/firefly-train-1.1M		1649398	CN	MT	COL	23 nlp tasks	human annotated datasets collection	download
instruct	swype/instruct		888969	EN	MT	COL	augmented of GPT4All, Alpaca, open-source Meta datasets	augmentation performed using the advanced NLP tools provided by AllenAI	download
Code Alpaca	sahil280114/codealpaca		20022	EN	TS	SI	code generation, editing, optimization	text-davinci-003	download
Alpaca_GPT4	alpaca_gpt4_data\|alpaca_gpt4_data_zh \|comparison_data_v2	微软	52002	EN/CN	MT	SI	general instruct	generated by GPT-4 using Alpaca	download
webGPT	openai/webgpt_comparisons	openai	18994	EN	TS	MIX	information retrieval (IR) QA	fine-tuned GPT-3, each instruction has two outputs, select better one	download
dolly 2.0	databricks/databricks-dolly-15k	databricks	15015	EN	TS	HG	closed QA , summarization and etc, Wikipedia as references	human annotated	download
mosaicml/llm-foundry	mosaicml/dolly_hhrlhf	mosaicml	59.3K	EN	TS	HG	This dataset is a combination of Databrick's dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF.	human annotated
baize 白泽	alpaca_chat_data.json \|medical_chat_data.json \| quora_chat_data.json \|stackoverflow_chat_data.json	project-baize	653699	EN	MT	COL	a collection from Alpaca, Quora, StackOverFlow and MedQuAD questions	human annotated datasets collection	download
hh-rlhf	Anthropic/hh-rlhf	Anthropic	284517	EN	TS	MIX	dialogue	dialog between human and RLHF models	download
OIG(part)	laion/OIG	laion	49237	EN	MT	COL	created from various tasks, such as question and answering	using data augmentation, human annotated datasets collection	download
GAOKAO	Fill-in-the-blank_Questions \| Multiple-choice_Questions \| Open-ended_Questions	OpenLMLab	2785	CN	MT	COL	Multiple-choice, Fill-in-the-blank and Open-ended questions from examination	human annotated	download
camel \| 骆驼	camel-ai/code\|camel-ai/biology \|camel-ai/physics \|camel-ai/chemistry \|camel-ai/math	camel-ai	760620	EN	MT	SI	Role-Playing conversations in AI Society, Code, Math, Physics, Chemistry, Biolog	gpt-3.5-turbo	download
FLAN-Muffin	Muennighoff/flan		1764800	EN	MT	COL	60 nlp tasks	human annotated datasets collection	download
COIG	COIG	BAAI\|智源	298428	CN	MT	COL	collect fron Exam, Translated, Human Value Alignment Instructions and Counterfactural Correction Multi-round Chat	using automatic tool and manual verification	download
GPT4Tools	gpt4tools_71k.json	StevenGrove	71446	EN	MT	SI	a collection of tool-related instructions	gpt-3.5-turbo	download
ShareChat	RyokoAI/ShareGPT52K	RyokoAI	1663241	EN	MT	MIX	general instruct	crowdsourcing to collect conversations between people and ChatGPT (ShareGPT)	download
Auto CoT	kojima-takeshi188/zero_shot_cot/dataset \|kojima-takeshi188/zero_shot_cot/log	amazon-science		EN					download
MOSS（复旦 Moss）	fnlp/moss-002-sft-data\| moss-003-sft-data	fnlp	1583595	EN/CN	SI				download
ultrachat	stingning/ultrachat	thnlp	28247446	EN					download
StackLLaMA	lvwerra/stack-exchange-paired		todo	EN		HG
Self-Instruct	yizhongw/self-instruct		82 K	EN	SI	SI
Zhihu-KOL	Zhihu-KOL	Openassisent	100 w		SI	HG	Zhihu data for training Open Assitant
stanfordnlp/SHP	stanfordnlp/SHP	stanfordnlp	385 k	EN	MT	HG		human preferences over responses
LAION-AI/Open-Assistant	OpenAssistant/oasst1	Openassisent	84.4k	EN	MT	HG	OpenAssistant Conversations Dataset (OASST1)	human-generated, human-annotated
akoksal/LongForm	akoksal/LongForm	akoksal/LongForm	30k	EN	SI	HG		们从现有语料库（如 C4 和维基百科）中选择一组不同的人工文档，并通过 LLM 为给定的文档生成指令。
sail-sg/symbolic-instruction-tuning	sail/symbolic-instruction-tuning	sail-sg	800K	ML	SI			Human Synthetic Examples
医疗问答 michael-wzhu/PromptCBLUE	michaelwzhu/ChatMed_Consult_Dataset	michael-wzhu	110113	CN	SI			互联网上的医疗问诊问题(110,113)，反映了真实世界的不同用户/患者的医疗问诊需求。目前response都是由OpenAI `GPT-3.5`引擎回答的。
mbzuai-nlp/LaMini-LM	MBZUAI/LaMini-instruction	MBZUAI/LaMini-instruction	2.58M	EN	MT	SI		通过离线蒸馏从大型语言模型中提取知识
pCLUE	pCLUE		120 万
WizardLM	victor123/evol_instruct_70k	WizardLM	70k	EN	MT

RLHF Datasets

Statistics

Project	Links	Org	Nums	Lang	Summary
webgpt_comparisons		Openai	19,578	English	In the WebGPT paper, the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total.
SHP		stanfordnlp	349 K	English	SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., SteamSHP).
rlhf-reward-datasets		yitingxie	76.3 k	English
Dahoas/full-hh-rlhf		Dahoas	112 k	English	Anthropic's HH dataset reformatted into prompt, chosen, rejected samples.
Dahoas/synthetic-instruct-gptj-pairwise		Dahoas		English
Dahoas/rm-static		Dahoas	76.3k	English	Split of hh-static used for training reward models after supervised fine-tuning.
Anthropic/hh-rlhf		Anthropic	22k	English	This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data.
Instruction-Tuning-with-GPT-4/GPT-4-LLM		Instruction-Tuning-with-GPT-4	52k	English	Ranked responses (Note: Data is evaluated by `GPT-4` model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses"
thu-coai/Safety-Prompts	thu-coai/Safety-Prompts	thu-coai	100k	Chinese	中文安全prompts，用于评测和提升大模型的安全性，将模型的输出与人类的价值观对齐。
Chatgpt-Comparison-Detection project	Hello-SimpleAI/HC3		24.3K	English	Human ChatGPT Comparison Corpus, 60k human answers and 27K ChatGPT answers for around 24K questions.

Open ChatLLMs

Release	Model_name	Base	Model_Size	Datasets	Number of Instances	Language
2022-12	GPT-3 Self Inst.	GPT-3	175B	Self-Instruct	82 k	En
2023-03-03	alpaca	LLaMA	7B	alpaca_data	52 k	En
2023-03-19	alpaca-lora	LLaMA	7B 13B 30B	alpaca_data、alpaca_data_cleaned	52 k	En
2023-03-23	Chinese-Vicuna	LLaMA	7B 13B	BELLE、GuanacoDataset	1M	Zh
2023-03-24	Alpaca-CoT	LLaMA	7B	dataset	----	En Zh
2023-03-25	dolly	dolly	6B	alpaca_data	52 k	En
2023-03-25	guanaco	LLaMA	7B	GuanacoDataset	534 k	En Zh Ja De
2023-03-28	Chinese-LLaMA-Alpaca	LLaMA	7B	alpaca_data_zh、pCLUE、translation2019zh、alpaca_data、Self-Instruct	2M	Zh
2023-03-29	ColossalChat	LLaMA	7B 13B	InstructionWild	104 k	En Zh
2023-03-31	Luotuo	LLaMA ChatGLM	7B 6B	trans_chinese_alpaca_data	52k	Zh
2023-03-31	cerebras-lora-alpaca	Cerebras-GPT	2.7B	AlpacaDataCleaned	52k	En

The template

Append the new project at the end of file

[{Project-name}/{Dataset-name}]{https://github.com/link/to/project}

- [paper/project link](link)
- [dataset link](link)
- Related work: (if applicable)

Some introductions ...

The Prompt Datasets List

Alpaca -Stanford

Paper/Project Link
Dataset Link
Data generation model: text-davinci-003
Cost: $600

The Alpaca of the Stanford release is a fine-tuning model for instruct-tuning based on the Meta Ai LLaMA model.

Alpaca automatically generated 52k instruction data using GPT-3.5 and used it to fine-tune the LLaMA model. Experimental results show that it can reach or even exceed the performance of GPT-3.5 on some tasks.

Instruction in the Wild

Paper/Project Link
Dataset Link
Data generation model: text-davinci-003

Instruction Tuning is a key component of ChatGPT. OpenAI used their user-based Instruction dataset, but unfortunately, this dataset is not open-sourced. Self-Instruct released a small instruction dataset including 175 instructions written by human labors. Standford Alpaca Team generated 52K instructions by text-davinci-003 model based on the the 175 seed instructions above.

This project targets on a larger and more diverse instruction dataset. To this end, we collected 429 instructions from ChatGPT usage screenshots and released both English and Chinese versions. We found these instructions are very diverse even if the scale is still small. We follow Alpaca to generate 52K instructions and their responses. All data can be found in data dir.

Note: This is an ongoing project. We are still collecting and improving our data. We release this dataset as early as possible to speedup our LLM research. We will also release a whitepaper soon.

JosephusCheung/GuanacoDataset

Data generation model: text-davinci-003
Cost: $6000

52K instruction data generated from modified self-instruct pipeline with human written 429 seed task.

Stanford Human Preferences Dataset (SHP)

DataLinks

SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., SteamSHP).

Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively). SHP exploits the fact that if comment A was written after comment B but has a higher score nonetheless, then A is ostensibly more preferred to B. If A had been written before B, then we could not conclude this, since its higher score could have been the result of more visibility. We chose data where the preference label is intended to reflect which response is more helpful rather than which is less harmful, the latter being the focus of much past work.

How is SHP different from Anthropic's HH-RLHF dataset? Most notably, all the data in SHP is naturally occurring and human-written, whereas the responses in HH-RLHF are machine-written, giving us two very different distributions that can complement each other.

Hello-SimpleAI/HC3

Summary:The the first human-ChatGPT comparison corpus (English Version), named HC3 dataset
Data generation model: gpt-3.5, human generated
paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
Cost: N/A

Hello-SimpleAI/HC3-Chinese

Summary:The the first human-ChatGPT comparison corpus (Chinese Version), named HC3 dataset
Data generation model: gpt-3.5, human generated
paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
Cost: N/A

allenai/prosocial-dialog

Summary: ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms.
Data generation model: gpt-3.5, human generated
paper: ProsocialDialog: A Prosocial Backbone for Conversational Agents
Cost: N/A

allenai/natural-instructions

Summary: A community effort to create a large collection of 1,616 diverse NLP tasks and their natural language definitions/instructions.
Data generation model: Human generated
paper: Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
Cost: N/A

PhoebusSi/Alpaca-CoT

Summary: A datset for Chain-of-Thoughts reasoning based on LLaMA and Alpaca. Note: Their repository will continuously collect various instruction tuning datasets. Github Repo
paper: N/A
Cost: N/A

nomic-ai/gpt4all

Summary: gpt4all leverages three publicly available datasets: 1.laion/OIG, 2.pacovaldez/stackoverflow-questions 3. subset of bigscience/bloomz-p3
Data generation model: N/A
paper: GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
Cost: $500

bigscience/xP3

Summary: [Prompt-resource] xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts & datasets across 46 of languages & 16 NLP tasks.
Data generation model: N/A
paper: Crosslingual Generalization through Multitask Finetuning
Cost: N/A

teknium1/GPTeacher

Summary: A collection of modular datasets generated by GPT-4, General-Instruct - Roleplay-Instruct - Code-Instruct - and Toolformer
Data generation model: GPT-4
paper: N/A
Cost: N/A

thunlp/UltraChat

Summary: UltraChat aims to construct an open-source, large-scale, and multi-round dialogue data. The first part of UltraChat (i.e., the Questions about the World sector) is released, which contains 280k diverse and informative dialogues. More dialogues about writing and creation, assistance on existing materials are to come.
Data generation model: GPT-3.5-turbo
paper: N/A
Cost: N/A

cascip/ChatAlpaca

Summary: Based on the Stanford Alpaca data, ChatAlpaca extends the data to multi-turn instructions and their corresponding responses. More data (20k) and the Chinese translated version are to come.
Data generation model: GPT-3.5-turbo
paper: N/A
Cost: N/A
Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI

YeungNLP/firefly-train-1.1M)

Summary: Chinese datasets of 23 tasks combined with human-written instruction templates.
Data generation model: N/A
paper: N/A
Cost: N/A

orhonovich/unnatural-instructions

Summary: 64K examples by prompting a language model with three seed examples of instructions and eliciting a fourth. Then the set is expanded to 240K by prompting the model to rephrase each instruction.
Data generation model: text-davinci-002
paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Cost: N/A

Instruction-Tuning-with-GPT-4/GPT-4-LLM

Summary: 52K instruction-following data generated by GPT-4 with the original Alpaca prompts & Alpaca prompts translated into Chinese by ChatGPT + 9K instruction-following data generated by GPT-4 with prompts in Unnatural Instruction.
Data generation model: GPT-4
paper: Instruction Tuning with GPT-4
Cost: N/A
Related: -(tatsu-lab/Alpaca)|52K|EN|MT|SI -(orhonovich/unnatural-instructions)|240K|EN|MT|MIX

databrickslabs/dolly

Summary: This datset was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
Data generation model: N/A
paper: Free Dolly
Cost: N/A

OpenAssistant/oasst1

Summary: OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings.
Data generation model: N/A
paper: OpenAssistant Conversations - Democratizing Large Language Model Alignment
Cost: N/A

BELLE/data/1.5M

下载地址: https://github.com/LianjiaTech/BELLE/tree/main/data/1.5M
数据量: 1.5M
生成方式: self-instruct，使用了中文种子任务，以及openai的text-davinci-003接口
涉及任务: 包含175个种子任务，https://github.com/LianjiaTech/BELLE/blob/main/data/1.5M/zh_seed_tasks.json
数据示例: https://huggingface.co/datasets

alpaca_chinese_dataset

下载地址: https://github.com/hikariming/alpaca_chinese_dataset
数据量: 52k
生成方式: 借助chatgpt对原始的stanford_alpaca做机器翻译，并加入人工校验来保证质量
涉及任务: 与原始的stanford_alpaca一致，可以在原项目的seed_task.json中查到全部任务

Med-ChatGLM/data

下载地址: https://github.com/SCIR-HI/Med-ChatGLM
数据量: 7k
生成方式: 利用GPT3.5接口围绕医学知识库构建问答数据，并设置了多种Prompt形式来充分利用知识
涉及任务: 医学领域相关的问答，包含并发症，高危因素，组织学检查，临床症状，药物治疗，辅助治疗

pCLUE

下载地址: https://github.com/CLUEbenchmark/pCLUE
数据量: 1.2M
生成方式: 通过原有的NLP任务数据集，结合特定的prompt模板生成
涉及任务: 包含9个NLP数据集，涉及的NLP任务有文本分类/自然语言推理/语义匹配/指代消解/关键词识别/阅读理解

COIG

下载地址: https://huggingface.co/datasets/BAAI/COIG
数据量:
- Translated Instructions (67,798)
- Exam Instructions (63,532)
- Human Value Alignment Instructions (34,471)
- Counterfactural Correction Multi-round Chat (13,653)
- Leetcode Instructions (11,737)
生成方式: 融合了多个领域的数据，具体可以参考论文Chinese Open Instruction Generalist: A Preliminary Release

https://github.com/FreedomIntelligence/InstructionZoo

https://github.com/lightaime/camel

The RLHF Datasets List

Anthropic/hh-rlhf

Summary: This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data.
Data generation model: Anthropic RL-CAI 52B
paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Cost: N/A

HuggingFaceH4/stack-exchange-preferences

Summary: This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training.
Data generation model: N/A
paper: A General Language Assistant as a Laboratory for Alignment
Cost: N/A

stanfordnlp/SHP

Summary: Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively).
Data generation model: N/A
paper: N/A
Cost: N/A

Instruction-Tuning-with-GPT-4/GPT-4-LLM

Summary: Ranked responses (Note: Data is evaluated by GPT-4 model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses"
Data generation model: GPT-4
paper: Instruction Tuning with GPT-4
Cost: N/A
Related: -(tatsu-lab/Alpaca)|52K|EN|MT|SI

Natural Instruction / Super-Natural Instruction

Allen AI is the first organization to try Instruction as a prompt and fine-tune LLMs. In the Natural Instruction paper, you can basically understand the labeling ideas of the instruction.

In its proposed dataset, 61 and different NLP tasks are included.

Super-Natural Instruction is a super-intensive version of Natural Instruction, which contains more than 1,600 different NLP tasks, and there are more than 76 different types of NLP tasks (such as: classification, extraction, sequence labeling).

BigScience/P3

BigScience is jointly organized by Hugging Face and French CNRS, IDRIS, GENCI, etc. It is one of the largest open source LLMs organizations.

BigScience developed the PromptSource project at the end of 2021, and open sourced a series of toolkits to help researchers build prompts based on existing NLP tasks. So far, the PromptSource project contains more than 2000 prompt templates for 270 NLP tasks.

On this basis, BigScience constructed the P3 dataset. You can find P3 data on Hugging Face Hub, and the data size of P3 is between 100M-1B.

xMTF - BigScience

Based on the English prompt, BigScience extends its prompt to multiple non-English languages.

The project contains 13 NLP tasks and is available in 46 different languages. The corresponding prompt contains an indeterminate number of languages.

After fine-tuning on the basis of multilingual, both BLOOM and T0 have realized the ideal multilingual ability.

HH-RLHF - Anthropic

Claud under Anthropic is one of the main competitors of ChatGPT.

Anthropic has open-sourced the RLHF dataset it uses in its own product line.

The original intention of the HH-RLHF project is to train Helpful and Harmless (HH) LLMs. Therefore, in addition to the quality of the project's responses, whether it is harmful information is also reflected in its human feedback.

The paper records how to use the behavior of the RLHF data Align model to human values, and records the construction method and standards of the data set.

Unnatural Instruction

Using LLMs to independently generate instruction data is an active direction in the field of instruction-tuning.

Unnatural Instruction uses GPT3 (text-davinci-002) to generate 64k instruction prompt data. And use the same model to rewrite the 64k prompt, and finally get 240k instruction data.

The paper shows that the prompts generated by LLMs in Instruct-Tuning show good results, even surpassing models such as T0 that are fine-tuned on P3 and other data.

Self-Instruct

Self-Instruct is also the idea of using LLMs to generate prompts for instruction-tuning. However, a more fine-grained generation process is used.

Concepts such as Task pool and Quality filtering were introduced to partially alleviate the noise problem of self-intrauct type data.

UnifiedSKG - HKU

UnifiedSKG has added knowledge grounding in the Text-to-Text framework, that is, in the prompt-output framework, it has added structured data for assistance.

As an example, some NLP tasks rely heavily on structured knowledge bases/databases. The idea of UnifiedSKG is to serialize the required database and embed it into the prompt. As shown below.

UnifiedSKG represents a direction in the field of LLMs that attempts to use structured knowledge to enhance performance.

Google/Flan Collection

In this project, Google merged its own Flan 2021 data with some open source instruction data (P3, super-natural instruction, etc.).

In Flan Collection's paper, Google also summarizes some key points in Flan series model training/reasoning, which may have good reference value.

The Flan Collection compiles datasets from Flan 2021, P3, Super-Natural Instructions, along with dozens more datasets into one place, formats them into a mix of zero-shot, few-shot and chain-of-thought templates

InstructDial

InstructDial is an attempt to fine-tune instructions on a specific task type. Experimental results show that after fine-tuning on dialogue instruction data, the model performs better on dialogue tasks than on very large-scale task sets.

ChatGPT Distillation Data

Public User-Shared Dialogues with ChatGPT (ShareGPT) Around 60K dialogues shared by users on ShareGPT were collected using public APIs. To maintain data quality, we deduplicated on the user-query level and removed any non-English conversations. This leaves approximately 30K examples.

Human ChatGPT Comparison Corpus (HC3) We use both the human and ChatGPT responses from the HC3 english dataset, which contains around 60K human answers and 27K ChatGPT answers for around 24K questions, resulting in a total number of around 87K question-answer examples.

Open Instruction Generalist (OIG).

We use a manually-selected subset of components from the Open Instruction Generalist dataset curated by LAION. Specifically, we use the grade-school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples.

OpenAI WebGPT.

In the WebGPT paper, the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total.

Each example in the dataset contains a pair of model answers for a question, and the associated metadata. Each answer has a preference score from humans that can be used to determine which of the two answers are better.

OpenAI Summarization.

The OpenAI summarization dataset contains ~93K examples, each example consists of feedback from humans regarding the summarizations generated by a model. Human evaluators chose the superior summary from two options.

Datasets without license information

alespalla/chatbot_instruction_prompts

Summary: A compilation of tatsu-lab/alpaca ,Dahoas/instruct-human-assistant-prompt ,allenai/prosocial-dialog
Data generation model: N/A
paper: N/A
Cost: N/A

Contributing

Our purpose is to make this repo even better. If you are interested in contributing, please refer to HERE for instructions in contribution.

License

Awesome-Prompt-Dataset is released under the Apache 2.0 license.

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_zh.md		README_zh.md

License

jianzhnie/awesome-instruction-datasets

Folders and files

Latest commit

History

Repository files navigation

Awesome Instruction Datasets

Contents

Introduction

Prompt Datasets

Statistics

RLHF Datasets

Statistics

Open ChatLLMs

The template

The Prompt Datasets List

BELLE/data/1.5M

alpaca_chinese_dataset

Med-ChatGLM/data

pCLUE

COIG

The RLHF Datasets List

Natural Instruction / Super-Natural Instruction

xMTF - BigScience

HH-RLHF - Anthropic

InstructDial

ChatGPT Distillation Data

Open Instruction Generalist (OIG).

OpenAI WebGPT.

OpenAI Summarization.

Datasets without license information

Contributing

License

Reference

About

Topics

Resources

License

Stars

Watchers

Forks