This is the official code for OpenBA: An Open-Sourced 15B Bilingual Asymmetric Seq2Seq Model Pre-trained from Scratch
- We have updated our paper (fixed typos and added content about downstream adaptation). Please refer to the new version: OpenBA: An Open-Sourced 15B Bilingual Asymmetric Seq2Seq Model Pre-trained from Scratch
- We have released four downstream versions of OpenBA
- OpenBA-Chat: Multi-turn Dialogue Model
- OpenBA-Code: Instruction-guided Code Generation Model
- OpenBA-InstructGen: Instruction Generation Model
- OpenBA-Tool: Retrieval Model with Tools
We are excited to unveil the following versions of our model:
- OpenBA-LM: The backbone language model, pre-trained from scratch on 340B English, Chinese, and code tokens.
- OpenBA-Flan: The multilingual instruction model, obtained by further supervised fine-tuning of OpenBA-LM on 40B tokens from the constructed BiFlan dataset.
- OpenBA-Chat: Multi-turn Dialogue Model
- OpenBA-Code: Instruction-guided Code Generation Model
- OpenBA-InstructGen: Instruction Generation Model
- OpenBA-Tool: Retrieval Model with Tools
Model performance on the C-Eval benchmark, where #Param. denotes the number of model parameters.
Model | #Param. | STEM | Social Science | Humanities | Others | Avg. | Avg.(Hard) |
---|---|---|---|---|---|---|---|
LLaMA | 65B | 37.8 | 45.6 | 36.1 | 37.1 | 38.8 | 31.7 |
ChatGLM | 6B | 33.3 | 48.3 | 41.3 | 38.0 | 38.9 | 29.2 |
Baichuan | 7B | 38.2 | 52.0 | 46.2 | 39.3 | 42.8 | 31.5 |
MOSS-moon-sft | 16B | 31.6 | 37.0 | 33.4 | 32.1 | 33.1 | 28.4 |
GLM-130B | 130B | 36.7 | 55.8 | 47.7 | 43.0 | 44.0 | 30.7 |
OpenBA | 15B | 34.8 | 46.6 | 41.1 | 41.5 | 39.8 | 31.1 |
Model performance on the BBH benchmark, where #Param. denotes the model parameters. We report the accuracy score for all the models.
Model | #Param. | BBH |
---|---|---|
ChatGLM | 6B | 31.3 |
Baichuan | 7B | 31.9 |
BatGPT | 15B | 34.1 |
MOSS | 16B | 29.3 |
OpenBA | 15B | 34.1 |
Model performance on the BELEBELE benchmark, where #Param. denotes the number of model parameters.
Model | #Param. | eng_Latn | zho_Hans | zho_Hant | Avg. |
---|---|---|---|---|---|
Falcon | 40B | 77.2 | 66.0 | 62.2 | 68.5 |
LLaMA | 70B | 82.5 | 64.6 | 57.7 | 68.2 |
InfoXLM | 550M | 79.3 | 74.6 | 72.4 | 75.4 |
XLM-V | 1.2B | 76.2 | 71.0 | 67.1 | 71.4 |
LLaMA2-Chat | 70B | 78.8 | 62.4 | 59.3 | 66.8 |
OpenBA | 15B | 78.6 | 75.2 | 73.7 | 75.8 |
Model performance on a subset of 50 sentences sampled from the Flores benchmark, where #Param. denotes the number of model parameters. We report BLEU scores for all the models.
Model | #Param. | Zh | En |
---|---|---|---|
ChatGLM | 6B | 17.2 | 32.5 |
Alpaca | 7B | 15.1 | 9.8 |
Alpaca-LoRA | 7B | 16.4 | 14.5 |
PARROT | 7B | 19.6 | 24.8 |
BatGPT | 15B | 23.1 | 38.7 |
MOSS | 16B | 17.2 | 32.5 |
OpenBA | 15B | 23.3 | 37.4 |
You should first install the requirements below:
pip install transformers==4.31.0 "torch>=2.0" sentencepiece
NOTICE: make sure the version of the transformers library is no higher than 4.33.2!
For inference, note that the task token <S> and the special token <extra_id_0> are used in the length-adaptation and fine-tuning stages, so you should format your input as <S> {your input} <extra_id_0> to get a better answer.
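A small helper like the following can build such a prompt (a hypothetical convenience function for illustration, not part of the released code):

```python
def build_prompt(instruction: str) -> str:
    # Hypothetical helper: wrap a raw instruction in the task token <S>
    # and the sentinel token <extra_id_0> described above.
    return "<S>" + instruction + "<extra_id_0>"
```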
Below is a sentence-completion example using OpenBA-LM.
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("OpenBA/OpenBA-LM", trust_remote_code=True)
>>> model = AutoModelForSeq2SeqLM.from_pretrained("OpenBA/OpenBA-LM", trust_remote_code=True).half().cuda()
>>> model = model.eval()
>>> query = "<S>" + "苏州处太湖平原,沿江为高沙平原,河" + "<extra_id_0>"
>>> inputs = tokenizer(query, return_tensors="pt").to("cuda")
>>> outputs = model.generate(**inputs, do_sample=True, max_new_tokens=32)
>>> response = tokenizer.decode(outputs[0], skip_special_tokens=True)
>>> print(response)
流两侧为河淤平原,苏州平原是江苏平原主体,地势低平,土地肥沃,气候温和
Below is an instruction example using OpenBA-Flan.
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("OpenBA/OpenBA-Flan", trust_remote_code=True)
>>> model = AutoModelForSeq2SeqLM.from_pretrained("OpenBA/OpenBA-Flan", trust_remote_code=True).half().cuda()
>>> model = model.eval()
>>> query = "<S>" + "介绍一下中国的四大名著,并分别概括其主要内容" + "<extra_id_0>"
>>> inputs = tokenizer(query, return_tensors="pt").to("cuda")
>>> outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
>>> response = tokenizer.decode(outputs[0], skip_special_tokens=True)
>>> print(response)
中国的四大名著分别是《红楼梦》、《西游记》、《水浒传》和《三国演义》。它们分别包括故事情节、文化内涵和历史背景等方面的不同特点。《红楼梦》是一部中国古典小说,讲述了贾宝玉、林黛玉、薛宝钗等一群人物在贾府的生活和爱情故事。《西游记》是中国著名小说,描述了孙悟空、猪八戒、沙悟净等一众妖魔鬼怪的冒险历程和故事。《水浒传》是一部中国古典小说,描述了宋江等一百零八位好汉的反抗故事。《三国演义》是中国古代著名小说,讲述了三国时期的历史和战争故事。这些小说在文学、历史、哲学和文化等方面都有着不同的影响和地位。
You can run the chat and code demos as follows:
python gradio_chat_demo.py # run chat demo
python gradio_code_demo.py # run code demo
Our training code is located in the training folder. Based on Megatron-LM, we made the following implementations (a brief SwiGLU sketch follows the list):
- SwiGLU activation function,
- UL2 training objective,
- Rotary positional embedding,
- A unified MMap data processing method for both pre-training and fine-tuning phases.
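As a quick reference, here is a minimal PyTorch sketch of a SwiGLU feed-forward block (an illustrative simplification, not the Megatron-LM implementation; module and weight names are our own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: out = W_out( SiLU(x W_gate) * (x W_up) )."""

    def __init__(self, hidden_size: int, ffn_size: int):
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, ffn_size, bias=False)
        self.w_up = nn.Linear(hidden_size, ffn_size, bias=False)
        self.w_out = nn.Linear(ffn_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU-activated gate modulates the up projection element-wise.
        return self.w_out(F.silu(self.w_gate(x)) * self.w_up(x))
```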
For pre-training, the relevant requirements should be installed beforehand as stated in Megatron-LM. Then you can simply run the following commands to process texts into bytes, which can be read faster by an MMap dataset:
cd training
bash scripts/data_process_span_corr.sh # process pre-train data
bash scripts/data_process_flan.sh # process fine-tune data
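For intuition, here is a simplified Python sketch of the span-corruption formatting that UL2-style objectives rely on (illustrative only; the released scripts and training code implement the full UL2 objective, whose span lengths and corruption rates differ across denoising tasks):

```python
import random
from typing import List, Tuple

def span_corrupt(tokens: List[str], corruption_rate: float = 0.15,
                 mean_span_len: int = 3) -> Tuple[List[str], List[str]]:
    """Replace random spans with sentinel tokens; return (encoder input, decoder target)."""
    inputs, targets = [], []
    i, sentinel_id = 0, 0
    while i < len(tokens):
        # Start a masked span with probability chosen so that roughly
        # corruption_rate of all tokens end up masked on average.
        if random.random() < corruption_rate / mean_span_len:
            span_len = max(1, round(random.gauss(mean_span_len, 1)))
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)                     # sentinel replaces the span
            targets.append(sentinel)                    # target echoes the sentinel
            targets.extend(tokens[i:i + span_len])      # followed by the masked tokens
            i += span_len
            sentinel_id += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```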
Then you can run distributed training across multiple nodes with:
bash scripts/run_pretrain.sh # pre-train
bash scripts/run_stretch.sh # length adaptation
bash scripts/run_flan.sh # fine-tune
Generally, the OpenBA model follows the standard encoder-decoder architecture. However, it is worth noting that the encoder and decoder serve different roles: the encoder endows the model with strong comprehension capability, while the decoder gives the model its generative ability. Existing works indicate that an encoder-decoder model with more encoder layers can achieve powerful performance. To fill the gap of LLMs with deeper decoders, we design an asymmetric structure, whose hyper-parameters are listed in the table below.
Encoder Layers | Decoder Layers | Attn Heads | Hidden Size | FFN Size | #Param.(B) | Vocab Size | Training Tokens | Pos Emb |
---|---|---|---|---|---|---|---|---|
12 | 36 | 40 | 4096 | 16384 | 14.6 | 251000 | 380B | RoPE |
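If needed, you can also inspect these hyper-parameters directly from a released checkpoint (attribute names in the configuration may differ from the table headers):

```python
from transformers import AutoConfig

# Load the configuration of the released checkpoint; trust_remote_code is
# required because OpenBA ships a custom model implementation.
config = AutoConfig.from_pretrained("OpenBA/OpenBA-LM", trust_remote_code=True)
print(config)
```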
- Language(s) (NLP): Chinese/English
- License: The code in this project is licensed under the Apache 2.0 license, and the model weights are licensed under the GNU AGPL 3.0 license. If you intend to use the models included in this project for commercial purposes or public deployment, please email us to obtain authorization. Commercial usage information will be used for record purposes only, and no fees will be charged.
The use of the OpenBA-LM should adhere to societal norms and not be used for any activities that jeopardize national or social security or violate the law. Additionally, we also request users not to use the OpenBA-LM for internet services that have not undergone appropriate security review and documentation. We hope that all users will abide by this principle to ensure that technological development occurs in a regulated and legal environment.
We have done our best to ensure the compliance of the data used during the model training process. However, despite our significant efforts, unforeseen issues may still arise due to the complexity of the model and data. If misleading or harmful statements are generated through the use of the models included in this project or their modified versions while providing services, the responsibility lies with the service provider and is not associated with this project.
Please cite us if our paper or code helps you.
@article{li2023openba,
title={OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch},
author={Li, Juntao and Tang, Zecheng and Ding, Yuyang and Wang, Pinzheng and Guo, Pei and You, Wangjie and Qiao, Dan and Chen, Wenliang and Fu, Guohong and Zhu, Qiaoming and others},
journal={arXiv preprint arXiv:2309.10706},
year={2023}
}