BELLE: Bloom-Enhanced Large Language model Engine

本项目基于 Stanford Alpaca ，Stanford Alpaca 的目标是构建和开源一个基于LLaMA的模型。 Stanford Alpaca 的种子任务都是英语，收集的数据也都是英文，因此训练出来的模型未对中文优化。

本项目针对中文做了优化，模型调优仅使用由ChatGPT生产的数据（不包含任何其他数据）。项目包含以下内容:

175个中文种子任务
生成数据的代码
基于BLOOMZ-7B1-mt优化后的模型

欢迎大家通过issue贡献更多的prompts！

概述

使用了Self-Instruct论文中介绍的技术生成了52K条指令数据，同时进行了一些修改，在初步的人类评估中，发现Alpaca 7B模型在Self-Instruct指令评估上的表现类似于text-davinci-003模型。

使用Alpaca模型的在线演示服务，我们发现该模型在中文上的表现还不太好。推测是因为Stanford Alpaca 的种子任务都是英语，收集的数据也都是英文，因此训练出来的模型未对中文优化。为了提升在中文上的效果，本项目基于Stanford Alpaca优化了中文种子任务，对生成代码进行了一些修改，同时选用BLOOMZ-7B作为基础模型训练得到了一个可以更好支持中文指令的开源模型 - BELLE。

我们开源基于Alpaca的数据收集代码，基于这段代码生成了约18万条中文数据，结合Alpaca的5万条英文数据，在BLOOMZ-7B模型训练得到的checkpoint上传在Hugging Face。

数据发布

zh_seed_tasks.jsonl：包含175个种子任务，样例如下
{"id": "seed_task_20", "name": "horror_movie_opening", "instruction": "你需要为一部恐怖电影写一个创意的开场场景。", "instances": [{"input": "", "output": "太阳已经落山，留下了一个黑暗的小镇。微风吹拂空荡的街道，让每一个冒险走出门外的人感到一阵寒意。唯一的声音是被风吹动的树叶发出的轻微沙沙声。突然，一声令人毛骨悚然的尖叫声划破了寂静，随后是玻璃破碎的声音。一所房子亮起了灯光，可以看到一个人影朝镇中心奔跑。当>那个人影越来越靠近时，清楚地看到那是一个年轻女子，她浑身血迹斑斑。"}], "is_classification": false}
prompt_cn.txt: 生成所使用的提示语

数据生成

沿用Alpaca的方式：

pip install -r requirements.txt export OPENAI_API_KEY=YOUR_API_KEY python generate_instruction.py generate_instruction_following_data

默认使用Completion API，模型text-davinci-003。如果想使用Chat API并使用gpt-3.5-turbo模型，可通过参数控制：

python generate_instruction.py generate_instruction_following_data \ --api=chat --model_name=gpt-3.5-turbo

输出文件在Belle.train.json，可以人工筛选后再使用。

模型调优

我们基于BLOOMZ-7B1-mt模型和Belle.train.json训练模型，具体参数如下：

参数	值
Batch size	64
Learning rate	3e-6
Epochs	3
Weight_decay	0.001
Warmup_rate	0.1
LR_scheduler	linear

我们已经将训练得到的模型参数开源：https://huggingface.co/jay68/BELLE-7B-0.2M

局限性和使用限制

基于当前数据和基础模型训练得到的SFT模型，在效果上仍存在以下问题：

在涉及事实性的指令上可能会产生违背事实的错误回答。
对于具备危害性的指令无法很好的鉴别，由此会产生危害性言论。
在一些涉及推理、代码等场景下模型的能力仍有待提高。

基于以上模型局限性，我们要求开发者仅将我们开源的代码、数据、模型及后续用此项目生成的衍生物用于研究目的，不得用于商业，以及其他会对社会带来危害的用途。

引用

如果使用本项目的代码、数据或模型，请引用本项目。

@misc{BELLE, author = {Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Baochang Ma, Xiangang Li}, title = {BELLE: Bloom-Enhanced Large Language model Engine }, year = {2023}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {\url{https://github.com/LianjiaTech/BELLE}}, }

当然，你也需要引用原始的BLOOM论文、Stanford Alpaca和Self-Instruct论文。

BELLE: Bloom-Enhanced Large Language model Engine

This project is from Stanford Alpaca which aims to build and share instruction-following LLaMA model.
The seed tasks in Stanford Alpaca are English only, and the model performs relatively poorly in Chinese.

This project optimizes Chinese performance in addition to original Alpaca. The model finetuning uses only data generated via ChatGPT (without other data). This repo contains:

The 175 chinese seed tasks used for generating the data
The code for generating the data
The 183,536 generated data used for fine-tuning the model
The model finetuned from BLOOMZ-7B1-mt on data generated by this project

More prompts are welcomed via issues!

Overview

Stanford Alpaca mentioned

The current Alpaca model is fine-tuned from a 7B LLaMA model on 52K instruction-following data generated by the techniques in the Self-Instruct paper, with some modifications... . In a preliminary human evaluation, we found that the Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite.

From the web demo of Alpaca, we found it's performance on Chinese is not as well. We speculate the reason to be that the seed tasks of Stanford Alpaca are all English, and the generated data are also in English, so model tuned on it is not optimized for Chinese. This project aims to boost Chinese performance with improved Chinese seed tasks based on Stanford Alpaca, some modification to to instruction generation code, and also BLOOMZ-7B as the base model. The result is a model which better supports Chinese - BELLE.

The instruction generation code and finetuned model checkpoint Hugging Face trained on the generated dataset (approx. 180k instruction and answer pairs, plus original ~50k Alpaca pairs) based on BLOOMZ-7B are both open sourced.

Data Release

zh_seed_tasks.jsonl contains 175 seed tasks, for example:
{"id": "seed_task_20", "name": "horror_movie_opening", "instruction": "你需要为一部恐怖电影写一个创意的开场场景。", "instances": [{"input": "", "output": "太阳已经落山，留下了一个黑暗的小镇。微风吹拂空荡的街道，让每一个冒险走出门外的人感到一阵寒意。唯一的声音是被风吹动的树叶发出的轻微沙沙声。突然，一声令人毛骨悚然的尖叫声划破了寂静，随后是玻璃破碎的声音。一所房子亮起了灯光，可以看到一个人影朝镇中心奔跑。当>那个人影越来越靠近时，清楚地看到那是一个年轻女子，她浑身血迹斑斑。"}], "is_classification": false}
prompt_cn.txt Chinese prompt for generating instructions

Data Generation Process

Following Alpaca:

pip install -r requirements.txt export OPENAI_API_KEY=YOUR_API_KEY python generate_instruction.py generate_instruction_following_data

Uses the Completion API and text-davinci-003 model by default. To use Chat API and gpt-3.5-turbo model, just change the arguments:

python generate_instruction.py generate_instruction_following_data \ --api=chat --model_name=gpt-3.5-turbo

Generated instructions are in Belle.train.json, you can check manually before using it.

Fine-tuning

Finetuning is done based on BLOOMZ-7B1-mt and Belle.train.json using the following hyperparameters:

Hyperparameter	Value
Batch size	64
Learning rate	3e-6
Epochs	3
Weight_decay	0.001
Warmup_rate	0.1
LR_scheduler	linear

Trained checkpoint: https://huggingface.co/jay68/BELLE-7B-0.2M

Limitation and Usage Limits

There still exists a few issues in the model trained on current base model and data:

The model might generate factual errors when asked to follow instructions related to facts.
Occasionally generates harmful responses since the model still struggles to identify potential harmful instructions.
Needs improvements on reasoning and coding.

Since the model still has its limitations, we require developers only use the open-sourced code, data, model and any other artifacts generated via this project for research purposes. Commercial use and other potential harmful use cases are not allowed.

Citation

Please cite us when using our code, data or model.

@misc{BELLE, author = {Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Baochang Ma, Xiangang Li}, title = {BELLE: Bloom-Enhanced Large Language model Engine }, year = {2023}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {\url{https://github.com/LianjiaTech/BELLE}}, }

Cite the original BLOOM, Stanford Alpaca and Self-Instruct papers as well!

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.gitignore		.gitignore
DATA_LICENSE		DATA_LICENSE
LICENSE		LICENSE
README.md		README.md
generate_instruction.py		generate_instruction.py
prompt_cn.txt		prompt_cn.txt
requirements.txt		requirements.txt
utils.py		utils.py
zh_seed_tasks.json		zh_seed_tasks.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BELLE: Bloom-Enhanced Large Language model Engine

概述

数据发布

数据生成

模型调优

局限性和使用限制

引用

BELLE: Bloom-Enhanced Large Language model Engine

Overview

Data Release

Data Generation Process

Fine-tuning

Limitation and Usage Limits

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BELLE: Bloom-Enhanced Large Language model Engine

概述

数据发布

数据生成

模型调优

局限性和使用限制

引用

BELLE: Bloom-Enhanced Large Language model Engine

Overview

Data Release

Data Generation Process

Fine-tuning

Limitation and Usage Limits

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages