Releases: PaddlePaddle/PaddleNLP
v3.0.0-beta2
This update strengthens PaddleNLP's infrastructure, adds the Qwen2.5 and Mixtral 8*22B models, upgrades the Tokenizer, and renames the data indexing tool.
It also fixes issues such as saving and loading MoE model parameters, improves text-processing accuracy, and updates documentation and test cases. Inference performance, hardware support, and auto-parallel training have also been optimized, including support for more models and parameter configurations, multi-GPU inference, stronger support for domestic hardware, and a streamlined distributed training workflow.
Core changes and enhancements
- Infrastructure strengthening:
- Bug fixes:
- Documentation and test updates:
- Other key changes:
  - Inference performance optimization:
  - Expanded hardware support:
  - Auto-parallel optimization:
What's Changed
- [Unified checkpoint] update optimizer async save signal by @DesmonDay in #8975
- Correct the run_dpo.py file path by @Mangodadada in #8952
- fix the loss base in llama_align_dygraph_dy2st_auto_bs2_bf16_DP2-MP1-… by @winter-wang in #8986
- [Bug fix] fix skip consumed_samples twice bug by @zhangyuqin1998 in #8980
- fix pip error in legacy benchmarks by @fightfat in #8978
- 【auto_parallel】Add checkpoint convertor by @xingmingyyj in #8847
- [llm]update finetune.md by @lugimzzz in #8990
- After the tool_helpers upgrade, up to 32766 datasets are supported. by @JunnYu in #8994
- add DCU inference docs by @YanhuiDua in #8983
- [Distributed]Add loss nan/inf checker by @ForFishes in #8943
- 【llm】update docs by @lugimzzz in #8999
- [Feature] Fused Mixtral support by @penPenf28 in #8901
- [XPU] Add README.md for llama2-7b by @xiguapipi in #8979
- Add gcu llama readme by @EnflameGCU in #8950
- fix qwen model use_casual_mask by @deepllz in #9009
- [ZeroPadding] revert zero_padding #8973 by @DrownFish19 in #9003
- [LLM Inference] Fix step.cu bug by @yuanlehome in #8995
- Refine checkpoint converter by @zhangbo9674 in #9001
- [Feature] fused mixtral wint4 by @penPenf28 in #9013
- llm inference docs by @Sunny-bot1 in #8976
- [LLM Inference] Support Qwen2_Moe Inference Model by @CJ77Qi in #8892
- fix llama3 static run by @yuanlehome in #8849
- [paddle inference cpu]update cpu inference by @bukejiyu in #8984
- fix the tipc ce case by @wawltor in #8748
- [Cherry-pick] Add is_distributed field in sharding reshard param_meta by @sneaxiy in #9028
- [Tokenizer] Support for loading added_tokens_decoder by @DrownFish19 in #8997
- [Inference] Add a8w8(fp8) a8w8c8(int8) quant_type support by @lixcli in #9032
- Fix checker of nan/inf by @ForFishes in #9029
- [Cherry-pick] add comm buffer size (#8963) by @ForFishes in #9031
- [Unified Checkpoint] Update async save info by @DesmonDay in #8982
- [llm]support pad to max_length & fix sp bug by @lugimzzz in #9040
- [Bugfix] fix bias optional by @penPenf28 in #9037
- fix setup.py for llm inference by @yuanlehome in #9041
- [Inference] Add cutlass gemm dequant op by @gzy19990617 in #8909
- [Inference] update fakequant support by @lixcli in #9047
- add test for pir sequence parallel on llama model by @liym27 in #9015
- Fix moe save load by @Meiyim in #9045
- Update quantization.md by @ZHUI in #9057
- 【Fix】Initialize dp degree in single GPU by @greycooker in #9056
- fix bos download by @westfish in #9023
- [Inference] Update fakequant script by @lixcli in #9054
- [AutoParallel][PIR] Fit pir grad merge by @AndSonder in #8985
- [MLU] Support rms_norm_mlu by @PeiyuLau in #8504
- [Inference] support llama3 a8w8c8_fp8 inference and cutlass_fp8_gemm by @ckl117 in #8953
- [Inference] Qwen2 support fp8 inference by @ckl117 in #8954
- [Version] update version info by @DrownFish19 in #9060
- [NPU] Fix baichuan2-13b-chat infer by @ronny1996 in #9070
- [MLU] Fix Llama attention_mask in npu and mlu by @DrownFish19 in #9075
- Fix the memory overflow bug of the tune_cublaslt_gemm operator by @Hanyonggong in #9076
- [Inference] Fix weight_only_int4 bug by @lixcli in #9073
- [Auto Parallel] fix data stream bug of dist.to_static by @zhangyuqin1998 in #9077
- fix hang when Flag_dataloader_use_file_descriptor=True by @deepllz in #9080
- fix llm predict install error by @fightfat in #9088
- [PIR] add pir grad merge test by @AndSonder in #9074
- Update readme by @EnflameGCU in #9046
- [LLM] Add tensor parallel for chatglmv2 by @SevenSamon in #9014
- [data] update tool_helpers version and add unittest by @JunnYu in #9093
- fix baseline because of PR#8769 by @fightfat in #9092
- fix use paddle.incubate.jit.inference(model) errors by @chang-wenbin in #9016
- [CI] Fix paddlepaddle install by @DesmonDay in #9102
- [LLM] fix train on npu by @SylarTiaNII in #9101
- Disable ut by @zhangbo9674 in #9108
- [AutoParallel] Enable CI for gradclip by @JZ-LIANG in #9059
- [Inference] Remove ceval from run_finetune by @lixcli in #9100
- [Bugfix] fix multi-gpu infer by @penPenf28 in #9107
- 【Inference】fix step kernel by @gzy19990617 in #9122
- [DCU] fix DCU w8a8c8 GEMM shape by @YanhuiDua in #9115
- [Inference] FP8 gemm auto-tune by @ckl117 in #9094
- Open ut llama_align_dygraph_dy2st_pir_auto_grad_merge_bs2_fp32_DP1-MP1-PP1 by @zhangbo9674 in #9120
- [LLM Inference] Support Qwen2_Moe Inference with MultiGPU by @CJ77Qi in #9121
- [Unified Checkpoint] Fix uc lora config, fix release_grads by @DesmonDay in #9082
- [Inference]qwen2-a8w8c8 support use_fake_parameter by @ckl117 in #9109
- Add fast_ln spmd rules by @From00 in #9125
- fix pir dtype by @wanghuancoder in #9130
- Remove ring_flash_attention warning by @DrownFish19 in #9119
- [DOC] Fix LLM page 404 Not Found by @DrRyanHuang in #9127
- Add hardware flops for pretraining by @ZHUI in #9069
- [Benchmark] Fix amp level bug in some gpt tests by @zhangbo9674 in #9116
- [Auto Parallel] Fix ckpt_converter for auto_parallel by...
v3.0.0-beta1
Upgrading PaddleNLP from v3.0.0-beta0 to v3.0.0-beta1 brings a number of important updates and enhancements. The Yuan, mamba, and jamba models are newly introduced, and the LLM inference code has been optimized for better compatibility and efficiency.
On the performance side, a fast tokenizer has been added, MoE optimizer parameter broadcasting has been implemented, and layer normalization has been accelerated. Several bugs have also been fixed, including the safetensors shape slicing issue and the mmap issue on Windows, improving stability and compatibility.
Documentation and tests have been comprehensively updated and polished to keep the docs accurate and the code readable. Support for domestic hardware has also been strengthened, including DCU and XPU optimizations, along with configuration updates for PIR mode and auto-parallel.
Major changes and new features
1. New models and features
- New models: the Yuan model was introduced in #8654; the mamba and jamba models were added in #8513 and #8517 respectively, with related bugs fixed in follow-up pull requests to keep the models running stably.
- LLM inference optimization: across multiple pull requests, the LLM inference code was optimized and support for new models and parameters was added, further improving inference efficiency and compatibility.
2. Core performance optimizations
- Fast tokenizer: in #8832, a fast tokenizer built on the `tokenizers` library was added, significantly improving tokenization speed (see the sketch after this list).
- MoE optimization: in #8810, broadcasting of MoE (Mixture of Experts) optimizer parameters was implemented, improving training efficiency.
- Layer normalization acceleration: multiple pull requests added fast_rmsnorm, enabled use_fast_layer_norm, and updated the benchmark configurations, further speeding up training. In particular, #8717 added support for use_fast_layer_norm during fine-tuning, giving users more flexibility.
- Training performance optimization: in #8803, the `enable_sp_async_reduce_scatter` option was added, effectively optimizing training performance.
- Dictionary argument support: in #8446, the trainer's argparser gained support for dictionary arguments, making argument passing more flexible. In #8904, the tensorboard requirement was also updated for compatibility with the latest version.
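A minimal loading sketch for the fast tokenizer described above. The `use_fast=True` switch and the checkpoint name are assumptions for illustration; check the tokenizer documentation of this release for the exact flag.

```python
from paddlenlp.transformers import AutoTokenizer

# Hedged sketch: use_fast=True is assumed to select the tokenizers-library
# backend added in #8832; the checkpoint name is only an example.
fast_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", use_fast=True)
slow_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "PaddleNLP makes large models easier to use."
print(fast_tok(text)["input_ids"])
print(slow_tok(text)["input_ids"])  # both backends should produce the same ids
```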
3. Bug fixes
- safetensors fix: #8702 fixed the safetensors shape issue.
- Windows mmap fix: #8734 fixed the mmap issue, improving Windows compatibility.
- Other bug fixes: including fixes in #8687, #8730, and several other pull requests.
4. Documentation and test updates
- Documentation: multiple pull requests updated documentation, cleaned up code style, and refreshed version information, keeping the docs accurate and readable.
- README fixes and enhancements: #8741 fixed broken links in the README; several contributors also updated the README and added new test cases, keeping documentation and code in sync.
5. Other important changes
Stronger support for domestic hardware
- DCU support: #8580 implemented high-performance LLM training and inference on DCU, broadening PaddleNLP's hardware coverage.
- XPU optimization: #8527 added LoRA optimizations for XPU; #8697 implemented allgather on XPU and #8710 fixed the unified checkpoint gather issue, further improving training efficiency on XPU.
PIR mode support
- Export and loading: #8689 changed how the llama model is exported in PIR mode; #8712 and #8766 added support for loading or saving Llama2-7b in three formats (legacy IR, PIR model file, PIR JSON file), giving users more flexibility and compatibility.
Auto-parallel optimization
- Configuration updates: #8679 changed `max_steps` in the Llama2-7b config to suit auto-parallel; #8767 and #8828 refined the auto trainer's save and load paths; #8750 updated the loss for global gradient clipping, further improving the efficiency and accuracy of auto-parallel training.
What's Changed
- [DCU] high performance LLM train and inference for DCU by @yuguo-Jack in #8580
- fix benchmark dir and add CUDA_DEVICE_MAX_CONNECTIONS to qwen by @fightfat in #8678
- bug fix by @wtmlon in #8687
- [XPU] add lora optimization by @dynamicheart in #8527
- [pir save] Modiy export llama model file in pir mode by @xiaoguoguo626807 in #8689
- [AutoParallel]Change `max_steps` in Llama2-7b config for auto-parallel. by @heavyrain-lzy in #8679
- [benchmark] Change the mirror source for pip by @mmglove in #8699
- update loss base of auto-parallel tests by @zhiqiu in #8701
- Add new mistral by @wtmlon in #7425
- [Safetensors] Fix safetensors shape by @DesmonDay in #8702
- [BUG] Round num_samples down to prevent prefetch from reading past the end of the dataset... by @JunnYu in #8690
- xpu use allgather by @FeixLiu in #8697
- add fast_rmsnorm by @deepllz in #8680
- enable use_fast_layer_norm for llama2 benchmark by @deepllz in #8714
- fix xpu gather for unified ckpt by @FeixLiu in #8710
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8712
- fix fast_ln backward by @deepllz in #8719
- finetune support use_fast_layer_norm by @tianhaodongbd in #8717
- bug fix by @FeixLiu in #8730
- disable lora by @lugimzzz in #8674
- [Safetensors] Fix mmap for Windows system by @DrownFish19 in #8734
- correct broken links in readme by @jzhang533 in #8741
- revert benchmark fix by @ronny1996 in #8747
- [LLM] Add Yuan model by @zhaogf01 in #8654
- fix nlp dir and auto_parallel_ci exit -6 by @fightfat in #8744
- [LLM] Update sequence parallel linear import by @DrownFish19 in #8706
- [Bug fixes] Fix ring attention by @zhangyuqin1998 in #8740
- update a100 loss by @zhiqiu in #8708
- [PaddleNLP 3.0] Update README by @DrownFish19 in #8681
- [AutoParallel] update loss for global clip by @JZ-LIANG in #8750
- [NPU] Fix sequence parallel lib import by @DrownFish19 in #8760
- [DEV] Update develop version show by @DrownFish19 in #8754
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8766
- add benchmark baichuan2 scripts by @fightfat in #8683
- Add the missing truncation=True in llm/predictor.py by @lszxb in #8768
- fix the ce for the unittest by @wawltor in #8772
- Enable parallel_config to use commas as delimiters. by @Difers in #8677
- fix incorrect token counting in `llm/predictor.py` by @lszxb in #8769
- Refine savable by @ZHUI in #8758
- [CodeStyle] remove markdownlint-cli by @DrownFish19 in #8779
- [XPU] use allgather and fp32 multinomial for XPU by @houj04 in #8787
- fix version show by @DrownFish19 in #8791
- [BUG] Add 20 redundant data in post pretrain by @JunnYu in #8789
- vera-pissa method added by @TranscenderNing in #8722
- update version by @DrownFish19 in #8792
- [Inference LLM] refine some code in llama wint8/4 by @yuanlehome in #8796
- [DCU] Llama a8w8 inference performance optimization by @Deleter-D in #8800
- [Prediction] Update LLM prediction. by @DesmonDay in #8778
- [Trainer] Add enable_sp_async_reduce_scatter by @DesmonDay in #8803
- [AutoParallel] Refine auto_trainer save load by @zhangbo9674 in #8767
- [MoE] Optimizer parameter broadcast by @DesmonDay in #8810
- [Doc] Update README by @DrownFish19 in #8817
- support Llama3.1 8B 128K generation on single GPU 80GB by @GuoxiaWang in #8811
- add paddle nv-embed-v1 by @Li-Z-Q in #8785
- fix pad_token_id bug by @yuanlehome in #8814
- [DCU] fix llama inference bug on DCU by @Deleter-D in #8815
- [Doc] Add LLaMA3.1 by @DrownFish19 in #8824
- [BUG] Fix build train valid test datasets by @JunnYu in #8826
- Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file by @Hanyonggong in #8799
- fix tune_cublaslt_gemm compile bug by @yuanlehome in #8844
- [AutoParallel] Refine save and load ckpt for auto_trainer by @zhangbo9674 in #8828
- [Unified Checkpoint] update merge tensor parallel by @DesmonDay in #8856
- [Trainer] update clear_grad by @DesmonDay in #8829
- [Unified Checkpoint] Fix tie_word_embeddings by @DesmonDay in #8795
- [Inference LLM] support static c8 by @yuanlehome in #8833
- support sft mapdataset by @greycooker in #8840
- Cherry pick some changes from incubate branch by @sneaxiy in #8862
- support nested list of dict inputs by @deepllz in #8876
- Fix the bug with issues code 8641. by @smallbenxiong in #8880
- Fix the issue of P-tuning official sample error by @guangyunms in #8884
- modify Paddlemix qwen dytostatic by @xiaoguoguo626807 in #8869
- [llm]fix zeropadding by @lugimzzz in #8895
- Fix the error reported by the fast_ln op when dynamic semi-auto parallel is enabled by @Wennie396 in #8891
- enable_sp_async_reduce_scatter for qwen_72b && llama2_70b by @deepllz in #8897
- Update run_pretrain.py by @...
v3.0.0-beta0
We are pleased to announce that the PaddlePaddle large model suite has released v3.0.0-beta: embrace large models with a fully upgraded experience. The main work is as follows:
- Unified the large model toolchain, enabling end-to-end support for domestic compute chips;
- Full support for industrial-grade large model workflows, including PaddlePaddle 4D parallel configuration, efficient fine-tuning strategies, efficient alignment algorithms, and high-performance inference;
- Self-developed, fast-converging RsLoRA+ algorithm, the Unified Checkpoint storage mechanism with automatic scaling, and generalized support for FastFFN and FusedQKV to accelerate large model training and inference;
- Continued support and updates for mainstream models, providing efficient solutions.
Large model fine-tuning, alignment, training, and inference optimization
- PEFT:
- DPO:
- Domestic chip support:
- Performance optimization:
- Other
  - Added model memory monitoring in #8269
New models
- Added the Gemma model in #8082
- google/gemma-7b
- google/gemma-7b-it
- google/gemma-2b
- google/gemma-2b-it
- Added the Llama3 models
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B
- meta-llama/Meta-Llama-3-70B-Instruct
- Added the Qwen2 series models in #8338 #8584 #8601 (see the loading sketch after the model list)
- Qwen/Qwen1.5-0.5B
- Qwen/Qwen1.5-0.5B-Chat
- Qwen/Qwen1.5-1.8B
- Qwen/Qwen1.5-1.8B-Chat
- Qwen/Qwen1.5-4B
- Qwen/Qwen1.5-4B-Chat
- Qwen/Qwen1.5-7B
- Qwen/Qwen1.5-7B-Chat
- Qwen/Qwen1.5-14B
- Qwen/Qwen1.5-14B-Chat
- Qwen/Qwen1.5-32B
- Qwen/Qwen1.5-32B-Chat
- Qwen/Qwen1.5-72B
- Qwen/Qwen1.5-72B-Chat
- Qwen/Qwen1.5-110B
- Qwen/Qwen1.5-110B-Chat
- Qwen/Qwen1.5-MoE-A2.7B
- Qwen/Qwen1.5-MoE-A2.7B-Chat
- Qwen/Qwen2-0.5B
- Qwen/Qwen2-0.5B-Instruct
- Qwen/Qwen2-1.5B
- Qwen/Qwen2-1.5B-Instruct
- Qwen/Qwen2-7B
- Qwen/Qwen2-7B-Instruct
- Qwen/Qwen2-72B
- Qwen/Qwen2-72B-Instruct
- Qwen/Qwen2-57B-A14B
- Qwen/Qwen2-57B-A14B-Instruct
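A minimal loading sketch for one of the checkpoints listed above, as referenced from the Qwen2 item. The `dtype` hint is an illustrative assumption; larger variants require multi-GPU or quantized deployment.

```python
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

# Any checkpoint name from the list above can be substituted here.
model_name = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# dtype is an optional, illustrative hint; adjust it to your hardware.
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="bfloat16")
```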
Framework upgrades
- Feature optimization:
- AutoParallel optimization
- Distributed capability optimization:
- Chat capability optimization:
  - Added chat templates in #8226 (see the sketch after this list)
- Other
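A hedged sketch of the chat-template support added in #8226: the tokenizer renders a conversation into the prompt format the model expects. The method name follows the common `apply_chat_template` convention; the exact signature and the checkpoint name are assumptions.

```python
from paddlenlp.transformers import AutoTokenizer

# Illustrative only: checkpoint name and exact signature are assumptions.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")

# Render a single user turn with the model's chat template applied.
rendered = tokenizer.apply_chat_template("Hello, who are you?")
print(rendered)
```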
Bug fixes
- Fixed the bug when the sharding degree is less than 100 in #8146
- Fixed TP/PP parameter merging in #8239
- Fixed the inconsistency between tensor.shape and paddle.shape(tensor) in #8260
- Fixed the bug with fp16 + delay_scale_loss_scale + sharding_stage1_overlap in #8314
- Added pipelines usage documentation and hints in #8292 #8308 #8202 #8353
- Fixed the tokenizer input for the text feature extraction task in #8331
- Fixed import errors in #8332 #8367
Restructuring
Adjusted the PaddleNLP file structure in #8609 #8613 #8605 #8614 #8617 #8626 #8618 #8625 #8619 #8629 #8601 #8627 #8666
What's Changed
- [dist]pip requirements-dev.txt by @Liujie0926 in #8258
- add scaling by @lugimzzz in #8256
- [LLM]Support Gemma model by @Southpika in #8082
- [BugFix] Try except sequence parallel utils by @DesmonDay in #8189
- Update CodeCov GitHub Action by @sijunhe in #8268
- [AutoParallel] Open recompute strategy for llama model by @zhangbo9674 in #8265
- Fix sharding < 100 limitation bug by @sneaxiy in #8146
- use tensor.shape bug not paddle.shape(tensor) by @wanghuancoder in #8260
- [dist CI]update paddlenlp install for CI by @Liujie0926 in #8267
- [Bug Fix]Fix merge parameters in pp by @Southpika in #8239
- [LLM] add memory stats to logger of trainer by @SylarTiaNII in #8269
- Add p2p_comm_overlap for Llama-2-70b benchmark. by @Xreki in #8276
- add a100 test ground truth by @zhiqiu in #8249
- [paddle-pipelines] faq semantic search question answering reamde by @w5688414 in #8292
- [paddle-pipelines] Add pipelines documentation by @w5688414 in #8308
- Support llama-3 by @ZHUI in #8307
- [Distributed] [CustomDevices] Adapt SP on lora && polish MC2 APIs by @SylarTiaNII in #8303
- fix bug for fp16 + delay_scale_loss_scale + sharding_stage1_overlap by @FeixLiu in #8314
- [paddle-pipelines] Update mkdocs by @w5688414 in #8310
- [benchmark]update llama2_ips by @Liujie0926 in #8322
- [dist CI]fix before_hook by @Liujie0926 in #8283
- benchmark llama worker=1 by @wanghuancoder in #8305
- 【AutoParallel】Add llama2 UT for auto-parallel by @heavyrain-lzy in #8300
- Add system env log for llama test by @zhangbo9674 in #8321
- [LLM] Support fuse attention q, k, v weights by @DrownFish19 in #8202
- [Distributed] fix lora by @SylarTiaNII in #8325
- fix try import by @w5688414 in https://github.com/PaddlePaddle/Pa...
v2.8.1
What's Changed
- [Trainer] Fix sharding overlap bug by @DesmonDay in #8334
- [Cherry-pick] update truncate by @KB-Ding in #8375
- [BugFix] Fix llama3 `eot_id`. by @ZHUI in #8373
- [Trainer] update distributed dataloader by @DesmonDay in #8426
- [BugFix] Fix load rng compatibility. by @ZHUI in #8451
- Cherry pick/fast_safe_open by @ZHUI in #8458
- 【cherry pick】adapter new type promotion rule for Paddle 2.6 by @zxcd in #8463
- Quick fix from pretrained. by @ZHUI in #8487
- Release/2.8 by @Galaxy1458 in #8437
- Fix from_pretrained `os.path.split` by @DesmonDay in #8508
- [fea] Cherry-picked MOE updates from develop by @bo-ke in #8531
- [LLM] relocate tensor_parallel_output to avoid conflict (#8419) by @DesmonDay in #8533
- Update sequence_parallel for predict by @DesmonDay in #8547
- Cp/fix by @ZHUI in #8569
- Do not save moe_group by @DesmonDay in #8570
- [Release] 2.8.1 by @ZHUI in #8636
Full Changelog: v2.8.0...v2.8.1
v2.8.0
We are pleased to announce that the PaddlePaddle large model suite has released v2.8.0. This release deeply optimizes the suite's fine-tuning and alignment capabilities and improves its training and inference on domestic compute hardware. The main work is as follows:
- Specialized fine-tuning and efficient alignment: the self-developed, fast-converging RsLoRA+ algorithm greatly improves PEFT convergence speed and training quality; high-performance generation acceleration is introduced into the RLHF PPO algorithm, removing the generation bottleneck in PPO training and delivering clearly leading PPO training performance.
- Faster large model training: generalized support for FastFFN, FusedQKV, and other training performance optimizations makes large model training faster and more stable.
Large model fine-tuning, alignment, training, and inference optimization
- Fine-tuning
- Inference
- Added static-graph inference for QWenVL #7808
New models
- Added static-graph inference for QWenVL #7808
- Added the Deberta and Debertav2 models #8227
- deepset/deberta-v3-large-squad2
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
- microsoft/deberta-base
- Added Mixtral (mixture of experts) #7803
- mistralai/Mixtral-8x7B-Instruct-v0.1
- mistralai/Mixtral-8x7B-v0.1
- Added Llama3 #8315
- meta-llama/Meta-llama-3-8b
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-llama-3-70b
- meta-llama/Meta-Llama-3-70B-Instruct
Framework upgrades
- Trainer upgrades
- AutoParallel upgrades
- Other
Other support
- Added the matryoshka representation learning retrieval strategy, saving compute and storage resources. #8165
Bug fixes
- Adjusted the log level and added timelog timing logs, compatible across devices. #8261
- Fixed inconsistent randomly initialized shared weights in pipeline parallelism, covering GPT/OPT and other models. #7772
- Disabled downloading from the huggingface hub in CI and unit tests #7798 #8198
- Fixed the llm Gradio demo repeatedly concatenating the query and history when chat templates are enabled. #7992
- Fixed the key error when downloading the GPT model. #8253
- Fixed LlamaRotaryEmbedding #7882
- Fixed the allreduce dtype issue #7876
- Fixed the issue caused by the framework dev branch cleaning up the paddle.jit.dy2static.utils_helper API #7989
- Fixed the read-data timer when ignore_data_skip=False and skip_profile_timer=False. #8177
- Fixed Wandb unit test issues #8066 #8056
- Fixed the error when the Trainer parses a JSON file and command-line list arguments at the same time #7860
- Fixed inference issues in the Gradio UI #7740 #7788
- Fixed basic Tokenizer issues #7797 #7870
- Fixed loading the rng state on custom devices. #7894
- Fixed garbled output when printing the BF16 loss in auto-parallel #7874
- Initialized models in float to fix the AMP error in static-graph auto-parallel #8033 #8199
- Fixed incorrect usage of the ShardDataloader interface under pipeline parallelism #8014
- Fixed the llama accuracy issue on custom devices. #7895
- Fixed the NPU AICPU operator issue #7976
- Fixed FusedLinearWithGradAdd missing arguments. #8178
What's Changed
- [Unified Checkpoint] Add unified checkpoint training args doc. by @DesmonDay in #7756
- [AutoParallel] Auto Trans PP to VPP by @zhaoyinglia in #7747
- Add codecov check by @zjjlivein in #7760
- [CE] Delete gpt_for_sequence_classification by @ZHUI in #7757
- [DOC] Update trainer.md by @ZHUI in #7761
- [Release] Change version to 2.7.0 by @ZHUI in #7764
- [benchmark]close skip_memory_metrics for ips by @Liujie0926 in #7732
- [Release] Update release.yml to release tags by @ZHUI in #7765
- [AutoParallel] Add Sequence Parallel for Static LLaMA by @JZ-LIANG in #7746
- [New Features] support dynamic src_length by @wj-Mcat in #7740
- Fix unified_checkpoint bug by @DrownFish19 in #7770
- [DONE] aistudio, hf hub, bos update download by @JunnYu in #7608
- [Trainer] Fix dist dataloader eval by @DesmonDay in #7777
- [Paddle-pipelines] Update convert_files_to_dicts_splitter by @w5688414 in #7748
- [PEFT]fix lora model tp when existing other trainable module by @lugimzzz in #7781
- [Paddle-Pipelines] update faiss by @qingzhong1 in #7793
- Fix shared weights sync for PipelineLayer by @DrownFish19 in #7772
- [tests] download slow by @JunnYu in #7798
- [INFER][LLM] Support qwen in fined grained dybatch v1 by @DanGuge in #7644
- Add CE for Distributed Hybrid Parallel by @iosmers in #7782
- add MP2-SP2-pp4-vpp2-SD2-stage1-mbs2-acc8 ce by @tianhaodongbd in #7774
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7806
- pipeline parallel benchmark by @zhangting2020 in #7759
- [Bug fixes] fix br gradio by @wj-Mcat in #7788
- delete useless code for write_cache_kv.cu by @yuanlehome in #7812
- [llm]support qlora pp by @lugimzzz in #7801
- Trainer support simultaneously parse JSON files and cmd arguments. by @greycooker in #7768
- [LLM] Support block_attention/cachekv quant for llama by @RichardWooSJTU in #7649
- [Bug Fix] fix paddle multipy_fwd_func warning message by @BeingGod in #7818
- [llm]fix lora by @lugimzzz in #7824
- fused rms spmd by @liuzhenhai93 in #7830
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7827
- [neural search][fix bug of evaluate.py] by @ZeyuTeng96 in #7832
- [neural search] fix the bug of reading files when calculating the recall scores by @shenghwa in #7836
- [Bug fixes] update chatglm tokenizer by @wj-Mcat in #7797
- [semantic_indexing] fix bug of evaluate.py by @ZeyuTeng96 in #7843
- [faq] fix bug of evaluate.py by @ZeyuTeng96 in #7840
- [text_classification_retrieval_based] fix bug of evaluate.py by @ZeyuTeng96 in #7844
- [LLM] add Qwen-7B-Chat to PaddleNLP unit test by @ziangqin-baidu in #7823
- Support 5.2 bloom by @zhoutianzi666 in #7846
- [unified checkpoint] Fix last checkpoint save by @DrownFish19 in #7854
- [unified checkpoint] fix checkpoint names by @DrownFish19 in #7795
- [New Features]add ranks testing for test_predictor by @wj-Mcat in #7800
- [Auto Parallel] Support dynamic semi-auto training in Llama2 model by @haohongxiang in #7851
- [CI] add ci approval pipelines by @zjjlivein in #7859
- [fix] fix a bug of trainer/argparser.py by @greycooker in #7860
- [Improvement] fix ops improting in utils by @wj-Mcat in #7865
- [Add CE] Add CE for Hybrid Parallism by @iosmers in #7817
- [Unified Checkpoint] Cherry pick empty cache. by @ZHUI in #7868
- Add PPO training. by @guoshengCS in #7305
- Update reward_main.py by @wawltor in #7880
- Update ppo_main.py by @wawltor in #7881
- [LLM] revert benchmark codes by @RichardWooSJTU in #7871
- [LLM]support QWenVL second part by @DanGuge in #7808
- [Bug Fixes] update chatglm1 tokenizer by @wj-Mcat in #7870
- 【AutoParallel】Support 'master_grad' in Llama in static auto-parallelism by @heavyrain-lzy in #7658
- [Bug Fix] fix slice bug in LlamaRotaryEmbedding by @MarioLulab in #7882
- 【AutoParallel】Support bf16 loss in static by @heavyrain-lzy in #7874
- [Bug Fix] fix allreduce tensor dtype by @BeingGod in #7876
- [CE] Add Qwen into CE process by @ziangqin-baidu in #7887
- [Hackathon 5th No.73] ToT by @ErnestinaQiu in #7660
- [CustomDevice] fix loading rng state on custom devices by @SylarTiaNII in #7894
- [LLM] ...
v2.7.2
This release fixes a number of minor issues.
What's Changed
- [Unified Checkpoint] fix checkpoint names by @DrownFish19 in #7794
- [Unified Checkpoint] Fix last checkpoint save by @DrownFish19 in #7810
- [PEFT] Cherry pick lora fix by @lugimzzz in #7826
- [Unified Checkpoint] Fix unified checkpoint by empty cache. by @ZHUI in #7855
- [Fix Download] update converted logic & fix hf hub download subfolder bug by @JunnYu in #7911
- [Cherry-pick] logger level by @KB-Ding in #7920
- [Cherry-pick] RuntimeTimer for the toolkit (#7913) by @KB-Ding in #7921
- [Release] 2.7.2 for paddlenlp bugfix. by @ZHUI in #7892
Full Changelog: v2.7.1...v2.7.2
v2.7.1
This release fixes a number of minor issues.
What's Changed
- Fixed several issues encountered when resuming training @ZHUI in #7771
- Fixed GPT initialization under pipeline mode @DrownFish19 in #7775
- Fixed a dist dataloader issue during evaluation. @DesmonDay in #7778
Full Changelog: v2.7.0...v2.7.1
PaddleNLP 2.7.0 Release Note
We are pleased to announce that the PaddlePaddle large model suite has released v2.7.0. This release deeply optimizes the suite's large model capabilities, with major improvements in usability, performance, and stability.
Overall, the highlights of this release are:
- Unified large model toolchain entry point. The implementation code for pre-training, fine-tuning, compression, inference, and deployment is consolidated under the PaddleNLP/llm directory.
- Brand-new large model toolchain documentation, guiding users all the way from getting started to production deployment. See: https://paddlenlp.readthedocs.io/zh/latest/llm/finetune.html
- Unified Checkpoint storage mechanism. Checkpoints store model weights, optimizer state, and other data in a unified safetensors format, independent of the distributed strategy, and support dynamic scaling when resuming training, greatly improving the generality of large model storage.
- Upgraded efficient fine-tuning: quantization can be combined with LoRA, and algorithms such as QLoRA are supported.
End-to-end large model training and inference
- Pre-training
- Unified the pre-training entry point to `llm/run_pretrain.py`.
- Supported pre-training for qwen and other models, with flash attention support.
- Fine-tuning
- Supported using LoRA together with Linear quantization (see the sketch after this list)
- Supported using pipeline-parallel models together with LoRA
- Supported the NEFTune method
- Added QLoRA support
- Compression
- Supported PTQ and QAT quantization, including A8W8, WINT8, WINT4, and A8W4
- Supported quantization algorithms such as SmoothQuant, GPTQ, and AWQ
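As referenced in the fine-tuning items above, a hedged sketch of wrapping a base model with LoRA via `paddlenlp.peft`; the target-module patterns, rank, and checkpoint name are illustrative assumptions rather than recommended settings.

```python
from paddlenlp.peft import LoRAConfig, LoRAModel
from paddlenlp.transformers import AutoModelForCausalLM

# Illustrative values only; choose target_modules to match the base model's layers.
model = AutoModelForCausalLM.from_pretrained("facebook/llama-7b")
lora_config = LoRAConfig(
    target_modules=[".*q_proj.*", ".*v_proj.*"],  # regex patterns of linear layers to adapt
    r=8,
    lora_alpha=16,
)
model = LoRAModel(model, lora_config)
model.mark_only_lora_as_trainable()  # freeze everything except the LoRA weights
```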
Unified Checkpoint
- In the large model setting, training usually runs across many cards in a distributed setup, so the model weights in a saved checkpoint are typically sharded, for example split according to tensor parallelism or pipeline parallelism. Storing checkpoints directly according to the distributed strategy is straightforward, but it has the following problems:
- It is unfriendly to downstream inference: to run inference from an intermediate checkpoint, users must merge the sharded weights manually.
- It copes poorly with resuming training when the distributed strategy or the number of training nodes changes; users often have to post-process the checkpoint by hand, which adds operational complexity.
- To address these problems as far as possible and lower the barrier for users, the large model storage framework has been upgraded with a unified storage scheme: Unified Checkpoint. Its core idea is to store model weights, optimizer state, and other data in a unified safetensors format, without distinguishing distributed strategies at save time, which improves the generality of large model checkpoints.
- Unified Checkpoint provides the following features:
- Weight storage is independent of the distributed strategy and uses the safetensors format uniformly;
- Flexibly supports scaling large model training up or down and adapts to switching between different distributed training strategies.
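A hedged illustration of enabling Unified Checkpoint through the Trainer's arguments; the `unified_checkpoint` field mirrors the `--unified_checkpoint` flag listed under the Trainer upgrades below, and the other values are placeholders.

```python
from paddlenlp.trainer import TrainingArguments

# Placeholder values; unified_checkpoint mirrors the CLI flag "--unified_checkpoint".
args = TrainingArguments(
    output_dir="./checkpoints/llama-sft",
    save_steps=500,
    unified_checkpoint=True,  # store weights/optimizer state as safetensors, strategy-agnostic
)
```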
New models
- Retrieval model moka-ai/m3e-base
- Retrieval model BAAI/bge-small-zh-v1.5
Framework upgrades
- Trainer upgrades
- With "--skip_memory_metrics 0", real-time GPU memory and host memory usage is displayed
- Supported "--unified_checkpoint" and "--unified_checkpoint_config" for saving models under hybrid parallelism and restarting with dynamic scaling.
- Added the PretrainModelPipe base class to support pipeline-parallel training.
Other support
- Exposed the paddlenlp commit id via `paddlenlp.version.commit`
- Supported AI Studio download and saving to the AI Studio hub
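For example, the commit id mentioned above can be read as follows (a small sketch; the attribute path is taken directly from the note above).

```python
import paddlenlp

# Print the exact commit the installed paddlenlp package was built from.
print(paddlenlp.version.commit)
```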
Bug fixes
- Fixed several dist_dataloader issues
- Fixed several dynamic-to-static conversion issues
- Fixed several GPT training bugs, removed GPT2, and fixed some seed-setting issues
- Fixed several issues with the baichuan model under pipeline parallelism.
New Contributors
- @Wennie396 made their first contribution in #6897
- @Wong4j made their first contribution in #7008
- @yuanlehome made their first contribution in #7080
- @Xreki made their first contribution in #7105
- @Tom-Zheng made their first contribution in #7092
- @TimeYWL made their first contribution in #7122
- @From00 made their first contribution in #7168
- @RichardWooSJTU made their first contribution in #7186
- @heavyrain-lzy made their first contribution in #7269
- @LokeZhou made their first contribution in #7337
- @JZ-LIANG made their first contribution in #7301
- @WAI-clear made their first contribution in #7402
- @tianhaodongbd made their first contribution in #7293
- @zzjjay made their first contribution in #7504
- @anexplore made their first contribution in #7558
- @niuliling123 made their first contribution in #7528
- @zxcd made their first contribution in #7577
- @MayYouBeProsperous made their first contribution in #7575
- @iosmers made their first contribution in #7613
- @AndSonder made their first contribution in #7343
- @zhink made their first contribution in #7679
- @kingTLE made their first contribution in #7708
Full Changelog: v2.6.1...v2.7.0
v2.6.1
What's Changed
v2.6.1 contains a large number of bug fixes that improve the stability of the LLM models and related components. Beyond bug fixes, the main new features are:
- LLM: added the qwen model, made the InTokens data flow compatible with pipeline parallelism, enabled LLM fine-tuning to load from multiple training files and warm-start, and enhanced LLaMA with different recompute granularities
- Trainer: added the hybrid_parallel_topo_order option and fixed model saving with sharding stage3.
- Paddle-pipelines: added support for ERNIE-Bot-turbo and ERNIE-embedding, updated the hierarchical search example, and enhanced the ChatPaper UI
- Megatron datasets: added support for loading megatron-format datasets, covering the ernie-1.0 and T5 data types
New Contributors
- @xiezheng-XD made their first contribution in #6764
- @carryyu made their first contribution in #6676
- @xiaoxiaohehe001 made their first contribution in #6798
- @MARD1NO made their first contribution in #6865
- @zhoutianzi666 made their first contribution in #6905
- @lchdl made their first contribution in #6964
- @LaiXinyi823 made their first contribution in #6659
Full Changelog: v2.6.0...v2.6.1
v2.6.0
PaddleNLP 2.6 official release: fully upgraded, stepping into the era of large models!
We are pleased to announce that PaddleNLP 2.6 has been fully upgraded and officially released! This upgrade marks our formal entry into the era of large models. PaddleNLP 2.6 introduces a brand-new, end-to-end PaddlePaddle large language model toolchain covering pre-training, fine-tuning, compression, inference, and deployment, providing users with a complete large model solution.
The toolchain fully supports mainstream large models such as LLaMA 1/2, BLOOM, ChatGLM 1/2, GLM, and OPT, so users can try different large models at low cost with a single set of tools.
To support this toolchain, we made extensive upgrades on the underlying infrastructure and framework side:
- The Trainer API was upgraded into a 4D-parallel distributed Trainer, making model training more efficient.
- The efficient fine-tuning algorithms LoRA and Prefix Tuning were implemented, enabling fine-tuning of hundred-billion-parameter models on a single machine.
- Building on PaddleSlim's self-developed quantization algorithms, lossless quantization is implemented across all supported large models.
These upgrades are all aimed at letting users train, optimize, and deploy models more easily in the era of large models. We look forward to your trials and feedback as we advance PaddleNLP together. From 2.5 to 2.6, PaddleNLP gained 40 new contributors; thank you all for supporting PaddleNLP's open-source work!
New Contributors
- @zws-2019 made their first contribution in #5167
- @qiuwenbogdut made their first contribution in #5098
- @kuizhiqing made their first contribution in #5347
- @46319943 made their first contribution in #5419
- @jiaohuix made their first contribution in #5465
- @kangguangli made their first contribution in #5438
- @vivienfanghuagood made their first contribution in #5563
- @zhiboniu made their first contribution in #5470
- @cyber-pioneer made their first contribution in #5598
- @invokerbyxv made their first contribution in #5622
- @megemini made their first contribution in #5658
- @zhenyun-li made their first contribution in #5683
- @solrex made their first contribution in #5736
- @nemonameless made their first contribution in #5487
- @Yulv-git made their first contribution in #5709
- @wangxinxin08 made their first contribution in #5773
- @AlphaHinex made their first contribution in #5815
- @houj04 made their first contribution in #5820
- @Joker1718 made their first contribution in #5816
- @pkuzyc made their first contribution in #5538
- @jadepeng made their first contribution in #5841
- @KB-Ding made their first contribution in #5886
- @parap1uie-s made their first contribution in #5775
- @zirui made their first contribution in #5866
- @GOH-Gu made their first contribution in #5951
- @yangjianfengo1 made their first contribution in #6069
- @zhangting2020 made their first contribution in #5922
- @rogerserper made their first contribution in #6192
- @wtmlon made their first contribution in #6258
- @qingzhong1 made their first contribution in #6251
- @BeingGod made their first contribution in #6307
- @zhiqiu made their first contribution in #6347
- @DesmonDay made their first contribution in #6435
- @cyk1337 made their first contribution in #6447
- @lxp521125 made their first contribution in #6491
- @littsk made their first contribution in #6425
- @RachelXu7 made their first contribution in #6572
- @wanghuancoder made their first contribution in #6539
- @DrownFish19 made their first contribution in #6570
- @GhostScreaming made their first contribution in #6673
Full Changelog: v2.5.2...v2.6.0