Releases: PaddlePaddle/PaddleNLP
v3.0.0-beta2
This update strengthens PaddleNLP's infrastructure, adds the Qwen2.5 and Mixtral 8*22B models, upgrades the Tokenizer, and renames the data indexing tool.
It also fixes issues such as saving and loading MoE model parameters, improves text-processing accuracy, and updates documentation and test cases. Inference performance, hardware support, and auto-parallel training have also been optimized, including support for more models and parameter configurations, multi-GPU inference, stronger support for domestic hardware, and a streamlined distributed training workflow.
Core changes and enhancements
- Infrastructure strengthening:
- Bug fixes:
- Documentation and test updates:
- Other key changes:
  - Inference performance optimization:
  - Expanded hardware support:
  - Auto-parallel optimization:
What's Changed
- [Unified checkpoint] update optimizer async save signal by @DesmonDay in #8975
- Correct the run_dpo.py file path by @Mangodadada in #8952
- fix the loss base in llama_align_dygraph_dy2st_auto_bs2_bf16_DP2-MP1-… by @winter-wang in #8986
- [Bug fix] fix skip consumed_samples twice bug by @zhangyuqin1998 in #8980
- fix pip error in legacy benchmarks by @fightfat in #8978
- 【auto_parallel】Add checkpoint convertor by @xingmingyyj in #8847
- [llm]update finetune.md by @lugimzzz in #8990
- After the tool_helpers upgrade, up to 32766 datasets are supported. by @JunnYu in #8994
- add DCU inference docs by @YanhuiDua in #8983
- [Distributed]Add loss nan/inf checker by @ForFishes in #8943
- 【llm】update docs by @lugimzzz in #8999
- [Feature] Fused Mixtral support by @penPenf28 in #8901
- [XPU] Add README.md for llama2-7b by @xiguapipi in #8979
- Add gcu llama readme by @EnflameGCU in #8950
- fix qwen model use_casual_mask by @deepllz in #9009
- [ZeroPadding] revert zero_padding #8973 by @DrownFish19 in #9003
- [LLM Inference] Fix step.cu bug by @yuanlehome in #8995
- Refine checkpoint converter by @zhangbo9674 in #9001
- [Feature] fused mixtral wint4 by @penPenf28 in #9013
- llm inference docs by @Sunny-bot1 in #8976
- [LLM Inference] Support Qwen2_Moe Inference Model by @CJ77Qi in #8892
- fix llama3 static run by @yuanlehome in #8849
- [paddle inference cpu]update cpu inference by @bukejiyu in #8984
- fix the tipc ce case by @wawltor in #8748
- [Cherry-pick] Add is_distributed field in sharding reshard param_meta by @sneaxiy in #9028
- [Tokenizer] Support for loading added_tokens_decoder by @DrownFish19 in #8997
- [Inference] Add a8w8(fp8) a8w8c8(int8) quant_type support by @lixcli in #9032
- Fix checker of nan/inf by @ForFishes in #9029
- [Cherry-pick] add comm buffer size (#8963) by @ForFishes in #9031
- [Unified Checkpoint] Update async save info by @DesmonDay in #8982
- [llm]support pad to max_length & fix sp bug by @lugimzzz in #9040
- [Bugfix] fix bias optional by @penPenf28 in #9037
- fix setup.py for llm inference by @yuanlehome in #9041
- [Inference] Add cutlass gemm dequant op by @gzy19990617 in #8909
- [Inference] update fakequant support by @lixcli in #9047
- add test for pir sequence parallel on llama model by @liym27 in #9015
- Fix moe save load by @Meiyim in #9045
- Update quantization.md by @ZHUI in #9057
- 【Fix】Initialize dp degree in single GPU by @greycooker in #9056
- fix bos download by @westfish in #9023
- [Inference] Update fakequant script by @lixcli in #9054
- [AutoParallel][PIR] Fit pir grad merge by @AndSonder in #8985
- [MLU] Support rms_norm_mlu by @PeiyuLau in #8504
- [Inference] support llama3 a8w8c8_fp8 inference and cutlass_fp8_gemm by @ckl117 in #8953
- [Inference] Qwen2 support fp8 inference by @ckl117 in #8954
- [Version] update version info by @DrownFish19 in #9060
- [NPU] Fix baichuan2-13b-chat infer by @ronny1996 in #9070
- [MLU] Fix Llama attention_mask in npu and mlu by @DrownFish19 in #9075
- Fix the memory overflow bug of the tune_cublaslt_gemm operator by @Hanyonggong in #9076
- [Inference] Fix weight_only_int4 bug by @lixcli in #9073
- [Auto Parallel] fix data stream bug of dist.to_static by @zhangyuqin1998 in #9077
- fix hang when Flag_dataloader_use_file_descriptor=True by @deepllz in #9080
- fix llm predict install error by @fightfat in #9088
- [PIR] add pir grad merge test by @AndSonder in #9074
- Update readme by @EnflameGCU in #9046
- [LLM] Add tensor parallel for chatglmv2 by @SevenSamon in #9014
- [data] update tool_helpers version and add unittest by @JunnYu in #9093
- fix baseline because of PR#8769 by @fightfat in #9092
- fix use paddle.incubate.jit.inference(model) errors by @chang-wenbin in #9016
- [CI] Fix paddlepaddle install by @DesmonDay in #9102
- [LLM] fix train on npu by @SylarTiaNII in #9101
- Disable ut by @zhangbo9674 in #9108
- [AutoParallel] Enable CI for gradclip by @JZ-LIANG in #9059
- [Inference] Remove ceval from run_finetune by @lixcli in #9100
- [Bugfix] fix multi-gpu infer by @penPenf28 in #9107
- 【Inference】fix step kernel by @gzy19990617 in #9122
- [DCU] fix DCU w8a8c8 GEMM shape by @YanhuiDua in #9115
- [Inference] FP8 gemm auto-tune by @ckl117 in #9094
- Open ut llama_align_dygraph_dy2st_pir_auto_grad_merge_bs2_fp32_DP1-MP1-PP1 by @zhangbo9674 in #9120
- [LLM Inference] Support Qwen2_Moe Inference with MultiGPU by @CJ77Qi in #9121
- [Unified Checkpoint] Fix uc lora config, fix release_grads by @DesmonDay in #9082
- [Inference]qwen2-a8w8c8 support use_fake_parameter by @ckl117 in #9109
- Add fast_ln spmd rules by @From00 in #9125
- fix pir dtype by @wanghuancoder in #9130
- Remove ring_flash_attention warning by @DrownFish19 in #9119
- [DOC] Fix LLM page 404 Not Found by @DrRyanHuang in #9127
- Add hardware flops for pretraining by @ZHUI in #9069
- [Benchmark] Fix amp level bug in some gpt tests by @zhangbo9674 in #9116
- [Auto Parallel] Fix ckpt_converter for auto_parallel by...
v3.0.0-beta1
Upgrading PaddleNLP from v3.0.0-beta0 to v3.0.0-beta1 brings a number of important updates and enhancements. The Yuan, mamba, and jamba models are newly introduced, and the LLM inference code has been optimized for better compatibility and efficiency.
On the performance side, a fast tokenizer has been added, MoE optimizer parameter broadcasting has been implemented, and layer normalization has been accelerated. Several bugs have also been fixed, including the safetensors shape slicing issue and the mmap issue on Windows, improving stability and compatibility.
Documentation and tests have been comprehensively updated and polished to keep the docs accurate and the code readable. Support for domestic hardware has also been strengthened, including DCU and XPU optimizations, along with configuration updates for PIR mode and auto-parallel.
Major changes and new features
1. New models and features
- New models: the Yuan model was introduced in #8654; the mamba and jamba models were added in #8513 and #8517 respectively, with related bugs fixed in follow-up pull requests to keep the models running stably.
- LLM inference optimization: across multiple pull requests, the LLM inference code was optimized and support for new models and parameters was added, further improving inference efficiency and compatibility.
2. Core performance optimizations
- Fast tokenizer: in #8832, a fast tokenizer built on the `tokenizers` library was added, significantly improving tokenization speed (see the sketch after this list).
- MoE optimization: in #8810, broadcasting of MoE (Mixture of Experts) optimizer parameters was implemented, improving training efficiency.
- Layer normalization acceleration: multiple pull requests added fast_rmsnorm, enabled use_fast_layer_norm, and updated the benchmark configurations, further speeding up training. In particular, #8717 added support for use_fast_layer_norm during fine-tuning, giving users more flexibility.
- Training performance optimization: in #8803, the `enable_sp_async_reduce_scatter` option was added, effectively optimizing training performance.
- Dictionary argument support: in #8446, the trainer's argparser gained support for dictionary arguments, making argument passing more flexible. In #8904, the tensorboard requirement was also updated for compatibility with the latest version.
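A minimal loading sketch for the fast tokenizer described above. The `use_fast=True` switch and the checkpoint name are assumptions for illustration; check the tokenizer documentation of this release for the exact flag.

```python
from paddlenlp.transformers import AutoTokenizer

# Hedged sketch: use_fast=True is assumed to select the tokenizers-library
# backend added in #8832; the checkpoint name is only an example.
fast_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", use_fast=True)
slow_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "PaddleNLP makes large models easier to use."
print(fast_tok(text)["input_ids"])
print(slow_tok(text)["input_ids"])  # both backends should produce the same ids
```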
3. Bug fixes
- safetensors fix: #8702 fixed the safetensors shape issue.
- Windows mmap fix: #8734 fixed the mmap issue, improving Windows compatibility.
- Other bug fixes: including fixes in #8687, #8730, and several other pull requests.
4. Documentation and test updates
- Documentation: multiple pull requests updated documentation, cleaned up code style, and refreshed version information, keeping the docs accurate and readable.
- README fixes and enhancements: #8741 fixed broken links in the README; several contributors also updated the README and added new test cases, keeping documentation and code in sync.
5. Other important changes
Stronger support for domestic hardware
- DCU support: #8580 implemented high-performance LLM training and inference on DCU, broadening PaddleNLP's hardware coverage.
- XPU optimization: #8527 added LoRA optimizations for XPU; #8697 implemented allgather on XPU and #8710 fixed the unified checkpoint gather issue, further improving training efficiency on XPU.
PIR mode support
- Export and loading: #8689 changed how the llama model is exported in PIR mode; #8712 and #8766 added support for loading or saving Llama2-7b in three formats (legacy IR, PIR model file, PIR JSON file), giving users more flexibility and compatibility.
Auto-parallel optimization
- Configuration updates: #8679 changed `max_steps` in the Llama2-7b config to suit auto-parallel; #8767 and #8828 refined the auto trainer's save and load paths; #8750 updated the loss for global gradient clipping, further improving the efficiency and accuracy of auto-parallel training.
What's Changed
- [DCU] high performance LLM train and inference for DCU by @yuguo-Jack in #8580
- fix benchmark dir and add CUDA_DEVICE_MAX_CONNECTIONS to qwen by @fightfat in #8678
- bug fix by @wtmlon in #8687
- [XPU] add lora optimization by @dynamicheart in #8527
- [pir save] Modiy export llama model file in pir mode by @xiaoguoguo626807 in #8689
- [AutoParallel]Change `max_steps` in Llama2-7b config for auto-parallel. by @heavyrain-lzy in #8679
- [benchmark] Change the mirror source for pip by @mmglove in #8699
- update loss base of auto-parallel tests by @zhiqiu in #8701
- Add new mistral by @wtmlon in #7425
- [Safetensors] Fix safetensors shape by @DesmonDay in #8702
- [BUG] Round num_samples down to prevent prefetch from reading past the end of the dataset... by @JunnYu in #8690
- xpu use allgather by @FeixLiu in #8697
- add fast_rmsnorm by @deepllz in #8680
- enable use_fast_layer_norm for llama2 benchmark by @deepllz in #8714
- fix xpu gather for unified ckpt by @FeixLiu in #8710
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8712
- fix fast_ln backward by @deepllz in #8719
- finetune support use_fast_layer_norm by @tianhaodongbd in #8717
- bug fix by @FeixLiu in #8730
- disable lora by @lugimzzz in #8674
- [Safetensors] Fix mmap for Windows system by @DrownFish19 in #8734
- correct broken links in readme by @jzhang533 in #8741
- revert benchmark fix by @ronny1996 in #8747
- [LLM] Add Yuan model by @zhaogf01 in #8654
- fix nlp dir and auto_parallel_ci exit -6 by @fightfat in #8744
- [LLM] Update sequence parallel linear import by @DrownFish19 in #8706
- [Bug fixes] Fix ring attention by @zhangyuqin1998 in #8740
- update a100 loss by @zhiqiu in #8708
- [PaddleNLP 3.0] Update README by @DrownFish19 in #8681
- [AutoParallel] update loss for global clip by @JZ-LIANG in #8750
- [NPU] Fix sequence parallel lib import by @DrownFish19 in #8760
- [DEV] Update develop version show by @DrownFish19 in #8754
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8766
- add benchmark baichuan2 scripts by @fightfat in #8683
- Add the missing truncation=True in llm/predictor.py by @lszxb in #8768
- fix the ce for the unittest by @wawltor in #8772
- Enable parallel_config to use commas as delimiters. by @Difers in #8677
- fix incorrect token counting in `llm/predictor.py` by @lszxb in #8769
- Refine savable by @ZHUI in #8758
- [CodeStyle] remove markdownlint-cli by @DrownFish19 in #8779
- [XPU] use allgather and fp32 multinomial for XPU by @houj04 in #8787
- fix version show by @DrownFish19 in #8791
- [BUG] Add 20 redundant data in post pretrain by @JunnYu in #8789
- vera-pissa method added by @TranscenderNing in #8722
- update version by @DrownFish19 in #8792
- [Inference LLM] refine some code in llama wint8/4 by @yuanlehome in #8796
- [DCU] Llama a8w8 inference performance optimization by @Deleter-D in #8800
- [Prediction] Update LLM prediction. by @DesmonDay in #8778
- [Trainer] Add enable_sp_async_reduce_scatter by @DesmonDay in #8803
- [AutoParallel] Refine auto_trainer save load by @zhangbo9674 in #8767
- [MoE] Optimizer parameter broadcast by @DesmonDay in #8810
- [Doc] Update README by @DrownFish19 in #8817
- support Llama3.1 8B 128K generation on single GPU 80GB by @GuoxiaWang in #8811
- add paddle nv-embed-v1 by @Li-Z-Q in #8785
- fix pad_token_id bug by @yuanlehome in #8814
- [DCU] fix llama inference bug on DCU by @Deleter-D in #8815
- [Doc] Add LLaMA3.1 by @DrownFish19 in #8824
- [BUG] Fix build train valid test datasets by @JunnYu in #8826
- Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file by @Hanyonggong in #8799
- fix tune_cublaslt_gemm compile bug by @yuanlehome in #8844
- [AutoParallel] Refine save and load ckpt for auto_trainer by @zhangbo9674 in #8828
- [Unified Checkpoint] update merge tensor parallel by @DesmonDay in #8856
- [Trainer] update clear_grad by @DesmonDay in #8829
- [Unified Checkpoint] Fix tie_word_embeddings by @DesmonDay in #8795
- [Inference LLM] support static c8 by @yuanlehome in #8833
- support sft mapdataset by @greycooker in #8840
- Cherry pick some changes from incubate branch by @sneaxiy in #8862
- support nested list of dict inputs by @deepllz in #8876
- Fix the bug with issues code 8641. by @smallbenxiong in #8880
- Fix the issue of P-tuning official sample error by @guangyunms in #8884
- modify Paddlemix qwen dytostatic by @xiaoguoguo626807 in #8869
- [llm]fix zeropadding by @lugimzzz in #8895
- Fix the error reported by the fast_ln op when dynamic semi-auto parallel is enabled by @Wennie396 in #8891
- enable_sp_async_reduce_scatter for qwen_72b && llama2_70b by @deepllz in #8897
- Update run_pretrain.py by @...
v3.0.0-beta0
We are pleased to announce that the PaddlePaddle large model suite has released v3.0.0-beta: embrace large models with a fully upgraded experience. The main work is as follows:
- Unified the large model toolchain, enabling end-to-end support for domestic compute chips;
- Full support for industrial-grade large model workflows, including PaddlePaddle 4D parallel configuration, efficient fine-tuning strategies, efficient alignment algorithms, and high-performance inference;
- Self-developed, fast-converging RsLoRA+ algorithm, the Unified Checkpoint storage mechanism with automatic scaling, and generalized support for FastFFN and FusedQKV to accelerate large model training and inference;
- Continued support and updates for mainstream models, providing efficient solutions.
Large model fine-tuning, alignment, training, and inference optimization
- PEFT:
- DPO:
- Domestic chip support:
- Performance optimization:
- Other
  - Added model memory monitoring in #8269
New models
- Added the Gemma model in #8082
- google/gemma-7b
- google/gemma-7b-it
- google/gemma-2b
- google/gemma-2b-it
- Added the Llama3 models
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B
- meta-llama/Meta-Llama-3-70B-Instruct
- Added the Qwen2 series models in #8338 #8584 #8601 (see the loading sketch after the model list)
- Qwen/Qwen1.5-0.5B
- Qwen/Qwen1.5-0.5B-Chat
- Qwen/Qwen1.5-1.8B
- Qwen/Qwen1.5-1.8B-Chat
- Qwen/Qwen1.5-4B
- Qwen/Qwen1.5-4B-Chat
- Qwen/Qwen1.5-7B
- Qwen/Qwen1.5-7B-Chat
- Qwen/Qwen1.5-14B
- Qwen/Qwen1.5-14B-Chat
- Qwen/Qwen1.5-32B
- Qwen/Qwen1.5-32B-Chat
- Qwen/Qwen1.5-72B
- Qwen/Qwen1.5-72B-Chat
- Qwen/Qwen1.5-110B
- Qwen/Qwen1.5-110B-Chat
- Qwen/Qwen1.5-MoE-A2.7B
- Qwen/Qwen1.5-MoE-A2.7B-Chat
- Qwen/Qwen2-0.5B
- Qwen/Qwen2-0.5B-Instruct
- Qwen/Qwen2-1.5B
- Qwen/Qwen2-1.5B-Instruct
- Qwen/Qwen2-7B
- Qwen/Qwen2-7B-Instruct
- Qwen/Qwen2-72B
- Qwen/Qwen2-72B-Instruct
- Qwen/Qwen2-57B-A14B
- Qwen/Qwen2-57B-A14B-Instruct
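A minimal loading sketch for one of the checkpoints listed above, as referenced from the Qwen2 item. The `dtype` hint is an illustrative assumption; larger variants require multi-GPU or quantized deployment.

```python
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

# Any checkpoint name from the list above can be substituted here.
model_name = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# dtype is an optional, illustrative hint; adjust it to your hardware.
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="bfloat16")
```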
Framework upgrades
- Feature optimization:
- AutoParallel optimization
- Distributed capability optimization:
- Chat capability optimization:
  - Added chat templates in #8226 (see the sketch after this list)
- Other
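A hedged sketch of the chat-template support added in #8226: the tokenizer renders a conversation into the prompt format the model expects. The method name follows the common `apply_chat_template` convention; the exact signature and the checkpoint name are assumptions.

```python
from paddlenlp.transformers import AutoTokenizer

# Illustrative only: checkpoint name and exact signature are assumptions.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")

# Render a single user turn with the model's chat template applied.
rendered = tokenizer.apply_chat_template("Hello, who are you?")
print(rendered)
```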
Bug fixes
- Fixed the bug when the sharding degree is less than 100 in #8146
- Fixed TP/PP parameter merging in #8239
- Fixed the inconsistency between tensor.shape and paddle.shape(tensor) in #8260
- Fixed the bug with fp16 + delay_scale_loss_scale + sharding_stage1_overlap in #8314
- Added pipelines usage documentation and hints in #8292 #8308 #8202 #8353
- Fixed the tokenizer input for the text feature extraction task in #8331
- Fixed import errors in #8332 #8367
Restructuring
Adjusted the PaddleNLP file structure in #8609 #8613 #8605 #8614 #8617 #8626 #8618 #8625 #8619 #8629 #8601 #8627 #8666
What's Changed
- [dist]pip requirements-dev.txt by @Liujie0926 in #8258
- add scaling by @lugimzzz in #8256
- [LLM]Support Gemma model by @Southpika in #8082
- [BugFix] Try except sequence parallel utils by @DesmonDay in #8189
- Update CodeCov GitHub Action by @sijunhe in #8268
- [AutoParallel] Open recompute strategy for llama model by @zhangbo9674 in #8265
- Fix sharding < 100 limitation bug by @sneaxiy in #8146
- use tensor.shape bug not paddle.shape(tensor) by @wanghuancoder in #8260
- [dist CI]update paddlenlp install for CI by @Liujie0926 in #8267
- [Bug Fix]Fix merge parameters in pp by @Southpika in #8239
- [LLM] add memory stats to logger of trainer by @SylarTiaNII in #8269
- Add p2p_comm_overlap for Llama-2-70b benchmark. by @Xreki in #8276
- add a100 test ground truth by @zhiqiu in #8249
- [paddle-pipelines] faq semantic search question answering reamde by @w5688414 in #8292
- [paddle-pipelines] Add pipelines documentation by @w5688414 in #8308
- Support llama-3 by @ZHUI in #8307
- [Distributed] [CustomDevices] Adapt SP on lora && polish MC2 APIs by @SylarTiaNII in #8303
- fix bug for fp16 + delay_scale_loss_scale + sharding_stage1_overlap by @FeixLiu in #8314
- [paddle-pipelines] Update mkdocs by @w5688414 in #8310
- [benchmark]update llama2_ips by @Liujie0926 in #8322
- [dist CI]fix before_hook by @Liujie0926 in #8283
- benchmark llama worker=1 by @wanghuancoder in #8305
- 【AutoParallel】Add llama2 UT for auto-parallel by @heavyrain-lzy in #8300
- Add system env log for llama test by @zhangbo9674 in #8321
- [LLM] Support fuse attention q, k, v weights by @DrownFish19 in #8202
- [Distributed] fix lora by @SylarTiaNII in #8325
- fix try import by @w5688414 in https://github.com/PaddlePaddle/Pa...
v2.8.1
What's Changed
- [Trainer] Fix sharding overlap bug by @DesmonDay in #8334
- [Cherry-pick] update truncate by @KB-Ding in #8375
- [BugFix] Fix llama3 `eot_id`. by @ZHUI in #8373
- [Trainer] update distributed dataloader by @DesmonDay in #8426
- [BugFix] Fix load rng compatibility. by @ZHUI in #8451
- Cherry pick/fast_safe_open by @ZHUI in #8458
- 【cherry pick】adapter new type promotion rule for Paddle 2.6 by @zxcd in #8463
- Quick fix from pretrained. by @ZHUI in #8487
- Release/2.8 by @Galaxy1458 in #8437
- Fix from_pretrained `os.path.split` by @DesmonDay in #8508
- [fea] Cherry-picked MOE updates from develop by @bo-ke in #8531
- [LLM] relocate tensor_parallel_output to avoid conflict (#8419) by @DesmonDay in #8533
- Update sequence_parallel for predict by @DesmonDay in #8547
- Cp/fix by @ZHUI in #8569
- Do not save moe_group by @DesmonDay in #8570
- [Release] 2.8.1 by @ZHUI in #8636
Full Changelog: v2.8.0...v2.8.1
v2.8.0
We are pleased to announce that the PaddlePaddle large model suite has released v2.8.0. This release deeply optimizes the suite's fine-tuning and alignment capabilities and improves its training and inference on domestic compute hardware. The main work is as follows:
- Specialized fine-tuning and efficient alignment: the self-developed, fast-converging RsLoRA+ algorithm greatly improves PEFT convergence speed and training quality; high-performance generation acceleration is introduced into the RLHF PPO algorithm, removing the generation bottleneck in PPO training and delivering clearly leading PPO training performance.
- Faster large model training: generalized support for FastFFN, FusedQKV, and other training performance optimizations makes large model training faster and more stable.
Large model fine-tuning, alignment, training, and inference optimization
- Fine-tuning
- Inference
- Added static-graph inference for QWenVL #7808
New models
- Added static-graph inference for QWenVL #7808
- Added the Deberta and Debertav2 models #8227
- deepset/deberta-v3-large-squad2
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
- microsoft/deberta-base
- Added Mixtral (mixture of experts) #7803
- mistralai/Mixtral-8x7B-Instruct-v0.1
- mistralai/Mixtral-8x7B-v0.1
- Added Llama3 #8315
- meta-llama/Meta-llama-3-8b
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-llama-3-70b
- meta-llama/Meta-Llama-3-70B-Instruct
Framework upgrades
- Trainer upgrades
- AutoParallel upgrades
- Other
Other support
- Added the matryoshka representation learning retrieval strategy, saving compute and storage resources. #8165
Bug fixes
- Adjusted the log level and added timelog timing logs, compatible across devices. #8261
- Fixed inconsistent randomly initialized shared weights in pipeline parallelism, covering GPT/OPT and other models. #7772
- Disabled downloading from the huggingface hub in CI and unit tests #7798 #8198
- Fixed the llm Gradio demo repeatedly concatenating the query and history when chat templates are enabled. #7992
- Fixed the key error when downloading the GPT model. #8253
- Fixed LlamaRotaryEmbedding #7882
- Fixed the allreduce dtype issue #7876
- Fixed the issue caused by the framework dev branch cleaning up the paddle.jit.dy2static.utils_helper API #7989
- Fixed the read-data timer when ignore_data_skip=False and skip_profile_timer=False. #8177
- Fixed Wandb unit test issues #8066 #8056
- Fixed the error when the Trainer parses a JSON file and command-line list arguments at the same time #7860
- Fixed inference issues in the Gradio UI #7740 #7788
- Fixed basic Tokenizer issues #7797 #7870
- Fixed loading the rng state on custom devices. #7894
- Fixed garbled output when printing the BF16 loss in auto-parallel #7874
- Initialized models in float to fix the AMP error in static-graph auto-parallel #8033 #8199
- Fixed incorrect usage of the ShardDataloader interface under pipeline parallelism #8014
- Fixed the llama accuracy issue on custom devices. #7895
- Fixed the NPU AICPU operator issue #7976
- Fixed FusedLinearWithGradAdd missing arguments. #8178
What's Changed
- [Unified Checkpoint] Add unified checkpoint training args doc. by @DesmonDay in #7756
- [AutoParallel] Auto Trans PP to VPP by @zhaoyinglia in #7747
- Add codecov check by @zjjlivein in #7760
- [CE] Delete gpt_for_sequence_classification by @ZHUI in #7757
- [DOC] Update trainer.md by @ZHUI in #7761
- [Release] Change version to 2.7.0 by @ZHUI in #7764
- [benchmark]close skip_memory_metrics for ips by @Liujie0926 in #7732
- [Release] Update release.yml to release tags by @ZHUI in #7765
- [AutoParallel] Add Sequence Parallel for Static LLaMA by @JZ-LIANG in #7746
- [New Features] support dynamic src_length by @wj-Mcat in #7740
- Fix unified_checkpoint bug by @DrownFish19 in #7770
- [DONE] aistudio, hf hub, bos update download by @JunnYu in #7608
- [Trainer] Fix dist dataloader eval by @DesmonDay in #7777
- [Paddle-pipelines] Update convert_files_to_dicts_splitter by @w5688414 in #7748
- [PEFT]fix lora model tp when existing other trainable module by @lugimzzz in #7781
- [Paddle-Pipelines] update faiss by @qingzhong1 in #7793
- Fix shared weights sync for PipelineLayer by @DrownFish19 in #7772
- [tests] download slow by @JunnYu in #7798
- [INFER][LLM] Support qwen in fined grained dybatch v1 by @DanGuge in #7644
- Add CE for Distributed Hybrid Parallel by @iosmers in #7782
- add MP2-SP2-pp4-vpp2-SD2-stage1-mbs2-acc8 ce by @tianhaodongbd in #7774
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7806
- pipeline parallel benchmark by @zhangting2020 in #7759
- [Bug fixes] fix br gradio by @wj-Mcat in #7788
- delete useless code for write_cache_kv.cu by @yuanlehome in #7812
- [llm]support qlora pp by @lugimzzz in #7801
- Trainer support simultaneously parse JSON files and cmd arguments. by @greycooker in #7768
- [LLM] Support block_attention/cachekv quant for llama by @RichardWooSJTU in #7649
- [Bug Fix] fix paddle multipy_fwd_func warning message by @BeingGod in #7818
- [llm]fix lora by @lugimzzz in #7824
- fused rms spmd by @liuzhenhai93 in #7830
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7827
- [neural search][fix bug of evaluate.py] by @ZeyuTeng96 in #7832
- [neural search] fix the bug of reading files when calculating the recall scores by @shenghwa in #7836
- [Bug fixes] update chatglm tokenizer by @wj-Mcat in #7797
- [semantic_indexing] fix bug of evaluate.py by @ZeyuTeng96 in #7843
- [faq] fix bug of evaluate.py by @ZeyuTeng96 in #7840
- [text_classification_retrieval_based] fix bug of evaluate.py by @ZeyuTeng96 in #7844
- [LLM] add Qwen-7B-Chat to PaddleNLP unit test by @ziangqin-baidu in #7823
- Support 5.2 bloom by @zhoutianzi666 in #7846
- [unified checkpoint] Fix last checkpoint save by @DrownFish19 in #7854
- [unified checkpoint] fix checkpoint names by @DrownFish19 in #7795
- [New Features]add ranks testing for test_predictor by @wj-Mcat in #7800
- [Auto Parallel] Support dynamic semi-auto training in Llama2 model by @haohongxiang in #7851
- [CI] add ci approval pipelines by @zjjlivein in #7859
- [fix] fix a bug of trainer/argparser.py by @greycooker in #7860
- [Improvement] fix ops improting in utils by @wj-Mcat in #7865
- [Add CE] Add CE for Hybrid Parallism by @iosmers in #7817
- [Unified Checkpoint] Cherry pick empty cache. by @ZHUI in #7868
- Add PPO training. by @guoshengCS in #7305
- Update reward_main.py by @wawltor in #7880
- Update ppo_main.py by @wawltor in #7881
- [LLM] revert benchmark codes by @RichardWooSJTU in #7871
- [LLM]support QWenVL second part by @DanGuge in #7808
- [Bug Fixes] update chatglm1 tokenizer by @wj-Mcat in #7870
- 【AutoParallel】Support 'master_grad' in Llama in static auto-parallelism by @heavyrain-lzy in #7658
- [Bug Fix] fix slice bug in LlamaRotaryEmbedding by @MarioLulab in #7882
- 【AutoParallel】Support bf16 loss in static by @heavyrain-lzy in #7874
- [Bug Fix] fix allreduce tensor dtype by @BeingGod in #7876
- [CE] Add Qwen into CE process by @ziangqin-baidu in #7887
- [Hackathon 5th No.73] ToT by @ErnestinaQiu in #7660
- [CustomDevice] fix loading rng state on custom devices by @SylarTiaNII in #7894
- [LLM] ...
v2.7.2
This release fixes a number of minor issues.
What's Changed
- [Unified Checkpoint] fix checkpoint names by @DrownFish19 in #7794
- [Unified Checkpoint] Fix last checkpoint save by @DrownFish19 in #7810
- [PEFT] Cherry pick lora fix by @lugimzzz in #7826
- [Unified Checkpoint] Fix unified checkpoint by empty cache. by @ZHUI in #7855
- [Fix Download] update converted logic & fix hf hub download subfolder bug by @JunnYu in #7911
- [Cherry-pick] logger level by @KB-Ding in #7920
- [Cherry-pick] RuntimeTimer for the toolkit (#7913) by @KB-Ding in #7921
- [Release] 2.7.2 for paddlenlp bugfix. by @ZHUI in #7892
Full Changelog: v2.7.1...v2.7.2
v2.7.1
This release fixes a number of minor issues.
What's Changed
- Fixed several issues encountered when resuming training @ZHUI in #7771
- Fixed GPT initialization under pipeline mode @DrownFish19 in #7775
- Fixed a dist dataloader issue during evaluation. @DesmonDay in #7778
Full Changelog: v2.7.0...v2.7.1
PaddleNLP 2.7.0 Release Note
We are pleased to announce that the PaddlePaddle large model suite has released v2.7.0. This release deeply optimizes the suite's large model capabilities, with major improvements in usability, performance, and stability.
Overall, the highlights of this release are:
- Unified large model toolchain entry point. The implementation code for pre-training, fine-tuning, compression, inference, and deployment is consolidated under the PaddleNLP/llm directory.
- Brand-new large model toolchain documentation, guiding users all the way from getting started to production deployment. See: https://paddlenlp.readthedocs.io/zh/latest/llm/finetune.html
- Unified Checkpoint storage mechanism. Checkpoints store model weights, optimizer state, and other data in a unified safetensors format, independent of the distributed strategy, and support dynamic scaling when resuming training, greatly improving the generality of large model storage.
- Upgraded efficient fine-tuning: quantization can be combined with LoRA, and algorithms such as QLoRA are supported.
End-to-end large model training and inference
- Pre-training
- Unified the pre-training entry point to `llm/run_pretrain.py`.
- Supported pre-training for qwen and other models, with flash attention support.
- Fine-tuning
- Supported using LoRA together with Linear quantization (see the sketch after this list)
- Supported using pipeline-parallel models together with LoRA
- Supported the NEFTune method
- Added QLoRA support
- Compression
- Supported PTQ and QAT quantization, including A8W8, WINT8, WINT4, and A8W4
- Supported quantization algorithms such as SmoothQuant, GPTQ, and AWQ
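As referenced in the fine-tuning items above, a hedged sketch of wrapping a base model with LoRA via `paddlenlp.peft`; the target-module patterns, rank, and checkpoint name are illustrative assumptions rather than recommended settings.

```python
from paddlenlp.peft import LoRAConfig, LoRAModel
from paddlenlp.transformers import AutoModelForCausalLM

# Illustrative values only; choose target_modules to match the base model's layers.
model = AutoModelForCausalLM.from_pretrained("facebook/llama-7b")
lora_config = LoRAConfig(
    target_modules=[".*q_proj.*", ".*v_proj.*"],  # regex patterns of linear layers to adapt
    r=8,
    lora_alpha=16,
)
model = LoRAModel(model, lora_config)
model.mark_only_lora_as_trainable()  # freeze everything except the LoRA weights
```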
Unified Checkpoint
- In the large model setting, training usually runs across many cards in a distributed setup, so the model weights in a saved checkpoint are typically sharded, for example split according to tensor parallelism or pipeline parallelism. Storing checkpoints directly according to the distributed strategy is straightforward, but it has the following problems:
- It is unfriendly to downstream inference: to run inference from an intermediate checkpoint, users must merge the sharded weights manually.
- It copes poorly with resuming training when the distributed strategy or the number of training nodes changes; users often have to post-process the checkpoint by hand, which adds operational complexity.
- To address these problems as far as possible and lower the barrier for users, the large model storage framework has been upgraded with a unified storage scheme: Unified Checkpoint. Its core idea is to store model weights, optimizer state, and other data in a unified safetensors format, without distinguishing distributed strategies at save time, which improves the generality of large model checkpoints.
- Unified Checkpoint provides the following features:
- Weight storage is independent of the distributed strategy and uses the safetensors format uniformly;
- Flexibly supports scaling large model training up or down and adapts to switching between different distributed training strategies.
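A hedged illustration of enabling Unified Checkpoint through the Trainer's arguments; the `unified_checkpoint` field mirrors the `--unified_checkpoint` flag listed under the Trainer upgrades below, and the other values are placeholders.

```python
from paddlenlp.trainer import TrainingArguments

# Placeholder values; unified_checkpoint mirrors the CLI flag "--unified_checkpoint".
args = TrainingArguments(
    output_dir="./checkpoints/llama-sft",
    save_steps=500,
    unified_checkpoint=True,  # store weights/optimizer state as safetensors, strategy-agnostic
)
```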
New models
- Retrieval model moka-ai/m3e-base
- Retrieval model BAAI/bge-small-zh-v1.5
Framework upgrades
- Trainer upgrades
- With "--skip_memory_metrics 0", real-time GPU memory and host memory usage is displayed
- Supported "--unified_checkpoint" and "--unified_checkpoint_config" for saving models under hybrid parallelism and restarting with dynamic scaling.
- Added the PretrainModelPipe base class to support pipeline-parallel training.
Other support
- Exposed the paddlenlp commit id via `paddlenlp.version.commit`
- Supported AI Studio download and saving to the AI Studio hub
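For example, the commit id mentioned above can be read as follows (a small sketch; the attribute path is taken directly from the note above).

```python
import paddlenlp

# Print the exact commit the installed paddlenlp package was built from.
print(paddlenlp.version.commit)
```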
Bug fixes
- Fixed several dist_dataloader issues
- Fixed several dynamic-to-static conversion issues
- Fixed several GPT training bugs, removed GPT2, and fixed some seed-setting issues
- Fixed several issues with the baichuan model under pipeline parallelism.
New Contributors
- @Wennie396 made their first contribution in #6897
- @Wong4j made their first contribution in #7008
- @yuanlehome made their first contribution in #7080
- @Xreki made their first contribution in #7105
- @Tom-Zheng made their first contribution in #7092
- @TimeYWL made their first contribution in #7122
- @From00 made their first contribution in #7168
- @RichardWooSJTU made their first contribution in #7186
- @heavyrain-lzy made their first contribution in #7269
- @LokeZhou made their first contribution in #7337
- @JZ-LIANG made their first contribution in #7301
- @WAI-clear made their first contribution in #7402
- @tianhaodongbd made their first contribution in #7293
- @zzjjay made their first contribution in #7504
- @anexplore made their first contribution in #7558
- @niuliling123 made their first contribution in #7528
- @zxcd made their first contribution in #7577
- @MayYouBeProsperous made their first contribution in #7575
- @iosmers made their first contribution in #7613
- @AndSonder made their first contribution in #7343
- @zhink made their first contribution in #7679
- @kingTLE made their first contribution in #7708
Full Changelog: v2.6.1...v2.7.0
v2.6.1
What's Changed
v2.6.1 contains a large number of bug fixes that improve the stability of the LLM models and related components. Beyond bug fixes, the main new features are:
- LLM: added the qwen model, made the InTokens data flow compatible with pipeline parallelism, enabled LLM fine-tuning to load from multiple training files and warm-start, and enhanced LLaMA with different recompute granularities
- Trainer: added the hybrid_parallel_topo_order option and fixed model saving with sharding stage3.
- Paddle-pipelines: added support for ERNIE-Bot-turbo and ERNIE-embedding, updated the hierarchical search example, and enhanced the ChatPaper UI
- Megatron datasets: added support for loading megatron-format datasets, covering the ernie-1.0 and T5 data types
New Contributors
- @xiezheng-XD made their first contribution in #6764
- @carryyu made their first contribution in #6676
- @xiaoxiaohehe001 made their first contribution in #6798
- @MARD1NO made their first contribution in #6865
- @zhoutianzi666 made their first contribution in #6905
- @lchdl made their first contribution in #6964
- @LaiXinyi823 made their first contribution in #6659
Full Changelog: v2.6.0...v2.6.1
v2.6.0
PaddleNLP 2.6 official release: fully upgraded, stepping into the era of large models!
We are pleased to announce that PaddleNLP 2.6 has been fully upgraded and officially released! This upgrade marks our formal entry into the era of large models. PaddleNLP 2.6 introduces a brand-new, end-to-end PaddlePaddle large language model toolchain covering pre-training, fine-tuning, compression, inference, and deployment, providing users with a complete large model solution.
The toolchain fully supports mainstream large models such as LLaMA 1/2, BLOOM, ChatGLM 1/2, GLM, and OPT, so users can try different large models at low cost with a single set of tools.
To support this toolchain, we made extensive upgrades on the underlying infrastructure and framework side:
- The Trainer API was upgraded into a 4D-parallel distributed Trainer, making model training more efficient.
- The efficient fine-tuning algorithms LoRA and Prefix Tuning were implemented, enabling fine-tuning of hundred-billion-parameter models on a single machine.
- Building on PaddleSlim's self-developed quantization algorithms, lossless quantization is implemented across all supported large models.
These upgrades are all aimed at letting users train, optimize, and deploy models more easily in the era of large models. We look forward to your trials and feedback as we advance PaddleNLP together. From 2.5 to 2.6, PaddleNLP gained 40 new contributors; thank you all for supporting PaddleNLP's open-source work!
New Contributors
- @zws-2019 made their first contribution in #5167
- @qiuwenbogdut made their first contribution in #5098
- @kuizhiqing made their first contribution in #5347
- @46319943 made their first contribution in #5419
- @jiaohuix made their first contribution in #5465
- @kangguangli made their first contribution in #5438
- @vivienfanghuagood made their first contribution in #5563
- @zhiboniu made their first contribution in #5470
- @cyber-pioneer made their first contribution in #5598
- @invokerbyxv made their first contribution in #5622
- @megemini made their first contribution in #5658
- @zhenyun-li made their first contribution in #5683
- @solrex made their first contribution in #5736
- @nemonameless made their first contribution in #5487
- @Yulv-git made their first contribution in #5709
- @wangxinxin08 made their first contribution in #5773
- @AlphaHinex made their first contribution in #5815
- @houj04 made their first contribution in #5820
- @Joker1718 made their first contribution in #5816
- @pkuzyc made their first contribution in #5538
- @jadepeng made their first contribution in #5841
- @KB-Ding made their first contribution in #5886
- @parap1uie-s made their first contribution in #5775
- @zirui made their first contribution in #5866
- @GOH-Gu made their first contribution in #5951
- @yangjianfengo1 made their first contribution in #6069
- @zhangting2020 made their first contribution in #5922
- @rogerserper made their first contribution in #6192
- @wtmlon made their first contribution in #6258
- @qingzhong1 made their first contribution in #6251
- @BeingGod made their first contribution in #6307
- @zhiqiu made their first contribution in #6347
- @DesmonDay made their first contribution in #6435
- @cyk1337 made their first contribution in #6447
- @lxp521125 made their first contribution in #6491
- @littsk made their first contribution in #6425
- @RachelXu7 made their first contribution in #6572
- @wanghuancoder made their first contribution in #6539
- @DrownFish19 made their first contribution in #6570
- @GhostScreaming made their first contribution in #6673
Full Changelog: v2.5.2...v2.6.0