
【auto_parallel】Add checkpoint convertor #8847

Merged 31 commits into PaddlePaddle:develop on Aug 22, 2024

Conversation

@xingmingyyj (Contributor) commented Jul 31, 2024

PR types

New features

PR changes

Others

Description

Add a checkpoint conversion module that automatically loads distributed checkpoints saved in hybrid parallel mode into the auto parallel state_dict, in place.

Usage

First construct a CheckpointConverter object, passing it the path of the distributed checkpoint saved in hybrid parallel mode, the auto parallel state_dict into which the checkpoint will be loaded in place, and the _parameter_to_structured_name mapping that records the correspondence from optimizer state names to model parameter names. Then call the load_from_hybrid_parallel_checkpoint function to complete loading of the checkpoint saved in hybrid parallel mode.

```python
CheckpointConverter(
    resume_from_checkpoint, state_dict, parameter_to_structured_name
).load_from_hybrid_parallel_checkpoint()
```
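For illustration, here is a minimal sketch of what the three constructor arguments might look like. The checkpoint path, parameter names, and optimizer-state key naming below are hypothetical examples, not taken from this PR:

```python
# Hypothetical shapes of the three CheckpointConverter arguments.
# All concrete names below are illustrative assumptions.

# Path of the checkpoint directory saved under hybrid parallel mode.
resume_from_checkpoint = "./checkpoints/checkpoint-1000"

# Auto parallel state_dict to be filled in place; keys are structured
# parameter names, optionally suffixed with an optimizer state name.
state_dict = {
    "llama.embed_tokens.weight": None,
    "llama.embed_tokens.weight.moment1": None,
}

# Mapping from flat parameter names to the structured names used above.
parameter_to_structured_name = {
    "embedding_0.w_0": "llama.embed_tokens.weight",
}
```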

Supported cases for automatic weight conversion from dynamic hybrid parallel to static semi-auto parallel

The following dynamic hybrid parallel strategies are supported: dp, mp, pp, sharding_stage1/2_v1, sharding_stage1_v2, sharding_stage3, mp+pp+vpp, mp+sharding_stage1_v1, mp+pp+sharding_stage1_v1, and mp+pp+sharding_stage1_v2 (save_sharded_model=true required).

Execution flow of dynamic hybrid -> static semi-auto weight conversion

Taking the mp2-pp2-sharding2 stage1 v2 mode as an example, this section describes the main workflow of the checkpoint conversion module. In this parallel mode, the main problem to solve is that the optimizer states are flattened and then sharded across different cards; moreover, model_state is not saved in this mode, so the weights must be cast from master_weight. The converter first needs to reassemble the tensors split by sharding, then reshape them, and finally reassemble the tensors split by tp. After that it generates the metadata. The main flow is as follows:

(Figure: workflow diagram, 202408131535)
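The steps above can be sketched in plain NumPy. This is a simplified illustration under assumed shapes and naming, not the actual converter code: two sharding ranks each hold a flattened slice of a tp shard's optimizer state; the slices are concatenated back into one flat tensor, reshaped to the tp-local shape, and the tp shards are then concatenated along the split axis, with a final dtype cast standing in for the master_weight-to-parameter cast.

```python
import numpy as np

# Assumed setup (illustrative only): a [4, 8] fp32 master weight is split
# column-wise by tp2 into two [4, 4] shards; each tp shard's optimizer
# state is flattened and split between two sharding ranks.
full = np.arange(32, dtype=np.float32).reshape(4, 8)
tp_shards = np.split(full, 2, axis=1)        # tp2 split -> two [4, 4] shards
flat_pieces = {
    tp: np.split(shard.reshape(-1), 2)       # sharding2 split of the flat state
    for tp, shard in enumerate(tp_shards)
}

# Step 1: concatenate the slices produced by sharding back into one flat tensor.
flat_per_tp = [np.concatenate(flat_pieces[tp]) for tp in range(2)]

# Step 2: reshape each flat tensor back to its tp-local shape.
local_per_tp = [flat.reshape(4, 4) for flat in flat_per_tp]

# Step 3: concatenate the tp shards along the original split axis.
restored = np.concatenate(local_per_tp, axis=1)

# Step 4: cast (stand-in for casting master_weight to the parameter dtype).
restored_fp16 = restored.astype(np.float16)

assert np.array_equal(restored, full)
```

The merge order matters: the sharding split operates on the flattened per-tp-rank buffer, so it must be undone (and the buffer reshaped) before the tp concatenation can be applied.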

Precision comparison

  • Distributed strategy alignment

| Distributed strategy | Dynamic hybrid loss | Static semi-auto loss (loaded from dynamic ckpt) |
| --- | --- | --- |
| Llama2 pp4-mp2 | step3: 11.97313595<br>step4: 11.94781113<br>step5: 11.82280064 | step3: 11.97313595<br>step4: 11.9477644<br>step5: 11.82279968 |
| Llama2 pp2-mp2-sharding2 stage1 v1 save_sharded_model | step3: 11.96802998<br>step4: 11.93180943<br>step5: 11.82863617 | step3: 11.96803093<br>step4: 11.93180943<br>step5: 11.82863617 |
| Llama2 pp2-mp2-sharding2 stage1 v1 | step3: 11.96802998<br>step4: 11.93180943<br>step5: 11.82863617 | step3: 11.96803093<br>step4: 11.93180943<br>step5: 11.82863617 |
| Llama2 pp2-mp2-sharding2 stage1 v2 save_sharded_model | step3: 11.96802998<br>step4: 11.93180943<br>step5: 11.82863617 | step3: 11.96803093<br>step4: 11.93180943<br>step5: 11.82863617 |
| Llama2 mp4-sharding2 stage2 v1 | step3: 12.13635731<br>step4: 11.96230888<br>step5: 11.86537361 | step3: 12.13635635<br>step4: 11.96230888<br>step5: 11.8653736 |
| Llama2 sharding stage3 | step3: 10.98984146<br>step4: 10.68730164<br>step5: 10.54143333 | step3: 10.98984146<br>step4: 10.68730164<br>step5: 10.54143238 |
  • Loading with a changed distributed strategy

| Configuration (dynamic config, then static semi-auto config) | Static semi-auto loss (warm start at step3) |
| --- | --- |
| sharding2 stage1 v2 mp2 pp2 | 12.05957413<br>11.97954369<br>11.91608047 |
| sharding4 stage1 pp2 | 12.02306652<br>11.99548531<br>11.93254852 |
| sharding2 stage1 pp2 | 12.0893755<br>12.00198746<br>11.96591187 |
| sharding2 stage2 pp2 | 12.0893755<br>12.00198746<br>11.96591187 |
| sharding2 stage1 mp2 pp2 | 12.05957413<br>11.97954369<br>11.91608047 |
| sharding2 stage1 mp4 | 11.88536358<br>11.99544525<br>11.96228504 |
| sharding2 stage1 mp2 | 12.05957413<br>11.97954369<br>11.91608047 |
| sharding2 stage1 mp2 pp2 | 12.05957413<br>11.97954369<br>11.91608047 |
| mp2 pp4 | 11.98708057<br>11.96553135<br>11.87691593 |
| mp4 pp2 | 11.95367813<br>11.97389889<br>11.93051529 |
| mp2 pp4 | 11.98708057<br>11.96553135<br>11.87691498 |
| mp2 pp2 | 11.98708057<br>12.00613785<br>11.96413231 |
| pp2 | 12.00666428<br>11.99377346<br>11.83226585 |


paddle-bot bot commented Jul 31, 2024

Thanks for your contribution!


codecov bot commented Jul 31, 2024

Codecov Report

Attention: Patch coverage is 0% with 776 lines in your changes missing coverage. Please review.

Project coverage is 54.37%. Comparing base (4ebec1d) to head (dfb16b7).
Report is 43 commits behind head on develop.

Files Patch % Lines
paddlenlp/trainer/utils/ckpt_converter.py 0.00% 755 Missing ⚠️
paddlenlp/trainer/auto_trainer.py 0.00% 21 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8847      +/-   ##
===========================================
- Coverage    55.29%   54.37%   -0.93%     
===========================================
  Files          631      648      +17     
  Lines        98888   103828    +4940     
===========================================
+ Hits         54681    56457    +1776     
- Misses       44207    47371    +3164     

☔ View full report in Codecov by Sentry.

@xingmingyyj xingmingyyj changed the title convert_checkpoint 【auto_parallel】Adding checkpoint convertor Aug 7, 2024
@xingmingyyj xingmingyyj changed the title 【auto_parallel】Adding checkpoint convertor 【auto_parallel】Add checkpoint convertor Aug 7, 2024
@zhiqiu (Collaborator) previously approved these changes Aug 15, 2024:

LGTM
@ZHUI (Collaborator) left a comment:

LGTM

@ZHUI ZHUI merged commit aaacb32 into PaddlePaddle:develop Aug 22, 2024
9 of 12 checks passed
Mangodadada pushed a commit to Mangodadada/PaddleNLP that referenced this pull request Sep 10, 2024