【auto_parallel】Add checkpoint convertor #8847
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #8847 +/- ##
===========================================
- Coverage 55.29% 54.37% -0.93%
===========================================
Files 631 648 +17
Lines 98888 103828 +4940
===========================================
+ Hits 54681 56457 +1776
- Misses 44207 47371 +3164
5fc66c2 to 280b13f
LGTM
LGTM
* Add the checkpoint conversion module
PR types
New features
PR changes
Others
Description
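Add a checkpoint conversion module that automatically loads, in place, a distributed checkpoint saved in hybrid parallel mode into an auto parallel state_dict.
Usage
First construct a CheckpointConvertor object, giving it the path of the distributed checkpoint saved in hybrid parallel mode, the auto parallel state_dict that will be loaded in place, and the _parameter_to_structured_name mapping that records optimizer state names to model parameter names. Then call load_from_hybrid_parallel_checkpoint to load the checkpoint that was saved in hybrid parallel mode.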
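Support status of automatic weight conversion: dynamic-graph manual parallel -> static-graph semi-auto parallel
Execution flow of automatic weight conversion: dynamic-graph manual parallel -> static-graph semi-auto parallel
Taking the mp2-pp2-sharding2 stage1 v2 mode as an example, this section describes the main workflow of the checkpoint conversion module. Two problems have to be solved in this parallel mode: the optimizer states are flattened and then split across different cards, and model_state is not saved, so it has to be obtained by casting master_weight. The convertor first concatenates the tensors that were split by sharding, then reshapes them, and finally concatenates the tensors that were split by tp. After that it generates the metadata. The main flow is as follows: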
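The concatenate, reshape, concatenate order described above can be illustrated with a toy example. This is only a sketch under simplifying assumptions, not the module's implementation: plain Python lists stand in for tensors, tp is assumed to split along columns, sharding is assumed to split the flattened tensor into equal contiguous slices with no padding, and every helper name (`split_columns`, `split_flat`, `restore`, and so on) is hypothetical.

```python
def split_columns(matrix, parts):
    """Split a 2-D list column-wise into `parts` equal blocks (tp-style split)."""
    width = len(matrix[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in matrix] for i in range(parts)]


def flatten(matrix):
    """Flatten a 2-D list row by row."""
    return [x for row in matrix for x in row]


def split_flat(flat, parts):
    """Split a flat list into `parts` equal contiguous slices (sharding-style split)."""
    size = len(flat) // parts
    return [flat[i * size:(i + 1) * size] for i in range(parts)]


def reshape(flat, rows, cols):
    """Turn a flat list back into a rows x cols 2-D list."""
    assert len(flat) == rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def concat_columns(blocks):
    """Concatenate 2-D blocks along the column axis."""
    return [sum((block[r] for block in blocks), []) for r in range(len(blocks[0]))]


def restore(shards, rows, cols_per_tp):
    """shards[tp_rank][sharding_rank] holds one flat slice; rebuild the full matrix."""
    tp_blocks = []
    for sharding_slices in shards:
        flat = [x for piece in sharding_slices for x in piece]  # 1. undo the sharding split
        tp_blocks.append(reshape(flat, rows, cols_per_tp))       # 2. reshape to the tp-local shape
    return concat_columns(tp_blocks)                             # 3. undo the tp split


# Simulate saving: split by tp (2 ways), flatten, then split by sharding (2 ways).
full = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
shards = [split_flat(flatten(block), 2) for block in split_columns(full, 2)]

restored = restore(shards, rows=4, cols_per_tp=3)
print(restored == full)
```

Reversing the splits in the opposite order to how they were applied is what makes the round trip exact; swapping steps 1 and 3 would interleave values from different tp blocks.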
Precision comparison
step4: 11.94781113, step5: 11.82280064
step4: 11.9477644, step5: 11.82279968
step4: 11.93180943, step5: 11.82863617
step4: 11.93180943, step5: 11.82863617
step4: 11.93180943, step5: 11.82863617
step4: 11.93180943, step5: 11.82863617
step4: 11.93180943, step5: 11.82863617
step4: 11.93180943, step5: 11.82863617
step4: 11.96230888, step5: 11.86537361
step4: 11.96230888, step5: 11.8653736
step4: 10.68730164, step5: 10.54143333
step4: 10.68730164, step5: 10.54143238
11.97954369
11.91608047
11.99548531
11.93254852
12.00198746
11.96591187
12.00198746
11.96591187
11.97954369
11.91608047
11.99544525
11.96228504
11.97954369
11.91608047
11.97954369
11.91608047
11.96553135
11.87691593
11.97389889
11.93051529
11.96553135
11.87691498
12.00613785
11.96413231
11.99377346
11.83226585
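One way to read paired loss values like these is to compute their largest relative difference. The sketch below assumes that the first two step4/step5 pairs in the list come from the two runs being compared; the original table headers are not preserved here, so that pairing is an assumption, and `max_relative_diff` is a hypothetical helper name.

```python
def max_relative_diff(reference, candidate):
    """Largest |a - b| / |a| over paired loss values."""
    return max(abs(a - b) / abs(a) for a, b in zip(reference, candidate))


# Assumed pairing: step4 and step5 losses of two runs (headers not preserved).
run_a = [11.94781113, 11.82280064]
run_b = [11.9477644, 11.82279968]

diff = max_relative_diff(run_a, run_b)
print(f"max relative difference: {diff:.2e}")
```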