
[Sharding] add new features #28568

Merged: 67 commits merged into PaddlePaddle:develop on Nov 18, 2020
Conversation

JZ-LIANG (Contributor) commented on Nov 12, 2020

PR types

New features

PR changes

OPs

Describe

  1. Bug fixes
    op role
    add_sync_comm_for_test: fixed a bug when cloning the test program
  2. New features
    comm_analyse: calculates the communication volume of a program (a rough sketch of the idea follows this list)
    sharding_save_persistables: saves a complete sharding model
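
The sketch below is only a rough, hypothetical illustration of what "communication volume" means here (walking a program and tallying the bytes moved by collective ops); it is not the implementation added in this PR, and the op names and dtype table are assumptions.

import numpy as np
import paddle.fluid as fluid

def rough_comm_volume_mb(program):
    # Sum the approximate size (in MB) of every variable touched by a
    # collective communication op in the main block. Dims of -1 are skipped.
    dtype_bytes = {
        fluid.core.VarDesc.VarType.FP32: 4,
        fluid.core.VarDesc.VarType.FP16: 2,
    }
    block = program.global_block()
    total_bytes = 0
    for op in block.ops:
        if op.type in ("c_broadcast", "c_allreduce_sum"):
            for name in op.input_arg_names:
                var = block.vars.get(name)
                if var is None:
                    continue
                numel = int(np.prod([d for d in var.shape if d > 0]))
                total_bytes += numel * dtype_bytes.get(var.dtype, 4)
    return total_bytes / (1024.0 * 1024.0)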

usage:

import os
import paddle
import paddle.fluid as fluid
import paddle.distributed.fleet as fleet
import fleetx as X
from paddle.distributed.fleet.meta_optimizers.sharding.utils import add_sync_comm_for_test, sharding_save_persistables, comm_analyse

# initialize fleet for collective training before building the distributed optimizer
fleet.init(is_collective=True)

dist_strategy = fleet.DistributedStrategy()
dist_strategy.sharding = True
dist_strategy.sharding_configs = {
    "fuse_broadcast_MB": 32,
}

# `args` and `lr` come from the user's own argument parsing (omitted here)
model = X.applications.Resnet50(data_layout=args.data_layout)
optimizer = fluid.optimizer.Momentum(
    learning_rate=lr,
    momentum=args.momentum,
    regularization=fluid.regularizer.L2Decay(args.weight_decay))
optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
optimizer.minimize(model.loss)

# Clone the test program
# when using sharding, the test program must be cloned after optimizer.minimize(model.loss)
model.test_prog = model.main_prog.clone(for_test=True)
add_sync_comm_for_test(model.test_prog, dist_strategy)

place = fluid.CUDAPlace(int(os.environ.get('FLAGS_selected_gpus', 0)))
exe = fluid.Executor(place)
exe.run(model.startup_prog)

# Analyse the communication volume of the main program
comm_analyse(fluid.default_main_program())

# Load model
# the original load_persistables can be used for a sharding model;
# make sure dirname contains the param files of every rank
dirname="/path/to/load_model"  
paddle.fluid.io.load_persistables(exe, dirname, main_program=model.main_prog, filename=None)

# Training
for epoch in range(10):
    ...  # training loop body omitted

# Save model
# every rank must call the following function to save a complete sharding model,
# unlike data parallelism where only rank 0 handles the model saving
dirname="/path/to/save_model"  
sharding_save_persistables(exe, dirname, main_program=model.main_prog, filename=None)
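
For completeness, the following is a minimal, hypothetical sketch of how the programs above might be driven; `train_batches` and `test_batches` are placeholder iterables of feed dicts and are not part of this PR.

# Hypothetical driver loop; `train_batches` / `test_batches` are user-supplied
for epoch in range(10):
    for feed in train_batches:
        loss_val, = exe.run(model.main_prog, feed=feed,
                            fetch_list=[model.loss.name])

    # the cloned test program can be run directly: add_sync_comm_for_test has
    # already inserted the synchronization ops that sharding requires
    for feed in test_batches:
        exe.run(model.test_prog, feed=feed, fetch_list=[])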

@paddle-bot-old commented:
Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@fuyinno4 fuyinno4 merged commit 5a9f688 into PaddlePaddle:develop Nov 18, 2020
A reviewer (Contributor) commented on the diff at

def sharding_save_persistables(exe, dirname, main_program, filename=None):

change the name