Hello, according to the SageMaker doc, "When using data parallelism with DDP, the …". So what's the truth/best practice here? Thanks.
Hi,
The example you are looking at uses PyTorch's native DistributedDataParallel module.
The SageMaker doc you looked at is about SageMaker's model parallel library. The doc is saying that the model parallel library does not support PyTorch's native DistributedDataParallel module. Instead, you can use SageMaker's data parallel library, which is compatible with the model parallel library.
The SageMaker data parallel library provides DDP modules equivalent to torch.nn.parallel.DistributedDataParallel. SageMaker's distributed data parallelism modules, and how to adapt your PyTorch training scripts to use them, are documented at https://docs.aws.amazon.com/sagemaker/latest/dg/data…
The two SageMaker distributed training libraries are benchmark-tested and can help improve your model training on AWS infrastructure. Are you looking for an example of running a PyTorch training job with model parallelism?
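For reference, here is a minimal sketch of what that adaptation typically looks like, assuming the smdistributed.dataparallel package that ships in SageMaker's GPU training containers (exact module paths can differ between library versions, so check the docs linked above for yours):

```python
# Minimal sketch: swapping native PyTorch DDP for SageMaker's data parallel
# library. Assumes the smdistributed.dataparallel package available inside
# SageMaker training containers.
import torch
import torch.nn as nn

# SageMaker's replacements for torch.distributed and
# torch.nn.parallel.DistributedDataParallel.
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel.distributed import (
    DistributedDataParallel as DDP,
)

dist.init_process_group()  # initialize the SageMaker data parallel backend

# Pin each process to its local GPU, as you would with native DDP.
local_rank = dist.get_local_rank()
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")

# Wrap the model exactly as you would with the native DDP module;
# the rest of the training loop stays unchanged.
model = DDP(nn.Linear(10, 1).to(device))
```

The rest of a native DDP script (DistributedSampler, loss, optimizer step) carries over unchanged; only the import and initialization lines differ.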