Hello, according to the SageMaker doc, "When using data parallelism with DDP, the …". So what's the truth/best practice here? Thanks.
Hi,
The example you are looking at uses PyTorch's native DistributedDataParallel module.
The SageMaker doc you looked at is about SageMaker's model parallel library. The doc is saying that the model parallel library does not support PyTorch's native DistributedDataParallel module. Instead, you can use SageMaker's data parallel library, which is compatible with the model parallel library.
The SageMaker data parallel library provides DDP modules equivalent to torch.nn.parallel.DistributedDataParallel. SageMaker's distributed data parallelism modules, and how to adapt your PyTorch training scripts to use them, are documented at https://docs.aws.amazon.com/sagemaker/latest/dg/data…
The two SageMaker distributed training libraries are benchmark-tested and can help improve your model training on AWS infrastructure. Are you looking for an example of running a PyTorch training job with model parallelism?
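For reference, here is a minimal sketch of what that adaptation typically looks like, assuming the smdistributed.dataparallel package that ships in SageMaker's GPU training containers (exact module paths can differ between library versions, so check the docs linked above for yours):

```python
# Minimal sketch: swapping native PyTorch DDP for SageMaker's data parallel
# library. Assumes the smdistributed.dataparallel package available inside
# SageMaker training containers.
import torch
import torch.nn as nn

# SageMaker's replacements for torch.distributed and
# torch.nn.parallel.DistributedDataParallel.
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel.distributed import (
    DistributedDataParallel as DDP,
)

dist.init_process_group()  # initialize the SageMaker data parallel backend

# Pin each process to its local GPU, as you would with native DDP.
local_rank = dist.get_local_rank()
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")

# Wrap the model exactly as you would with the native DDP module;
# the rest of the training loop stays unchanged.
model = DDP(nn.Linear(10, 1).to(device))
```

The rest of a native DDP script (DistributedSampler, loss, optimizer step) carries over unchanged; only the import and initialization lines differ.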