How to implement model parallelism using PyTorch in an HPC environment? #896
Hello,

I am trying to implement model parallelism using PyTorch in my HPC environment, which has 4 GPUs available. My goal is to split a neural network model across these GPUs to improve training efficiency.

Here's what I've tried so far:

- Followed the PyTorch documentation on model parallelism
- Implemented a basic split of the model across GPUs

However, I am encountering performance bottlenecks and underutilization of the GPUs. Can someone guide me on how to implement this in my HPC setup? Any advice or pointers to resources would be greatly appreciated!

Comments
Hi @Akshara211, I'm not that familiar with this topic, but I'm curious: what is the motivation for doing it? What model configuration are you using?
Hi @qubvel, suggestions from various sources recommend implementing model parallelism by distributing the model across different GPUs. How are you training the ResNet-152 model, and what GPU specifications are you using for it? I have 4 GPUs available.
I would probably recommend using lower precision training and gradient accumulation. All these things are really easy to incorporate with PyTorch Lightning, just with a few flags provided to the Trainer class.
Thank you for the suggestions! We are already using Distributed Data Parallel (DDP), so data parallelism is applied. However, we are still facing issues with ResNet152. Our HPC GPU specification is 4x NVIDIA A100-SXM4-40GB. Could you please advise on the specific flags to use for enabling lower precision (float16/bfloat16) and gradient accumulation in PyTorch Lightning? Is there anything else I could do? Any other suggestions from your side would be appreciated. Thank you!
- Lower precision training
- Gradient accumulation
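Both map directly to Trainer flags. A minimal sketch (the values here are just examples, and `model` is your own LightningModule):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    precision="bf16-mixed",      # or "16-mixed" for float16 autocast
    accumulate_grad_batches=4,   # optimizer step every 4 batches
)
# trainer.fit(model)
```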
How do I implement distributed data parallelism (for multi-GPU training)?
@plo97 You can use PyTorch Lightning for that too. All you need is:

```python
# train on 8 GPUs (same machine, i.e. node)
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp")
```

See how to train an SMP model with PyTorch Lightning here:
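The gist of it: wrap the SMP model in a LightningModule and let the Trainer handle the distribution. A minimal sketch (the loss and optimizer choices here are just examples):

```python
import pytorch_lightning as pl
import segmentation_models_pytorch as smp
import torch

class SegModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = smp.Unet("resnet34", in_channels=1, classes=1)
        self.loss_fn = smp.losses.DiceLoss(mode="binary")

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self.model(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# the Trainer then handles device placement and DDP:
# trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp")
# trainer.fit(SegModel(), train_dataloaders=...)  # your DataLoader here
```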
Hi @qubvel Additionally, will using these techniques cause any change in accuracy compared to normal training? Here is my trainer configuration:

```python
trainer = pl.Trainer(
    accelerator="gpu",
    max_epochs=epochs,
    callbacks=[checkpoint_callback, early_stopping_callback],
    devices=4,
    strategy="ddp",
    precision="bf16-mixed",
    accumulate_grad_batches=4,
)
```
Hi @Akshara211, sorry for the late response, I probably missed the notification. Were you able to solve your problem?

Regarding bfloat16: it depends on your GPU; for some GPUs it will be faster, while for others it might be slower. Gradient accumulation should not slow down your training.

It depends on careful setup, but in general these techniques should not change the accuracy. However, you might want to set up sync batchnorm in case you have a small batch size per GPU (alternatively, you can freeze the batchnorms).
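For reference, converting a model to sync batchnorm is a one-liner in plain PyTorch, applied before wrapping the model in DDP (the `unet` name is just an example); in Lightning, the Trainer has an equivalent `sync_batchnorm=True` flag:

```python
import torch

# Replace every BatchNorm layer with SyncBatchNorm so statistics
# are computed across all DDP processes, not per GPU.
unet = torch.nn.SyncBatchNorm.convert_sync_batchnorm(unet)
```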
As far as I know, no library provides automatic tools for model parallelism (fastai, PyTorch, TensorFlow...); it's up to you to divide the model, send the appropriate layers to each GPU, and synchronize. For the specific case of smp, I have been able to do both data and model parallelism using the Unet with resnet34 as the encoder. I was also able to test it with 2 GPUs on the same machine, but only briefly. Both were launched using torchrun. About the execution time you mentioned: with DP, training time should be reduced; with MP, it should be similar (if your GPUs are in the same machine) or higher if (as in my case) you have a distributed environment, due to the communication overhead.
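For reference, the torchrun invocations look roughly like this (script name and addresses are placeholders):

```bash
# single node, 2 GPUs: one process per GPU
torchrun --nproc_per_node=2 train_ddp.py

# two nodes, 1 GPU each: run on every node, changing --node_rank
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
    --master_addr=10.0.0.1 --master_port=29500 train_ddp.py
```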
Hi @Patataman! Thanks a lot for sharing your experience. If you have time, would it be possible to share any code examples of how this can be implemented? I would appreciate any details and contributions to make this question clearer for the community, thanks!
I think I can arrange a minimal example for DP and MP (the latter for when the GPUs are on the same node). On the other hand, distributed MP requires more changes and knowledge about the library (in this case RPC), but I can share an example with pseudocode that should help to understand how to do it.
Sounds great!
This example would be for data parallelism:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP  # For DP
from torch.utils.data.sampler import SubsetRandomSampler
from tqdm import tqdm

import segmentation_models_pytorch as smp

TRAINSPLIT = 0.8
VALIDSPLIT = 0.2

if __name__ == "__main__":
    rank = int(os.environ.get('RANK', 0))
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    # torchrun also sets LOCAL_RANK; each process should drive its own GPU
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    DEVICE = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    """ Preprocess paths
    """
    unet = smp.Unet(
        encoder_name="resnet34",
        in_channels=1,
        classes=1,
        activation="sigmoid"
    ).to(DEVICE)
    unet = DDP(unet, device_ids=[local_rank])

    lr = 0.001
    batch_size = 4
    n_epochs = 20
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    loss = smp.losses.MCCLoss()

    # Random image tensors and binary labels
    n_samples = 100
    image_size = 128
    n_channels = 1
    X = torch.randn(n_samples, n_channels, image_size, image_size)
    y = torch.randint(0, 2, (n_samples, image_size, image_size))  # upper bound is exclusive
    train_data = list(zip(X, y))

    # https://stackoverflow.com/questions/50544730/how-do-i-split-a-custom-dataset-into-training-and-test-datasets
    train_subset, valid_subset = torch.utils.data.random_split(train_data, [TRAINSPLIT, VALIDSPLIT])
    # The distributed sampler gives each rank a different shard of the data
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_subset, num_replicas=world_size, rank=rank
    )
    train_dataloader = torch.utils.data.DataLoader(
        train_subset, sampler=train_sampler, batch_size=batch_size
    )
    valid_sampler = SubsetRandomSampler(valid_subset.indices)
    valid_dataloader = torch.utils.data.DataLoader(
        train_data, batch_size=batch_size, sampler=valid_sampler
    )

    for epoch in tqdm(range(n_epochs), leave=False):
        train_sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        # Here you should have your train and validation loop
        # I used a custom library, that's why it is not included, hehe

    dist.destroy_process_group()
```

In case you want to do MP in the same node (multiple GPUs in the same machine), you can easily do that as:

```python
import torch.distributed.rpc as rpc

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '52355'
rpc.init_rpc('worker', rank=0, world_size=1)
[...]
unet = smp.Unet(
    encoder_name="resnet34",
    in_channels=1,
    classes=1,
    activation="sigmoid"
)
# I cannot test this right now because I don't have access to a machine
# with multiple GPUs, but this should be all the changes
unet.encoder.to("cuda:0")
unet.decoder.to("cuda:1")
unet.segmentation_head.to("cuda:1")
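# Note (assumption, untested): smp's Unet.forward does not move tensors between
# devices, so with this split the encoder features (on cuda:0) must be transferred
# to cuda:1 before the decoder runs, e.g. in a custom forward:
#     features = [f.to("cuda:1") for f in unet.encoder(x)]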
[...]
```

And finally, MP with multiple nodes. Luckily, I have the code for splitting SMP's Unet for MP using RPC:

```python
import os

import torch
import torch.optim as optim
import torch.distributed.rpc as rpc
import torch.distributed as dist
from torch.distributed.nn import RemoteModule
from torch.distributed.rpc import RRef, TensorPipeRpcBackendOptions
########################
from segmentation_models_pytorch.encoders import get_encoder
from segmentation_models_pytorch.decoders.unet.decoder import UnetDecoder
from segmentation_models_pytorch.base import SegmentationHead
from segmentation_models_pytorch.base.initialization import initialize_decoder, initialize_head
#####################

AMP = True  # toggle autocast mixed precision in the forward passes below
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class RPCUnet(torch.nn.Module):
    # This is like "the main model" for training and inference
    def __init__(self, remote_encoder, remote_decoder):
        super().__init__()
        # Define your layers here, for example:
        self.remote_encoder = remote_encoder
        self.remote_decoder = remote_decoder
        kwargs_seghead = {  # from smp source file
            "in_channels": 16,  # kwargs_decoder["decoder_channels"][-1]
            "out_channels": 1,
            "activation": None,
            "kernel_size": 3
        }
        self.segmentation_head = SegmentationHead(**kwargs_seghead).to(DEVICE)
        initialize_head(self.segmentation_head)
        self.sigmoid = torch.nn.Sigmoid().to(DEVICE)

    def forward(self, x):
        x = x.to("cpu")  # RPC only works with cpu tensors
        x = self.remote_encoder(x)
        x = self.remote_decoder(x)
        x = x.to(DEVICE)  # decoder returns a single tensor; move it to the local GPU
        x = self.segmentation_head(x)
        x = self.sigmoid(x)
        return x

    def get_rref_parameters(self):
        # Collect RRefs to the remote parameters plus the local ones
        # (needed later for the distributed optimizer)
        rrefs = self.remote_encoder.remote_parameters()
        rrefs.extend(self.remote_decoder.remote_parameters())
        for param in self.segmentation_head.parameters():
            rrefs.append(RRef(param))
        return rrefs


class DecoderRPC(torch.nn.Module):
    """ Wrapper class to initialize the Unet decoder in a remote worker
    """
    def __init__(self):
        super().__init__()
        self.decoder = UnetDecoder(
            encoder_channels=(3, 64, 64, 128, 256, 512),
            decoder_channels=(256, 128, 64, 32, 16),
            n_blocks=5,
            use_batchnorm=True,
            center=False,
            attention_type=None
        )
        initialize_decoder(self.decoder)

    def forward(self, x):
        with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=AMP):
            x = [_x.to(DEVICE) for _x in x]
            x = self.decoder(*x)
        return x

    def train(self, mode: bool = True):
        self.decoder.to("cpu")
        super().train(mode)
        self.decoder.to("cuda")


class EncoderRPC(torch.nn.Module):
    """ Wrapper class to initialize the Unet encoder in a remote worker
    """
    def __init__(self):
        super().__init__()
        self.encoder = get_encoder(
            "resnet34",
            in_channels=1,
            depth=5,  # from smp source file
            weights="imagenet"
        )

    def forward(self, x):
        with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=AMP):
            x = x.to(DEVICE)
            x = self.encoder(x)
        return x

    def train(self, mode: bool = True):
        self.encoder.to("cpu")
        super().train(mode)
        self.encoder.to("cuda")


def make_modelRPC():
    # Build each Unet section individually to allow using RemoteModule with them
    remote_encoder = RemoteModule(
        "worker1/cuda",
        EncoderRPC,
    )
    remote_decoder = RemoteModule(
        "worker2/cuda",
        DecoderRPC,
    )
    layers = [remote_encoder, remote_decoder]
    rpc_model = RPCUnet(*layers)
    [...]
    return rpc_model


def init_worker(rank, world_size):
    rpc_backend_options = TensorPipeRpcBackendOptions(
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:52355",
    )
    # Master
    if rank == 0:
        rpc.init_rpc(
            "master",
            rank=rank,
            world_size=world_size,
            rpc_backend_options=rpc_backend_options,
        )
        rpc_model = make_modelRPC()
        # Just load your data and train as when using 1 GPU
        [...]
    elif rank > 0:  # in [1, 2]
        # Initialize RPC. Workers just wait for RPCs from the master.
        worker_name = "worker{}".format(rank)
        rpc.init_rpc(
            worker_name,
            rank=rank,
            world_size=world_size,
            rpc_backend_options=rpc_backend_options,
        )
    rpc.shutdown()
    print(rank, "RPC shutdown.")


if __name__ == "__main__":
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    init_worker(rank, world_size)
```

For MP in a distributed environment you also need to use distributed autograd and a distributed optimizer, as stated here: https://pytorch.org/docs/stable/rpc.html
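For completeness, this is roughly what the training step looks like with distributed autograd and the distributed optimizer (a sketch following the code above, not tested here; `rpc_model`, `train_dataloader`, and `loss` are the names from the examples):

```python
import torch.distributed.autograd as dist_autograd
from torch.distributed.optim import DistributedOptimizer

# The optimizer must hold RRefs to the parameters living on all workers
opt = DistributedOptimizer(
    torch.optim.Adam,
    rpc_model.get_rref_parameters(),
    lr=0.001,
)

for X, y in train_dataloader:
    # Each forward/backward pair runs inside a distributed autograd context
    with dist_autograd.context() as context_id:
        pred = rpc_model(X)
        l = loss(pred, y)
        dist_autograd.backward(context_id, [l])
        opt.step(context_id)  # no zero_grad(): grads are scoped to the context
```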
@Patataman Thanks a lot for taking the time to write these code samples!
Btw, two things that might come in handy when testing. In my case, when running MP with RPC I had to manually set the network interface of each machine via environment variables (e.g. `TP_SOCKET_IFNAME` for the TensorPipe backend), otherwise the workers could not reach each other.

The second thing is that I think there is no actual need to manually create each part of the model as I did in the code I shared. At the time, that was the only way I managed to get it working, but I think something like this should also work (a hypothetical, untested reconstruction; the `UnetStage` wrapper is just illustrative):
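```python
# Hypothetical, untested sketch: instead of rebuilding encoder/decoder by hand,
# reuse the stock smp.Unet. RemoteModule needs a module *class*, so use a small
# wrapper that builds the Unet and keeps only the requested stage.
import torch
import segmentation_models_pytorch as smp
from torch.distributed.nn import RemoteModule

class UnetStage(torch.nn.Module):
    def __init__(self, stage: str):
        super().__init__()
        unet = smp.Unet(encoder_name="resnet34", in_channels=1, classes=1)
        self.stage = getattr(unet, stage)  # "encoder" or "decoder"

    def forward(self, x):
        # the encoder takes a tensor, the decoder takes the list of encoder features
        return self.stage(*x) if isinstance(x, (list, tuple)) else self.stage(x)

remote_encoder = RemoteModule("worker1/cuda", UnetStage, args=("encoder",))
remote_decoder = RemoteModule("worker2/cuda", UnetStage, args=("decoder",))
```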
I think that should work. I don't remember now why it didn't work for me at the time, but I have been able to do similar things with other models. The only limitation of RPC (afaik) is that the data must be pickleable.