Show a warning if current device index is lower than current local rank #1308

Closed
vfdev-5 opened this issue Sep 22, 2020 · 2 comments · Fixed by #1335 or #1376
vfdev-5 commented Sep 22, 2020

🚀 Feature

Following #1307, if the user does not call torch.cuda.set_device("cuda:<local rank>"), ignite's code

def _compute_nproc_per_node(self):
    tensor = torch.tensor([self.get_local_rank() + 1]).to(self.device())
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    return tensor.item()

will use the same device cuda:0 for the all_reduce op in every process.

For older NCCL versions, NCCL will set itself up such that the i-th process uses the cuda:0 device, and a subsequent collective op on another device will then hang. For example:

import os

import torch
import torch.distributed as dist


def main():

    # !!! We do not call torch.cuda.set_device("cuda:<local rank>")

    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])

    # "cuda" resolves to cuda:0 in every process, so NCCL binds all ranks to cuda:0
    tensor = torch.tensor([local_rank + 1]).to("cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    # Now the tensor lives on cuda:<local rank>, which does not match the device
    # NCCL was initialized with
    tensor = torch.tensor([local_rank + 1]).to("cuda:{}".format(local_rank))
    # PROGRAM WILL HANG HERE >>>>
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

For newer NCCL versions, it raises an error instead, as in #1307.
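
For reference, here is a minimal sketch of the usual fix, assuming the launcher (e.g. torch.distributed.launch) exports LOCAL_RANK: pin the current cuda device to the local rank before issuing any collective op, so that each process talks to its own GPU.

import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    # Pin this process to its own GPU before any collective op
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl", init_method="env://")

    # "cuda" now resolves to cuda:<local rank>, so the all_reduce does not hang
    tensor = torch.tensor([local_rank + 1]).to("cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()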

Let's improve the code by raising a warning in the native and horovod distributed computation models when idist.device() is called and the current cuda device index is smaller than the local rank.
PyTorch docs suggest using 1 process per cuda device => the local rank should be equal to the cuda device index.
However, it is also possible to have M processes with K devices per process (e.g. 4 processes with 2 GPUs per process) => local rank <= cuda device index.
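
For illustration, a rough sketch of how such a check could look inside a computation model's device() method (the structure and exact warning message here are assumptions, not the final implementation):

import warnings

import torch


def device(self):
    if torch.cuda.is_available():
        # Compare the device CUDA is currently bound to with this process's local rank
        index = torch.cuda.current_device()
        if index < self.get_local_rank():
            warnings.warn(
                "Current device index is less than current local rank. "
                "Please make sure to call torch.cuda.set_device(local_rank)."
            )
        return torch.device("cuda:{}".format(index))
    return torch.device("cpu")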

HelioStrike commented Oct 2, 2020

@vfdev-5 So the following code needs to be added in native.py and horovod.py.

if index < self.get_local_rank():
    warnings.warn("Current device index is less than current local rank.")

Is there anything else that needs to be done? I've made a PR for the changes so far.

vfdev-5 commented Oct 8, 2020

Closed via #1376

vfdev-5 closed this as completed Oct 8, 2020