🚀 Feature

Following #1307, if the user does not set `torch.cuda.set_device("cuda:lrank")`, ignite's code in `ignite/ignite/distributed/comp_models/native.py` (lines 99 to 102 in 0c41778) will use the same device `cuda:0` for the `all_reduce` op.

For older NCCL, it will set itself up such that the i-th proc uses the `cuda:0` device, and thus the following collective op will hang with other devices. For example:
```python
import torch
import torch.distributed as dist


def main():
    # !!! We do not call torch.cuda.set_device("cuda:lrank")
    dist.init_process_group(backend="nccl", init_method="env://")

    import os
    local_rank = int(os.environ["LOCAL_RANK"])

    tensor = torch.tensor([local_rank + 1]).to("cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    tensor = torch.tensor([local_rank + 1]).to("cuda:{}".format(local_rank))
    # PROGRAM WILL HANG HERE >>>>
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
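For reference, the hang disappears once each process pins itself to its own GPU before the first collective call; a minimal sketch of a corrected script, assuming a launcher that sets the `LOCAL_RANK` environment variable (e.g. `torchrun --nproc_per_node=2 main.py`):

```python
import os

import torch
import torch.distributed as dist


def main():
    # Pin this process to its own GPU *before* the first collective op,
    # so NCCL pairs rank i with cuda:i instead of every rank using cuda:0.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl", init_method="env://")

    # "cuda" now resolves to cuda:<local_rank> for this process
    tensor = torch.tensor([local_rank + 1]).to("cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With 2 processes, each rank should then print `tensor([2])` on its own device.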
With newer NCCL versions, the uncorrected script raises the error reported in #1307 instead of hanging.

Let's improve the code by raising a warning in the native and horovod distributed models when `idist.device()` is called and the current CUDA device index is smaller than the local rank.
The PyTorch docs suggest using 1 proc per 1 CUDA device, i.e. the local rank should be equal to the CUDA device index.
However, it is also possible to have M procs with K devices per proc (e.g. 4 procs with 2 GPUs per proc), in which case local rank <= cuda device index.
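A rough sketch of what such a check could look like (the helper name and the warning message below are placeholders, not ignite's actual API):

```python
import warnings

import torch


def _warn_if_device_mismatch(local_rank: int) -> None:
    # Hypothetical helper: warn when the current CUDA device index is smaller
    # than the local rank, which usually means torch.cuda.set_device(local_rank)
    # was never called, so collective ops may hang (old NCCL) or fail (new NCCL).
    if not torch.cuda.is_available():
        return
    index = torch.cuda.current_device()
    if index < local_rank:
        warnings.warn(
            "Current CUDA device index ({}) is smaller than the local rank ({}). "
            "Consider calling torch.cuda.set_device(local_rank) in each process "
            "before any collective op.".format(index, local_rank)
        )
```

`idist.device()` in the native and horovod comp models could call such a check with the backend's local rank before returning the device.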