
Add util to create a torch ddp process group for a list of workers. #34202

Merged
12 commits merged on Apr 19, 2023

Conversation

gjoliver
Member

Why are these changes needed?

For running DeepSpeed jobs.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [*] Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam
Contributor

amogkam commented Apr 10, 2023

Is it possible to consolidate on a singular code path?

@gjoliver gjoliver requested a review from amogkam April 10, 2023 18:15
@gjoliver
Member Author

Is it possible to consolidate on a singular code path?

The goal is to not share code with Train and to keep this a generic AIR util for now.
It will be used for LLM training and serving, where each Trial / Replica is a DDP group,
so there may be things that differ from Train.
What do you think?
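
For context, a rough usage sketch of the shape described above, assuming init_torch_dist_process_group (defined in python/ray/air/util/torch_dist.py in this PR) takes a list of Ray actor handles plus a backend name and returns the local ranks; the exact signature is inferred, not confirmed in this thread:

```python
# Rough sketch only; the call signature of init_torch_dist_process_group is
# assumed here (list of actor handles + backend), not confirmed by this thread.
import ray
from ray.air.util.torch_dist import init_torch_dist_process_group


@ray.remote(num_gpus=1)
class TrainingWorker:
    def train(self):
        # After the group is set up, torch.distributed collectives / DDP work here.
        ...


# One Trial / Replica owns its own set of workers, i.e. its own DDP group.
workers = [TrainingWorker.remote() for _ in range(4)]
local_ranks = init_torch_dist_process_group(workers, backend="nccl")
```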

@amogkam amogkam self-assigned this Apr 11, 2023
Contributor
@jovany-wang jovany-wang left a comment

3 high-level questions:

  1. What's the difference between this and torch DDP in Ray Train?
  2. How can we distribute workers onto different nodes?
  3. What's the whole pipeline for integrating DeepSpeed?

python/ray/air/util/torch_dist.py
@gjoliver
Member Author

3 high-level questions:

  1. What's the difference between this and torch DDP in Ray Train?
  2. How can we distribute workers onto different nodes?
  3. What's the whole pipeline for integrating DeepSpeed?

Pretty much no difference.
I will cc you on the example notebook, and you will see how we integrate DeepSpeed in this case.
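
For readers without access to the notebook, a hypothetical sketch (not from this PR) of what the per-worker DeepSpeed step could look like once the process group exists; deepspeed.initialize reuses an already-initialized torch.distributed state:

```python
# Hypothetical per-worker step (not part of this PR): DeepSpeed picks up the
# torch.distributed process group that the AIR util already initialized.
import deepspeed
import torch
import torch.nn as nn


def build_deepspeed_engine(ds_config: dict):
    assert torch.distributed.is_initialized()  # set up by the AIR util
    model = nn.Linear(16, 1)  # placeholder model for illustration
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    return engine, optimizer
```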

@fecet
Copy link

fecet commented Apr 16, 2023

Hello, I'm really impressed with this feature! Would it be possible for me to obtain a copy of the example notebook?

# Wait for all workers to join the process group.
ray.get(setup_futures)

return local_ranks
Member

should we also return a dict mapping IPs to world ranks?

Member Author

I didn't see the need, because I want to use the index of the worker as its global rank.
Do you see any flaw in this approach?
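
To make the rank bookkeeping in this exchange concrete, an illustrative helper (names are hypothetical, not the PR's code): the global rank is the worker's index in the input list, and the local rank is its index among the workers on the same node.

```python
# Illustrative sketch (names are hypothetical, not the PR's actual code):
# global rank = index of the worker in the input list,
# local rank  = index of the worker among the workers on its node.
from collections import defaultdict
from typing import Dict, List


def compute_ranks(worker_node_ids: List[str]) -> Dict[int, Dict[str, int]]:
    node_to_count: Dict[str, int] = defaultdict(int)
    ranks = {}
    for global_rank, node_id in enumerate(worker_node_ids):
        ranks[global_rank] = {
            "rank": global_rank,
            "local_rank": node_to_count[node_id],
        }
        node_to_count[node_id] += 1
    return ranks


# e.g. two workers on node "a", one on node "b":
# compute_ranks(["a", "a", "b"]) ->
# {0: {"rank": 0, "local_rank": 0}, 1: {"rank": 1, "local_rank": 1}, 2: {"rank": 2, "local_rank": 0}}
```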

Member
@Yard1 Yard1 left a comment

lgtm, one question

Contributor
@amogkam amogkam left a comment

I would really like to unify on a common code path. All the differences between the Ray Train logic and this logic can be abstracted away via function arguments (see the sketch after this list):

  1. A common utility function for sharing CUDA visible devices that also handles the multiple-GPUs-per-worker case.
  2. A common utility function for getting the local rank, local world size, and node rank.
  3. TorchBackend.on_start should call the init_torch_dist_process_group function.
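
As a concrete reading of item 1, an illustrative helper (not Ray Train's actual implementation): every worker on a node exposes the union of GPU ids used by its co-located workers.

```python
# Hypothetical illustration of item 1 above (not Ray Train's actual implementation):
# every worker on a node should see the union of the GPU ids assigned to the
# co-located workers, so that torch can address peer devices by index.
import os
from typing import Dict, List


def share_cuda_visible_devices(node_to_gpu_ids: Dict[str, List[int]], node_id: str) -> None:
    # Expose all GPUs used by workers on this node, sorted for a deterministic order.
    visible = ",".join(str(i) for i in sorted(set(node_to_gpu_ids[node_id])))
    os.environ["CUDA_VISIBLE_DEVICES"] = visible
```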

local_world_size: int,
master_addr: str,
master_port: str,
gpu_ids: List[int],
Contributor

this may not be List[int] if using multiple GPUs per worker.

Member Author

Added logic to flatten the GPU ids for the multiple-GPUs-per-worker case.
Changed the unit test to include this case too.
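
An illustrative version of that flattening (variable names are hypothetical): with multiple GPUs per worker, each worker reports a list of ids, so the per-node collection is a list of lists.

```python
# Illustrative sketch of the flattening described above (names are hypothetical):
# with multiple GPUs per worker, each worker reports a list of GPU ids, so the
# per-node collection becomes a list of lists that has to be flattened.
from typing import List

per_worker_gpu_ids: List[List[int]] = [[0, 1], [2, 3]]  # two workers, two GPUs each
flat_gpu_ids = sorted({gpu_id for ids in per_worker_gpu_ids for gpu_id in ids})
assert flat_gpu_ids == [0, 1, 2, 3]
```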

# All the workers on a specific node.
node_to_workers = {}
# All the gpu ids visible to all the workers on a specific node.
node_to_gpu_ids = {}
Contributor

you can do defaultdict(list) instead of needing to do setdefault every time.

Member Author

sure
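
The suggested simplification, illustrated (not the PR's exact code):

```python
# The suggestion above, illustrated (not the PR's exact code).
from collections import defaultdict

# With plain dicts, every insertion needs setdefault:
node_to_workers = {}
node_to_workers.setdefault("node-1", []).append("worker-0")

# With defaultdict(list), the empty list is created automatically:
node_to_workers = defaultdict(list)
node_to_workers["node-1"].append("worker-0")
```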

return func(*args, **kwargs)
except Exception as e:
skipped = skip_exceptions(e)
raise skipped from exception_cause(skipped)
Contributor

Do we want to remove all this skip_exceptions stuff? It only works when used with Train/Tune.

Member Author

ah, good to know. thanks for the comment.

return node_id, gpu_ids


def init_torch_dist_process_group(
Contributor
@amogkam amogkam Apr 17, 2023

also need to add corresponding shutdown logic?

Member Author

good point. done.
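
For reference, a minimal sketch of such shutdown logic (the helper name is hypothetical; destroy_process_group is the standard torch.distributed teardown call):

```python
# Minimal sketch of the per-worker shutdown side (helper name is hypothetical;
# destroy_process_group is the standard torch.distributed teardown call).
import torch.distributed as dist


def shutdown_torch_dist_process_group() -> None:
    if dist.is_initialized():
        dist.destroy_process_group()
```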

return node_id, gpu_ids


def init_torch_dist_process_group(
Contributor

Is this intended to be a public API? If not, let's move it into the _internal package.

Member Author

This will be included in our public-facing examples.
If we move those predictors into ray/air/, we can mark this private, or get rid of it altogether and refactor Train to share this logic.

raise RuntimeError("Distributed torch is not available.")

# Build a map from node_id to workers on that node.
node_and_gpu_ids = ray.get(
Member

would it be possible to make sure that we sort the workers by gpu id to avoid the issue fixed in #33159?

Member Author

I'm using a set to collect the per-node visible GPUs now, and list(set) will always be sorted.
Added a comment about this.
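
A minimal sketch of that set-based collection (the data is illustrative). Note that Python does not formally guarantee that iterating a set yields sorted order, so an explicit sorted() is used here to keep the ordering deterministic:

```python
# Minimal sketch of the set-based collection discussed above (data is illustrative).
# An explicit sorted() is used because Python does not guarantee that iterating
# a set yields the elements in sorted order.
from collections import defaultdict
from typing import Dict, List, Set, Tuple

node_and_gpu_ids: List[Tuple[str, List[int]]] = [
    ("node-1", [1, 0]),
    ("node-1", [2]),
    ("node-2", [0]),
]

node_to_gpu_ids: Dict[str, Set[int]] = defaultdict(set)
for node_id, gpu_ids in node_and_gpu_ids:
    node_to_gpu_ids[node_id].update(gpu_ids)

sorted_gpu_ids = {node: sorted(ids) for node, ids in node_to_gpu_ids.items()}
# {"node-1": [0, 1, 2], "node-2": [0]}
```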

Jun Gong added 8 commits April 18, 2023 09:59
Signed-off-by: Jun Gong <[email protected]>
Signed-off-by: Jun Gong <[email protected]>
Signed-off-by: Jun Gong <[email protected]>
Signed-off-by: Jun Gong <[email protected]>
Signed-off-by: Jun Gong <[email protected]>
Jun Gong added 3 commits April 18, 2023 11:26
Signed-off-by: Jun Gong <[email protected]>
fix
Signed-off-by: Jun Gong <[email protected]>
Signed-off-by: Jun Gong <[email protected]>
Member Author
@gjoliver gjoliver left a comment

Thanks for all the comments. PTAL.

@gjoliver
Member Author

Unit tests all pass now; the lint error is not related.
@amogkam can I get your blessing? Thanks.

python/ray/air/util/torch_dist.py
Co-authored-by: Amog Kamsetty <[email protected]>
Signed-off-by: Jun Gong <[email protected]>
@gjoliver gjoliver merged commit a0255e5 into ray-project:master Apr 19, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023