
[SGD] v2 prototype: BackendExecutor and TorchBackend implementation #17357

Merged: 40 commits into ray-project:master on Jul 29, 2021

Conversation

amogkam (Contributor) commented Jul 27, 2021

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
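
The template sections above were left blank; for orientation, here is a minimal usage sketch of the two classes named in the title, pieced together from the diff and test snippets later in this thread. The import paths follow the reviewed file paths, and the exact constructor and `run` signatures are assumptions, not the merged API:

```python
from ray.util.sgd.v2.backends.backend import BackendExecutor
from ray.util.sgd.v2.backends.torch import TorchConfig


def train_func():
    # Placeholder training function; executed once on every worker.
    import torch
    return torch.distributed.get_rank()


config = TorchConfig(backend="gloo", init_method="env")
executor = BackendExecutor(config, num_workers=2)
executor.start()                  # spin up workers, init torch.distributed
ranks = executor.run(train_func)  # one result per worker
executor.shutdown()               # destroy process group, stop workers
```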

Comment on lines 180 to 182
"This Trainer is not active. It is either shutdown already or "
"never started in the first place. Either create a new Trainer "
"or start this one.")
Contributor:

Trainer -> BackendExecutor

amogkam (Author):

The user isn't aware of BackendExecutor, right?
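
For context, the guard producing this message might look like the following minimal sketch. `DeactivatedWorkerGroup` appears in the diff further down, but the `__getattr__` trick and the exception class here are illustrative assumptions, not the PR's actual code:

```python
class InactiveWorkerGroupError(RuntimeError):
    """Illustrative exception type; the PR's actual error may differ."""


class DeactivatedWorkerGroup:
    # Placeholder installed before start() and after shutdown(); any
    # attribute access surfaces the message quoted above.
    def __getattr__(self, name):
        raise InactiveWorkerGroupError(
            "This Trainer is not active. It is either shutdown already or "
            "never started in the first place. Either create a new Trainer "
            "or start this one.")
```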

richardliaw (Contributor) commented Jul 28, 2021

There's a lot of state that gets tossed around between Backend, BackendExecutor, TorchConfig.

Can we instead reduce the places where we're keeping track of state?

Here's one attempt. The TL;DR is that TorchBackend becomes a purely "functional" callback and doesn't need to hold state itself.

diff --git a/python/ray/util/sgd/v2/backends/backend.py b/python/ray/util/sgd/v2/backends/backend.py
index 99ea5a20b..734425c0c 100644
--- a/python/ray/util/sgd/v2/backends/backend.py
+++ b/python/ray/util/sgd/v2/backends/backend.py
@@ -59,12 +59,14 @@ class BackendExecutor:
         self._num_gpus_per_worker = num_gpus_per_worker
 
         self.worker_group = DeactivatedWorkerGroup()
+        self._backend = get_backend(self._backend_config)
 
     def start(self):
         """Starts the worker group."""
         self.worker_group = WorkerGroup(self._num_workers,
                                         self._num_cpus_per_worker,
                                         self._num_gpus_per_worker)
+        self._backend.on_start(self.worker_group, self._backend_config)
 
     def execute(self, train_func: Callable[[], T]) -> Iterator[Any]:
         """Executes training function on all workers and yield results.
@@ -156,6 +158,7 @@ class BackendExecutor:
 
     def shutdown(self):
         """Shuts down the workers in the worker group."""
+        self._backend.on_shutdown(self.worker_group)
         self.worker_group.shutdown()
         self.worker_group = DeactivatedWorkerGroup()
 
diff --git a/python/ray/util/sgd/v2/backends/torch.py b/python/ray/util/sgd/v2/backends/torch.py
index 53d72e098..6170f2b94 100644
--- a/python/ray/util/sgd/v2/backends/torch.py
+++ b/python/ray/util/sgd/v2/backends/torch.py
@@ -91,57 +91,57 @@ def shutdown_torch():
         torch.cuda.empty_cache()
 
 
-class TorchExecutor(BackendExecutor):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self._backend_config.validate(name="torch")
-
-        if self._backend_config.backend is None:
-            if self._num_gpus_per_worker > 0:
-                self.backend = "nccl"
-            else:
-                self.backend = "gloo"
-
-    def start(self):
-        super().start()
-        if self._num_workers > 1:
-
+class TorchBackend:
+    # def __init__(self, backend_config, use_gpu=False):
+    #     self._backend_config = backend_config
+
+    #     # Can we actually just resolve this in config dataclass?
+    #     self._backend_config.validate(name="torch")
+
+    #     # Can we actually just resolve this too in config dataclass?
+    #     if self._backend_config.backend is None:
+    #         if use_gpu:
+    #             self.backend = "nccl"
+    #         else:
+    #             self.backend = "gloo"
+
+    def on_start(self, worker_group, backend_config):
+        if len(worker_group) > 1:
             def get_address():
                 addr = ray.util.get_node_ip_address()
                 port = find_free_port()
                 return addr, port
 
-            master_addr, master_port = self.worker_group.execute_single(
+            master_addr, master_port = worker_group.execute_single(
                 0, get_address)
 
-            if self._backend_config.init_method == "env":
+            if backend_config.init_method == "env":
 
                 def set_env_vars(addr, port):
                     os.environ["MASTER_ADDR"] = addr
                     os.environ["MASTER_PORT"] = str(port)
 
-                self.worker_group.execute(
+                worker_group.execute(
                     set_env_vars, addr=master_addr, port=master_port)
                 url = "env://"
-            elif self._backend_config == "tcp":
+            elif backend_config.init_method == "tcp":
                 url = f"tcp://{master_addr}:{master_port}"
             else:
                 raise ValueError(
                     f"The provided init_method ("
-                    f"{self._backend_config.init_method} is not supported.")
+                    f"{backend_config.init_method} is not supported.")
 
-            for i in range(len(self.worker_group)):
-                self.worker_group.execute_single(
+            for i in range(len(worker_group)):
+                worker_group.execute_single(
                     i,
                     setup_torch_process_group,
-                    backend=self.backend,
+                    backend=backend_config.backend,
                     world_rank=i,
-                    world_size=len(self.worker_group),
+                    world_size=len(worker_group),
                     init_method=url,
-                    timeout_s=self._backend_config.timeout_s)
+                    timeout_s=backend_config.timeout_s)
 
-    def shutdown(self):
-        self.worker_group.execute_single(
+    def on_shutdown(self, worker_group):
+        worker_group.execute_single(
             0, torch.distributed.destroy_process_group)
-        self.worker_group.execute(shutdown_torch)
-        super().shutdown()
+        worker_group.execute(shutdown_torch)
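
For readability, here is the end state of the two hunks above written out as plain code, reconstructed from the diff (constructor defaults are assumptions, and `on_start`/`on_shutdown` are written as regular instance methods):

```python
class BackendExecutor:
    def __init__(self, backend_config, num_workers=1,
                 num_cpus_per_worker=1, num_gpus_per_worker=0):
        self._backend_config = backend_config
        self._num_workers = num_workers
        self._num_cpus_per_worker = num_cpus_per_worker
        self._num_gpus_per_worker = num_gpus_per_worker
        self.worker_group = DeactivatedWorkerGroup()
        # The executor owns the backend object; the backend itself holds
        # no state and only reacts to lifecycle events.
        self._backend = get_backend(self._backend_config)

    def start(self):
        self.worker_group = WorkerGroup(self._num_workers,
                                        self._num_cpus_per_worker,
                                        self._num_gpus_per_worker)
        self._backend.on_start(self.worker_group, self._backend_config)

    def shutdown(self):
        self._backend.on_shutdown(self.worker_group)
        self.worker_group.shutdown()
        self.worker_group = DeactivatedWorkerGroup()
```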

@amogkam amogkam requested a review from matthewdeng July 28, 2021 23:29
amogkam (Author) commented Jul 28, 2021

Ok I addressed all the comments and removed all the reporting functionality. It would be great if you guys could take another look.

Comment on lines 71 to 85
@pytest.mark.parametrize("init_method", ["env", "tcp"])
def test_torch_start_shutdown(ray_start_2_cpus, init_method):
    torch_config = TorchConfig(backend="gloo", init_method=init_method)
    e = TorchExecutor(torch_config, num_workers=2)

    def check_process_group():
        import torch
        return (torch.distributed.is_initialized()
                and torch.distributed.get_world_size() == 2)

    assert all(e.run(check_process_group))

    e._backend.on_shutdown(e.worker_group, e._backend_config)

    assert not any(e.run(check_process_group))
Contributor:

Can we move tests for individual backends into their own test files?
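
For illustration, that split might look like this (file name, fixture body, and imports are assumptions based on the snippet above, not the PR's final layout):

```python
# test_torch.py -- hypothetical backend-specific test module.
import pytest

import ray


@pytest.fixture
def ray_start_2_cpus():
    # Local two-CPU cluster, torn down after each test.
    ray.init(num_cpus=2)
    yield
    ray.shutdown()


@pytest.mark.parametrize("init_method", ["env", "tcp"])
def test_torch_start_shutdown(ray_start_2_cpus, init_method):
    # Body identical to the snippet above, just in a torch-only file.
    ...
```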

richardliaw (Contributor) left a comment:

looks good! merge when tests pass.

@amogkam amogkam changed the title [SGD] v2 prototype: BackendExecutor and TorchExecutor implementation [SGD] v2 prototype: BackendExecutor and TorchBackend implementation Jul 29, 2021
@amogkam amogkam merged commit ff04a92 into ray-project:master Jul 29, 2021
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Jul 31, 2021
…on (ray-project#17357)

* wip

* formatting

* increase timeouts

* wip

* address comments

* comments

* fix

* address comments

* Update python/ray/util/sgd/v2/worker_group.py

Co-authored-by: Richard Liaw <[email protected]>

* Update python/ray/util/sgd/v2/worker_group.py

Co-authored-by: Richard Liaw <[email protected]>

* address comments

* formatting

* fix

* wip

* finish

* fix

* formatting

* remove reporting

* split TorchBackend

* fix tests

* address comments

* add file

* more fixes

* remove default value

* update run method doc

* add comment

* minor doc fixes

* lint

* add args to BaseWorker.execute

* address comments

* remove extra parentheses

* properly instantiate backend

* fix some of the tests

* fix torch setup

* fix type hint

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
@richardliaw richardliaw added this to the SGD v2 milestone Aug 3, 2021