
[SGD] v2 prototype: BackendExecutor and TorchBackend implementation #17357

Merged: 40 commits into ray-project:master on Jul 29, 2021

Conversation

amogkam (Contributor) commented Jul 27, 2021

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
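
The template sections above were left blank; for orientation, here is a minimal usage sketch of the two classes named in the title, pieced together from the diff and test snippets later in this thread. The import paths follow the reviewed file paths, and the exact constructor and `run` signatures are assumptions, not the merged API:

```python
from ray.util.sgd.v2.backends.backend import BackendExecutor
from ray.util.sgd.v2.backends.torch import TorchConfig


def train_func():
    # Placeholder training function; executed once on every worker.
    import torch
    return torch.distributed.get_rank()


config = TorchConfig(backend="gloo", init_method="env")
executor = BackendExecutor(config, num_workers=2)
executor.start()                  # spin up workers, init torch.distributed
ranks = executor.run(train_func)  # one result per worker
executor.shutdown()               # destroy process group, stop workers
```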

Comment on lines 180 to 182
"This Trainer is not active. It is either shutdown already or "
"never started in the first place. Either create a new Trainer "
"or start this one.")
Contributor:

Trainer -> BackendExecutor

amogkam (Author):

The user isn't aware of BackendExecutor, right?
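
For context, the guard producing this message might look like the following minimal sketch. `DeactivatedWorkerGroup` appears in the diff further down, but the `__getattr__` trick and the exception class here are illustrative assumptions, not the PR's actual code:

```python
class InactiveWorkerGroupError(RuntimeError):
    """Illustrative exception type; the PR's actual error may differ."""


class DeactivatedWorkerGroup:
    # Placeholder installed before start() and after shutdown(); any
    # attribute access surfaces the message quoted above.
    def __getattr__(self, name):
        raise InactiveWorkerGroupError(
            "This Trainer is not active. It is either shutdown already or "
            "never started in the first place. Either create a new Trainer "
            "or start this one.")
```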

richardliaw (Contributor) commented Jul 28, 2021

There's a lot of state that gets tossed around between Backend, BackendExecutor, TorchConfig.

Can we instead reduce the places where we're keeping track of state?

Here's one attempt. The TL;DR is that TorchBackend becomes a purely "functional" callback and doesn't need to hold state itself.

diff --git a/python/ray/util/sgd/v2/backends/backend.py b/python/ray/util/sgd/v2/backends/backend.py
index 99ea5a20b..734425c0c 100644
--- a/python/ray/util/sgd/v2/backends/backend.py
+++ b/python/ray/util/sgd/v2/backends/backend.py
@@ -59,12 +59,14 @@ class BackendExecutor:
         self._num_gpus_per_worker = num_gpus_per_worker
 
         self.worker_group = DeactivatedWorkerGroup()
+        self._backend = get_backend(self._backend_config)
 
     def start(self):
         """Starts the worker group."""
         self.worker_group = WorkerGroup(self._num_workers,
                                         self._num_cpus_per_worker,
                                         self._num_gpus_per_worker)
+        self._backend.on_start(self.worker_group, self._backend_config)
 
     def execute(self, train_func: Callable[[], T]) -> Iterator[Any]:
         """Executes training function on all workers and yield results.
@@ -156,6 +158,7 @@ class BackendExecutor:
 
     def shutdown(self):
         """Shuts down the workers in the worker group."""
+        self._backend.on_shutdown(self.worker_group)
         self.worker_group.shutdown()
         self.worker_group = DeactivatedWorkerGroup()
 
diff --git a/python/ray/util/sgd/v2/backends/torch.py b/python/ray/util/sgd/v2/backends/torch.py
index 53d72e098..6170f2b94 100644
--- a/python/ray/util/sgd/v2/backends/torch.py
+++ b/python/ray/util/sgd/v2/backends/torch.py
@@ -91,57 +91,57 @@ def shutdown_torch():
         torch.cuda.empty_cache()
 
 
-class TorchExecutor(BackendExecutor):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self._backend_config.validate(name="torch")
-
-        if self._backend_config.backend is None:
-            if self._num_gpus_per_worker > 0:
-                self.backend = "nccl"
-            else:
-                self.backend = "gloo"
-
-    def start(self):
-        super().start()
-        if self._num_workers > 1:
-
+class TorchBackend:
+    # def __init__(self, backend_config, use_gpu=False):
+    #     self._backend_config = backend_config
+
+    #     # Can we actually just resolve this in config dataclass?
+    #     self._backend_config.validate(name="torch")
+
+    #     # Can we actually just resolve this too in config dataclass?
+    #     if self._backend_config.backend is None:
+    #         if use_gpu:
+    #             self.backend = "nccl"
+    #         else:
+    #             self.backend = "gloo"
+
+    def on_start(self, worker_group, backend_config):
+        if len(worker_group) > 1:
             def get_address():
                 addr = ray.util.get_node_ip_address()
                 port = find_free_port()
                 return addr, port
 
-            master_addr, master_port = self.worker_group.execute_single(
+            master_addr, master_port = worker_group.execute_single(
                 0, get_address)
 
-            if self._backend_config.init_method == "env":
+            if backend_config.init_method == "env":
 
                 def set_env_vars(addr, port):
                     os.environ["MASTER_ADDR"] = addr
                     os.environ["MASTER_PORT"] = str(port)
 
-                self.worker_group.execute(
+                worker_group.execute(
                     set_env_vars, addr=master_addr, port=master_port)
                 url = "env://"
-            elif self._backend_config == "tcp":
+            elif backend_config.init_method == "tcp":
                 url = f"tcp://{master_addr}:{master_port}"
             else:
                 raise ValueError(
                     f"The provided init_method ("
-                    f"{self._backend_config.init_method} is not supported.")
+                    f"{backend_config.init_method} is not supported.")
 
-            for i in range(len(self.worker_group)):
-                self.worker_group.execute_single(
+            for i in range(len(worker_group)):
+                worker_group.execute_single(
                     i,
                     setup_torch_process_group,
-                    backend=self.backend,
+                    backend=backend_config.backend,
                     world_rank=i,
-                    world_size=len(self.worker_group),
+                    world_size=len(worker_group),
                     init_method=url,
-                    timeout_s=self._backend_config.timeout_s)
+                    timeout_s=backend_config.timeout_s)
 
-    def shutdown(self):
-        self.worker_group.execute_single(
+    def on_shutdown(self, worker_group):
+        worker_group.execute_single(
             0, torch.distributed.destroy_process_group)
-        self.worker_group.execute(shutdown_torch)
-        super().shutdown()
+        worker_group.execute(shutdown_torch)
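
For readability, here is the end state of the two hunks above written out as plain code, reconstructed from the diff (constructor defaults are assumptions, and `on_start`/`on_shutdown` are written as regular instance methods):

```python
class BackendExecutor:
    def __init__(self, backend_config, num_workers=1,
                 num_cpus_per_worker=1, num_gpus_per_worker=0):
        self._backend_config = backend_config
        self._num_workers = num_workers
        self._num_cpus_per_worker = num_cpus_per_worker
        self._num_gpus_per_worker = num_gpus_per_worker
        self.worker_group = DeactivatedWorkerGroup()
        # The executor owns the backend object; the backend itself holds
        # no state and only reacts to lifecycle events.
        self._backend = get_backend(self._backend_config)

    def start(self):
        self.worker_group = WorkerGroup(self._num_workers,
                                        self._num_cpus_per_worker,
                                        self._num_gpus_per_worker)
        self._backend.on_start(self.worker_group, self._backend_config)

    def shutdown(self):
        self._backend.on_shutdown(self.worker_group)
        self.worker_group.shutdown()
        self.worker_group = DeactivatedWorkerGroup()
```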

@amogkam amogkam requested a review from matthewdeng July 28, 2021 23:29
amogkam (Author) commented Jul 28, 2021

Ok I addressed all the comments and removed all the reporting functionality. It would be great if you guys could take another look.

Comment on lines 71 to 85
@pytest.mark.parametrize("init_method", ["env", "tcp"])
def test_torch_start_shutdown(ray_start_2_cpus, init_method):
    torch_config = TorchConfig(backend="gloo", init_method=init_method)
    e = TorchExecutor(torch_config, num_workers=2)

    def check_process_group():
        import torch
        return (torch.distributed.is_initialized()
                and torch.distributed.get_world_size() == 2)

    assert all(e.run(check_process_group))

    e._backend.on_shutdown(e.worker_group, e._backend_config)

    assert not any(e.run(check_process_group))
Contributor:

Can we move tests for individual backends into their own test files?
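
For illustration, that split might look like this (file name, fixture body, and imports are assumptions based on the snippet above, not the PR's final layout):

```python
# test_torch.py -- hypothetical backend-specific test module.
import pytest

import ray


@pytest.fixture
def ray_start_2_cpus():
    # Local two-CPU cluster, torn down after each test.
    ray.init(num_cpus=2)
    yield
    ray.shutdown()


@pytest.mark.parametrize("init_method", ["env", "tcp"])
def test_torch_start_shutdown(ray_start_2_cpus, init_method):
    # Body identical to the snippet above, just in a torch-only file.
    ...
```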

richardliaw (Contributor) left a comment:

looks good! merge when tests pass.

@amogkam amogkam changed the title [SGD] v2 prototype: BackendExecutor and TorchExecutor implementation [SGD] v2 prototype: BackendExecutor and TorchBackend implementation Jul 29, 2021
@amogkam amogkam merged commit ff04a92 into ray-project:master Jul 29, 2021
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Jul 31, 2021
…on (ray-project#17357)

* wip

* formatting

* increase timeouts

* wip

* address comments

* comments

* fix

* address comments

* Update python/ray/util/sgd/v2/worker_group.py

Co-authored-by: Richard Liaw <[email protected]>

* Update python/ray/util/sgd/v2/worker_group.py

Co-authored-by: Richard Liaw <[email protected]>

* address comments

* formatting

* fix

* wip

* finish

* fix

* formatting

* remove reporting

* split TorchBackend

* fix tests

* address comments

* add file

* more fixes

* remove default value

* update run method doc

* add comment

* minor doc fixes

* lint

* add args to BaseWorker.execute

* address comments

* remove extra parentheses

* properly instantiate backend

* fix some of the tests

* fix torch setup

* fix type hint

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
@richardliaw richardliaw added this to the SGD v2 milestone Aug 3, 2021