[train] wrap BackendExecutor in ray.remote() #20123
Conversation
        return next_results

    def _fetch_next_result(self) -> Optional[List[Dict]]:
With these changes, the "processing" of TrainingResults is happening in different places: reports are processed (yielded) in __next__, while checkpoints are processed in either _fetch_next_result or _finish_checkpointing.
Is it possible to refactor this a bit so that all of the processing happens in one place?
One possible implementation: we can have one method that just obtains the next TrainingResults from the actor, def _get_next_result(self) -> Optional[List[TrainingResult]], and then move most of the logic into __next__ (or use some other helper method):
def _get_next_result(self) -> Optional[List[TrainingResult]]:
    results = ray.get(self._executor.get_next_results.remote())
    return results

def _pause_reporting(self):
    ray.get(self._executor.pause_reporting.remote())

def _finish_training(self):
    # Assumes that all reporting and checkpointing are already finished.
    return ray.get(self._executor.finish_training.remote())

def __next__(self):
    while True:
        if self.is_finished():
            raise StopIteration
        next_results = self._run_with_error_handling(self._get_next_result)
        if next_results is None:
            # There are no more reports or checkpoints,
            # so we don't need to pause reporting here.
            try:
                self._final_results = self._run_with_error_handling(
                    self._finish_training)
            finally:
                self._finished_training = True
        else:
            first_result = next_results[0]
            result_type = first_result.type
            if result_type is TrainingResultType.REPORT:
                result_data = [r.data for r in next_results]
                return result_data
            elif result_type is TrainingResultType.CHECKPOINT:
                self._checkpoint_manager._process_checkpoint(next_results)
                # Iterate until the next REPORT call or until training
                # has finished.
            else:
                raise TrainBackendError(
                    f"Unexpected result type: {result_type}. "
                    f"Expected one of {list(TrainingResultType)}.")

def get_final_results(self, force: bool = False) -> List[T]:
    if not self.is_finished():
        assert self._final_results is None
        if force:
            # Pause reporting.
            self._run_with_error_handling(self._pause_reporting)
            # Iterate and process the remaining checkpoints. This also sets
            # self._final_results and self._finished_training.
            for _ in self:
                pass
            assert self.is_finished()
        else:
            logger.info("Please finish iterating through the "
                        "intermediate results before getting the "
                        "final returns. If you would like "
                        "training to finish immediately and get "
                        "the final returns, then set "
                        "`force=True`.")
    return self._final_results
And we no longer need _fetch_next_result or _finish_checkpointing.
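For illustration, the trainer-facing usage of such an iterator might look roughly like the following hypothetical sketch; `iterator` stands in for the training iterator object whose methods are sketched above, and nothing here is prescribed by the PR itself:

for intermediate_results in iterator:
    # Each item is the list of per-worker report dicts returned from
    # __next__ above.
    print(intermediate_results)

# After iteration completes, the final return values of the training
# function are available; passing force=True would instead pause
# reporting and finish training immediately.
final_results = iterator.get_final_results()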
Yeah, while I do agree we can do something like this, I don't actually understand the original concern:
"With these changes, the 'processing' of TrainingResults is happening in different places: reports are processed (yielded) in __next__, while checkpoints are processed in either _fetch_next_result or _finish_checkpointing."
This logic was just moved up from BackendExecutor.fetch_next_result and BackendExecutor.finish_training.
Do you think this refactoring should be done in this PR?
It’s just to simplify the logic and the abstractions. Previously the abstraction was that BackendExecutor would only return report results that the trainer will consume. But now that the trainer is also doing checkpointing, we don’t need an abstraction that only returns report results; it just feels like extra indirection to me.
What do you think about the code snippet above?
If it’s a pretty quick fix, then we can do it in this PR. If not, then we can do it in a follow-up.
I tried making this change and one of the tests (test_worker_failure_2) started hanging 😅 I might have missed one of the error handling wrappers or something.
Can I follow up with this in a separate PR? I'd like to spend some more time thinking about this in general as well.
Ok sounds good, let's do this refactor in a separate PR. Should we add a TODO or track this somehow?
Created a ticket to track this here: #20330
Thanks, sounds good!
LGTM! Just left some minor comments
@@ -1146,11 +1025,12 @@ def train_actor_failure():
    with patch.object(
            new_backend_executor_cls,
            "_get_dataset_shards",
            return_value=dataset_splits) as mock_method:
Thanks!
Why are these changes needed?
This PR allows Ray Train to be run with Ray Client. Wrapping the BackendExecutor allows the primary execution to occur on the cluster, while Trainer remains on the driver.

Changes
- Wrap BackendExecutor in ray.remote(). Use force_on_current_node to ensure this is scheduled on the head node in Ray Client mode.
- Change self._executor.XYZ() to ray.get(self._executor.XYZ.remote()). Wrapping in ray.get allows for synchronous execution (a minimal sketch of this pattern follows the list).
- Move checkpoint_manager from BackendExecutor to Trainer. For Ray Client mode, persisted checkpoints will be written to the disk of the driver.
- Pass TrainingResults up from BackendExecutor to Trainer. The Trainer will then process the report/checkpoint results within _fetch_next_result.
- Move finish_training from BackendExecutor to Trainer. This allows the Trainer to call the CheckpointManager while flushing the result queue, and removes this logic from BackendExecutor.
- Move some tests from test_trainer.py into a new test_examples.py, as test_trainer exceeds the 900 second timeout.
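A minimal sketch of the wrapping pattern described above, using a stand-in DemoExecutor class rather than Ray Train's actual BackendExecutor (whose constructor arguments and import path are omitted here); force_on_current_node is also left out, since its placement is specific to Ray Client setups:

import ray

class DemoExecutor:
    # Stand-in for BackendExecutor; the real class takes backend/worker
    # configuration arguments that this sketch omits.
    def start(self):
        return "started"

    def get_next_results(self):
        return [{"loss": 0.1}]

ray.init()

# Wrap the class so it runs as an actor on the cluster instead of in the
# driver process (in the PR, force_on_current_node additionally pins it to
# the head node for Ray Client mode).
RemoteExecutor = ray.remote(DemoExecutor)
executor = RemoteExecutor.remote()

# Every former `executor.method()` call becomes
# `ray.get(executor.method.remote())`, which blocks until the remote call
# completes, preserving synchronous semantics.
print(ray.get(executor.start.remote()))
print(ray.get(executor.get_next_results.remote()))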
Related issue number

Checks
I've run scripts/format.sh to lint the changes in this PR.