[Train] Improvements to fault tolerance #22511
Conversation
python/ray/train/backend.py
Outdated
worker_group.start()
logger.info("Setting up distributed backend on all workers.")
Are these logs too verbose?
Hmmm, most people don't run with log level DEBUG, and for very long-running training jobs (15+ hours) I think these messages could be useful, since it's not feasible to run the training job again.
cc @matthewdeng would also like to hear your thoughts here.
I think INFO is fine given that this is only logged during failures, which should hopefully be rare.
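For reference, a minimal sketch of the failure-path logging being discussed; the surrounding function is hypothetical, only the log message comes from the diff above.

```python
import logging

logger = logging.getLogger(__name__)

def restart_workers(worker_group):
    # Hypothetical wrapper: this code path only runs while recovering from a
    # worker failure, so INFO-level messages stay visible to users running at
    # the default log level without being noisy during normal training.
    worker_group.start()
    logger.info("Setting up distributed backend on all workers.")
```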
python/ray/train/utils.py
Outdated
if at_least_one_failed_worker and len(unfinished) > 0:
    # If at least one worker has failed, but there are still workers
    # that we are waiting on results from, these workers are hanging.
    # Treat these workers as dead workers as well.
Do we actually need to treat these as dead? Or is it sufficient that we break out of this ray.wait call and only return the failed worker as dead?
That's a good point, we don't need to treat them as dead. Will make the change.
Actually no, these have to be marked as dead because new actors may not be created. And these actors are no longer usable since they are hanging on a particular method execution.
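To make the hanging-worker case concrete, here is a hedged sketch (names and structure are assumptions, not the PR's exact code) of how a grace-period ray.wait flushes out workers stuck on a collective call once another worker has failed:

```python
import ray

# Assumed grace period; making this configurable is discussed further below.
REMAINING_WORKERS_GRACE_PERIOD_S = 10

def wait_for_worker_results(unfinished, at_least_one_failed_worker):
    # Once any worker has failed, only wait a short grace period for the
    # rest. A worker that still hasn't returned is almost certainly blocked
    # on a collective call with the dead worker; its actor is stuck in that
    # method execution and cannot be reused, so it is treated as dead too.
    timeout = (
        REMAINING_WORKERS_GRACE_PERIOD_S if at_least_one_failed_worker else None
    )
    finished, still_hanging = ray.wait(
        unfinished, num_returns=len(unfinished), timeout=timeout
    )
    return finished, still_hanging
```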
python/ray/train/utils.py
Outdated
# If a failure occurs the ObjectRef will be marked as finished.
# Calling ray.get will expose the failure as a RayActorError.
for object_ref in finished:
    # Everything in finished has either failed or
or what 😨
Or completed successfully 😁...good catch, updated!
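A small self-contained sketch of that pattern, assuming standard Ray APIs: a ref from a failed worker still shows up as finished, and the failure only surfaces when the result is fetched. The ref_to_worker_index mapping is hypothetical.

```python
import ray
from ray.exceptions import RayActorError

def collect_dead_workers(finished, ref_to_worker_index):
    # Everything in `finished` has either failed or completed successfully;
    # ray.get on a ref from a dead actor raises RayActorError, which is how
    # the failed workers are identified.
    dead_worker_indexes = []
    for object_ref in finished:
        try:
            ray.get(object_ref)
        except RayActorError:
            dead_worker_indexes.append(ref_to_worker_index[object_ref])
    return dead_worker_indexes
```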
python/ray/train/utils.py
Outdated
# are alive, but hanging on collective calls because other
# workers have failed.
timeout = (
    REMAINING_WORKERS_GRACE_PERIOD_S if at_least_one_failed_worker else None
Change REMAINING_WORKERS_GRACE_PERIOD_S to support being set as an environment variable? Is 10 seconds too short?
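A minimal sketch of the suggested environment-variable override; reusing the constant's name for the variable and the 10-second default are assumptions:

```python
import os

# Let users tune the grace period without code changes, e.g. on clusters
# where healthy workers legitimately take longer than 10 seconds to return.
REMAINING_WORKERS_GRACE_PERIOD_S = float(
    os.environ.get("REMAINING_WORKERS_GRACE_PERIOD_S", 10)
)
```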
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
LGTM - I think you just need to pull in master to fix the lint issues.
python/ray/train/utils.py
Outdated
return False, dead_worker_indexes
else:
    return True, []
at_least_one_failed_worker = True
nit: Just return False here? Or remove the and not at_least_one_failed_worker check in line 48 if you want to print all failed workers.
good point, updated!
Various improvements to Ray Train fault tolerance.
- The default torch process group timeout is 30 minutes. If a failure occurs before a gradient synchronization, training will hang for 30 minutes before raising an error and triggering fault tolerance. This PR reduces the default timeout_s to 30 seconds (see the sketch below).
- Adds functionality to trigger fault tolerance even if any of the alive workers are hanging, by specifying a grace period for alive workers.
- Simplifies fault tolerance by removing the backend-specific handle_failure. If any workers have failed, all workers will be restarted and training will continue from the last checkpoint.
- Also adds a fault tolerance test with an actual torch example. When run locally, the test hangs before the fix and passes after it.
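As an illustration of the first change, a hedged sketch of overriding the process group timeout; the PR description confirms the timeout_s setting, while the import path and backend choice here are assumptions that depend on the Ray Train version in use:

```python
from ray.train.torch import TorchConfig

# With the old 30 minute default, a worker failure just before a gradient
# synchronization stalls every other worker for the full 30 minutes before
# fault tolerance can kick in; a 30 second timeout surfaces the error quickly
# so workers can be restarted from the last checkpoint.
torch_config = TorchConfig(backend="gloo", timeout_s=30)
```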
Why are these changes needed?
Related issue number
Checks
- I've run scripts/format.sh to lint the changes in this PR.