[Train] Improvements to fault tolerance #22511
Conversation
python/ray/train/backend.py
Outdated
worker_group.start()
logger.info("Setting up distributed backend on all workers.")
Are these logs too verbose?
Hmmm, most people don't run with log level DEBUG, and for very long-running training jobs (15+ hours) I think these messages could be useful, since it's not feasible to run the training job again.
cc @matthewdeng would also like to hear your thoughts here.
I think INFO is fine given that this is only logged during failures, which should hopefully be rare.
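For reference, a minimal sketch of the failure-path logging being discussed; the surrounding function is hypothetical, only the log message comes from the diff above.

```python
import logging

logger = logging.getLogger(__name__)

def restart_workers(worker_group):
    # Hypothetical wrapper: this code path only runs while recovering from a
    # worker failure, so INFO-level messages stay visible to users running at
    # the default log level without being noisy during normal training.
    worker_group.start()
    logger.info("Setting up distributed backend on all workers.")
```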
python/ray/train/utils.py
Outdated
if at_least_one_failed_worker and len(unfinished) > 0:
    # If at least one worker has failed, but there are still workers
    # that we are waiting on results from, these workers are hanging.
    # Treat these workers as dead workers as well.
Do we actually need to treat these as dead? Or is it sufficient that we break out of this ray.wait call and only return the failed worker as dead?
That's a good point, we don't need to treat them as dead. Will make the change.
Actually no, these have to be marked as dead because new actors may not be created. And these actors are no longer usable since they are hanging on a particular method execution.
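To make the hanging-worker case concrete, here is a hedged sketch (names and structure are assumptions, not the PR's exact code) of how a grace-period ray.wait flushes out workers stuck on a collective call once another worker has failed:

```python
import ray

# Assumed grace period; making this configurable is discussed further below.
REMAINING_WORKERS_GRACE_PERIOD_S = 10

def wait_for_worker_results(unfinished, at_least_one_failed_worker):
    # Once any worker has failed, only wait a short grace period for the
    # rest. A worker that still hasn't returned is almost certainly blocked
    # on a collective call with the dead worker; its actor is stuck in that
    # method execution and cannot be reused, so it is treated as dead too.
    timeout = (
        REMAINING_WORKERS_GRACE_PERIOD_S if at_least_one_failed_worker else None
    )
    finished, still_hanging = ray.wait(
        unfinished, num_returns=len(unfinished), timeout=timeout
    )
    return finished, still_hanging
```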
python/ray/train/utils.py
Outdated
# If a failure occurs the ObjectRef will be marked as finished.
# Calling ray.get will expose the failure as a RayActorError.
for object_ref in finished:
    # Everything in finished has either failed or
or what 😨
Or completed successfully 😁...good catch, updated!
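A small self-contained sketch of that pattern, assuming standard Ray APIs: a ref from a failed worker still shows up as finished, and the failure only surfaces when the result is fetched. The ref_to_worker_index mapping is hypothetical.

```python
import ray
from ray.exceptions import RayActorError

def collect_dead_workers(finished, ref_to_worker_index):
    # Everything in `finished` has either failed or completed successfully;
    # ray.get on a ref from a dead actor raises RayActorError, which is how
    # the failed workers are identified.
    dead_worker_indexes = []
    for object_ref in finished:
        try:
            ray.get(object_ref)
        except RayActorError:
            dead_worker_indexes.append(ref_to_worker_index[object_ref])
    return dead_worker_indexes
```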
python/ray/train/utils.py
Outdated
# are alive, but hanging on collective calls because other
# workers have failed.
timeout = (
    REMAINING_WORKERS_GRACE_PERIOD_S if at_least_one_failed_worker else None
Change REMAINING_WORKERS_GRACE_PERIOD_S to support being set as an environment variable? Is 10 seconds too short?
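A minimal sketch of the suggested environment-variable override; reusing the constant's name for the variable and the 10-second default are assumptions:

```python
import os

# Let users tune the grace period without code changes, e.g. on clusters
# where healthy workers legitimately take longer than 10 seconds to return.
REMAINING_WORKERS_GRACE_PERIOD_S = float(
    os.environ.get("REMAINING_WORKERS_GRACE_PERIOD_S", 10)
)
```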
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
LGTM - I think you just need to pull in master to fix the lint issues.
python/ray/train/utils.py
Outdated
return False, dead_worker_indexes
else:
    return True, []
at_least_one_failed_worker = True
nit: Just return False here? Or remove the and not at_least_one_failed_worker check in line 48 if you want to print all failed workers.
good point, updated!
Various improvements to Ray Train fault tolerance.
- The default torch process group timeout is 30 minutes. If a failure occurs before a gradient synchronization, training will hang for 30 minutes before raising an error and triggering fault tolerance. This PR reduces the default timeout_s to 30 seconds (see the sketch below).
- Adds functionality to trigger fault tolerance even if any of the alive workers are hanging, by specifying a grace period for alive workers.
- Simplifies fault tolerance by removing the backend-specific handle_failure. If any workers have failed, all workers will be restarted and training will continue from the last checkpoint.
- Also adds a fault tolerance test with an actual torch example. When run locally, the test hangs before the fix and passes after it.
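As an illustration of the first change, a hedged sketch of overriding the process group timeout; the PR description confirms the timeout_s setting, while the import path and backend choice here are assumptions that depend on the Ray Train version in use:

```python
from ray.train.torch import TorchConfig

# With the old 30 minute default, a worker failure just before a gradient
# synchronization stalls every other worker for the full 30 minutes before
# fault tolerance can kick in; a 30 second timeout surfaces the error quickly
# so workers can be restarted from the last checkpoint.
torch_config = TorchConfig(backend="gloo", timeout_s=30)
```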
Why are these changes needed?
Related issue number
Checks
- I've run scripts/format.sh to lint the changes in this PR.