-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune] Only try to sync driver if sync_to_driver is actually enabled #19589
Conversation
@@ -418,7 +418,7 @@ def _sync_trial_checkpoint(self, trial: "Trial", checkpoint: Checkpoint): | |||
from ray.tune.durable_trainable import DurableTrainable | |||
|
|||
trial_syncer = self._get_trial_syncer(trial) | |||
if trial.sync_on_checkpoint: | |||
if trial.sync_on_checkpoint and self._sync_function is not False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I have definitely wondered about this.
why not just "self._sync_function == True:"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just self._sync_function
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the sync function can be many things (including None, callable, string) which are all valid.
I know this is confusing, and we're going to tackle this soon - but False
is a special case that is handled in get_node_syncer
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I did too!
I think it's self._sync_function can be a callable as well and in that case we want to trigger the following logic as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most confusing there is presumably that even if sync_to_driver=None
, we will get a valid syncer from get_node_syncer
. A refactor is overdue - however, I'll first make sure that the current logic is working and tested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Why are these changes needed?
Currently, we try to sync to driver even if
sync_to_driver=False
. A no-op is called, but the checkpoint is not detected afterwards (which is expected), so errors are thrown (which should not be the case).Additionally, because the callback fails, the checkpoint manager is not called, and remote checkpoints are not deleted. This leads to
keep_checkpoints_num
being disobeyed whensync_to_driver=False
.This functionality was caught and will be tested in a (wip) release test.
Related issue number
Closes #19192
Checks
scripts/format.sh
to lint the changes in this PR.