Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune] Only try to sync driver if sync_to_driver is actually enabled #19589

Merged
merged 2 commits into from
Oct 21, 2021

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented Oct 21, 2021

Why are these changes needed?

Currently, we try to sync to driver even if sync_to_driver=False. A no-op is called, but the checkpoint is not detected afterwards (which is expected), so errors are thrown (which should not be the case).

Additionally, because the callback fails, the checkpoint manager is not called, and remote checkpoints are not deleted. This leads to keep_checkpoints_num being disobeyed when sync_to_driver=False.

This functionality was caught and will be tested in a (wip) release test.

Related issue number

Closes #19192

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -418,7 +418,7 @@ def _sync_trial_checkpoint(self, trial: "Trial", checkpoint: Checkpoint):
from ray.tune.durable_trainable import DurableTrainable

trial_syncer = self._get_trial_syncer(trial)
if trial.sync_on_checkpoint:
if trial.sync_on_checkpoint and self._sync_function is not False:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I have definitely wondered about this.
why not just "self._sync_function == True:"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just self._sync_function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the sync function can be many things (including None, callable, string) which are all valid.

I know this is confusing, and we're going to tackle this soon - but False is a special case that is handled in get_node_syncer as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did too!
I think it's self._sync_function can be a callable as well and in that case we want to trigger the following logic as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most confusing there is presumably that even if sync_to_driver=None, we will get a valid syncer from get_node_syncer. A refactor is overdue - however, I'll first make sure that the current logic is working and tested

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@krfricke krfricke merged commit 15cdffe into ray-project:master Oct 21, 2021
@krfricke krfricke deleted the tune/delete-cps-remote branch October 21, 2021 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] [Tune] Checkpoint not found after successful sync down
4 participants