Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune] Only sync down from cloud if needed #26725

Merged
merged 6 commits into from
Jul 25, 2022

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented Jul 19, 2022

Signed-off-by: Kai Fricke [email protected]

Why are these changes needed?

Currently, trainables will try to sync up/down temporary checkpoints from cloud storage, leading to errors. These erros come up e.g. with PBT, which heavily uses saving/restoring from objects.

Instead, we should not sync these temporary checkpoints up at all, and we should generally not sync down if a local checkpoint directory exists, which will prevent us also from trying to sync down non-existent temporary checkpoint directories.

See #26714

Related issue number

Closes #26714

Related PR for 1.13.1: #26717

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@krfricke krfricke marked this pull request as ready for review July 25, 2022 13:31
@@ -487,7 +490,8 @@ def save(self, checkpoint_dir: Optional[str] = None) -> str:
TrainableUtil.write_metadata(checkpoint_dir, metadata)

# Maybe sync to cloud
self._maybe_save_to_cloud(checkpoint_dir)
if not prevent_upload:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about syncing to driver?
I recall seeing these tmp folders on driver...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, out of scope for this PR :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'll address that in a different PR - this PR actually makes PBT fail while syncing to driver is not great but at least the experiment will continue to run).

@xwjiang2010
Copy link
Contributor

let’s revisit this whole pbt/syncing business after the release push. Ideally I would like to also add a e2e test using the user's repro script, if it makes sense for us.

@krfricke krfricke merged commit 1b06e7a into ray-project:master Jul 25, 2022
@krfricke krfricke deleted the tune/maybe-load-from-cloud branch July 25, 2022 20:49
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Currently, trainables will try to sync up/down temporary checkpoints from cloud storage, leading to errors. These erros come up e.g. with PBT, which heavily uses saving/restoring from objects.

Instead, we should not sync these temporary checkpoints up at all, and we should generally not sync down if a local checkpoint directory exists, which will prevent us also from trying to sync down non-existent temporary checkpoint directories.

See ray-project#26714

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Rohan138 <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Currently, trainables will try to sync up/down temporary checkpoints from cloud storage, leading to errors. These erros come up e.g. with PBT, which heavily uses saving/restoring from objects.

Instead, we should not sync these temporary checkpoints up at all, and we should generally not sync down if a local checkpoint directory exists, which will prevent us also from trying to sync down non-existent temporary checkpoint directories.

See ray-project#26714

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ray component: Tune S3 sync failure on restore_from_object
3 participants