[Tune] Make ResultGrid return cloud checkpoints #31437

Merged

Yard1 (Member) commented Jan 4, 2023

Signed-off-by: Antoni Baum [email protected]

Why are these changes needed?

Checkpoints returned by ResultGrid never point to a cloud URI, even when cloud syncing is enabled. This PR fixes that by storing the information needed to turn a local path into a remote path in _TrackedCheckpoint, and applying it during conversion to an AIR checkpoint.
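
To illustrate the local-to-remote conversion, here is a minimal sketch (not the code from this PR). It assumes the checkpoint lives under the trial's local logdir and that the remote directory mirrors that layout. The function name `to_remote_path` and the example paths are hypothetical.

```python
import os
import posixpath


def to_remote_path(local_path: str, logdir: str, remote_checkpoint_dir: str) -> str:
    """Map a path under `logdir` to the matching path under a remote URI.

    Hypothetical helper for illustration; not part of the Ray API.
    """
    # Path of the checkpoint relative to the trial's local log directory.
    rel_path = os.path.relpath(local_path, logdir)
    if rel_path == os.pardir or rel_path.startswith(os.pardir + os.sep):
        raise ValueError(f"{local_path!r} is not inside {logdir!r}")

    base = remote_checkpoint_dir.rstrip("/")
    if rel_path == os.curdir:
        # The checkpoint path is the logdir itself.
        return base

    # URIs use forward slashes regardless of the local OS separator.
    return posixpath.join(base, rel_path.replace(os.sep, "/"))


remote = to_remote_path(
    "/tmp/exp/trial_0/checkpoint_000010",
    "/tmp/exp/trial_0",
    "s3://bucket/exp/trial_0",
)
print(remote)
```

This sketch does plain string joining, so it would not handle URIs that carry query parameters.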

An alternative approach would be to store remote_upload_dir and logdir in _TrackedCheckpoint instead of a function, or to have Trainable.save return a tuple of (local_path, remote_path) when the checkpoint is saved to the cloud.
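
The difference between storing a function and storing the two directories can be sketched as follows. Both classes are hypothetical stand-ins, not the real `_TrackedCheckpoint`, and the attribute and method names are assumptions made for illustration only.

```python
import posixpath
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TrackedCheckpointWithFn:
    """Hypothetical variant that stores a callable mapping local -> remote."""

    local_path: str
    local_to_remote_path_fn: Optional[Callable[[str], str]] = None

    def storage_path(self) -> str:
        # Without a mapping function, fall back to the local path.
        if self.local_to_remote_path_fn is None:
            return self.local_path
        return self.local_to_remote_path_fn(self.local_path)


@dataclass
class TrackedCheckpointWithDirs:
    """Hypothetical variant that stores the two directories as plain strings."""

    local_path: str
    logdir: Optional[str] = None
    remote_upload_dir: Optional[str] = None

    def storage_path(self) -> str:
        # Without both directories, fall back to the local path.
        if not (self.logdir and self.remote_upload_dir):
            return self.local_path
        rel_path = posixpath.relpath(self.local_path, self.logdir)
        return posixpath.join(self.remote_upload_dir.rstrip("/"), rel_path)


with_fn = TrackedCheckpointWithFn(
    "/tmp/exp/trial_0/checkpoint_000001",
    lambda p: p.replace("/tmp/exp", "s3://bucket/exp", 1),
)
with_dirs = TrackedCheckpointWithDirs(
    "/tmp/exp/trial_0/checkpoint_000001",
    logdir="/tmp/exp/trial_0",
    remote_upload_dir="s3://bucket/exp/trial_0",
)
print(with_fn.storage_path(), with_dirs.storage_path())
```

One consideration when choosing between the two: a stored callable keeps the mapping logic in one place, while plain strings are simpler to serialize (a lambda, for example, cannot be pickled with the standard `pickle` module).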

Related issue number

Closes #31001
Closes #28492

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

justinvyu (Contributor) left a comment

Looks good, I have a few suggested changes:

python/ray/tune/result_grid.py (outdated, resolved)
python/ray/tune/tests/test_result_grid.py (resolved)
Yard1 marked this pull request as draft January 4, 2023 22:49
Signed-off-by: Antoni Baum <[email protected]>
Yard1 marked this pull request as ready for review January 4, 2023 23:07

Yard1 (Member, Author) commented Jan 4, 2023

@justinvyu different approach, PTAL!

Signed-off-by: Antoni Baum <[email protected]>
Yard1 requested a review from justinvyu January 4, 2023 23:15
Signed-off-by: Antoni Baum <[email protected]>
justinvyu (Contributor) left a comment

LGTM, some nits:

@@ -185,6 +185,15 @@ def get_checkpoints_paths(logdir):
        )
        return chkpt_df

    @staticmethod
    def get_remote_storage_path(
        local_path: str, logdir: str, remote_checkpoint_dir: str
Contributor

Can we rename remote_checkpoint_dir to remote_logdir? Seems like two different concepts with the current naming but one is just the cloud version of the other.

Member Author

Wanted to use the same names as in Trial.

python/ray/tune/trainable/util.py (outdated, resolved)
python/ray/air/_internal/checkpoint_manager.py (outdated, resolved)
krfricke (Contributor) left a comment

LGTM, thanks!

krfricke merged commit 2e3d4bc into ray-project:master Jan 16, 2023
Yard1 deleted the fix_result_grid_checkpoint_cloud_path branch January 16, 2023 21:24
andreapiso pushed a commit to andreapiso/ray that referenced this pull request Jan 22, 2023
Signed-off-by: Antoni Baum <[email protected]>
Co-authored-by: Justin Yu <[email protected]>
Signed-off-by: Andrea Pisoni <[email protected]>