
[Tune] Add timeout for experiment checkpoint syncing to cloud #30855

Merged: 17 commits, Dec 7, 2022

Conversation

@justinvyu (Contributor) commented Dec 2, 2022

Why are these changes needed?

#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, to handle cases where the sync operation (by default, via pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing.
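For context, here is a minimal usage sketch of the knob involved, assuming the `sync_timeout` field that #28155 added to `tune.SyncConfig`; the trainable, experiment name, and bucket path are placeholders, and exact parameter names may differ between Ray versions:

```python
from ray import air, tune
from ray.air import session


def train_fn(config):
    # Trivial trainable; the interesting part is the sync configuration below.
    session.report({"score": 1})


# Sketch: sync_timeout is assumed to be the knob introduced in ray-project#28155,
# which this PR also applies to experiment checkpoint syncing.
sync_config = tune.SyncConfig(
    upload_dir="s3://my-bucket/tune-results",  # placeholder cloud path
    sync_period=300,    # attempt a sync-up at most every 5 minutes
    sync_timeout=1800,  # treat a sync as hung after 30 minutes
)

tuner = tune.Tuner(
    train_fn,
    run_config=air.RunConfig(name="sync_timeout_demo", sync_config=sync_config),
)
tuner.fit()
```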

Experiment checkpoints from TrialRunner.checkpoint are also synced to the cloud and can run into problems in two cases:

  1. If experiment state syncing is forced via trial_runner.checkpoint(force=True):
    • Problem 1: The experiment will hang on the next checkpoint, since syncer.wait() on the hanging process does not use a timeout on the thread join.
      • This PR introduces a timeout on the thread join, which unblocks once the process has been running for more than sync_timeout seconds.
  2. If experiment syncing is not forced (syncing up if needed every sync_period seconds):
    • Problem 2: The experiment will continue, but the sync process will keep hanging, and no new sync-up commands will be run.
      • This PR tracks the start time of the sync process; if the elapsed time exceeds sync_timeout at the next sync_period, a new sync-up process is started (see the sketch after this list).
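
To make the two behaviors concrete, here is a minimal, library-agnostic sketch; the `BackgroundSyncer` class, `sync_fn`, and return values are illustrative and not Ray's actual implementation, but the control flow mirrors the description above:

```python
import threading
import time


class BackgroundSyncer:
    """Illustrative sketch of the two timeout behaviors described above."""

    def __init__(self, sync_fn, sync_timeout: float = 1800.0):
        self._sync_fn = sync_fn          # callable that uploads to cloud storage
        self._sync_timeout = sync_timeout
        self._thread = None
        self._start_time = None

    def sync_up(self) -> bool:
        # Problem 2 fix: if a previous sync is still running past the timeout,
        # abandon it and launch a fresh sync instead of skipping forever.
        if self._thread is not None and self._thread.is_alive():
            if time.monotonic() - self._start_time < self._sync_timeout:
                return False  # previous sync still within budget; skip this round
            # else: considered hung, fall through and start a new sync
        self._thread = threading.Thread(target=self._sync_fn, daemon=True)
        self._start_time = time.monotonic()
        self._thread.start()
        return True

    def wait(self) -> None:
        # Problem 1 fix: bound the join so a hanging sync cannot block the
        # experiment's next checkpoint indefinitely.
        if self._thread is not None:
            self._thread.join(timeout=self._sync_timeout)
            if self._thread.is_alive():
                raise TimeoutError(
                    f"Sync did not finish within {self._sync_timeout} s"
                )
```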

Todo

  • Add some comments clarifying timeout behavior
  • Are test cases comprehensive enough?
  • Check if it's possible to set a timeout on the actual upload operation (with pyarrow)

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@justinvyu added the tune (Tune-related issues) and air labels Dec 2, 2022
@Yard1 (Member) left a comment:

LGTM

@krfricke (Contributor) left a comment:

Great!!

@krfricke krfricke merged commit ed5b9e5 into ray-project:master Dec 7, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
[Tune] Add timeout for experiment checkpoint syncing to cloud (ray-project#30855)

ray-project#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, in the case that the sync operation (default is with pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing.

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
[Tune] Add timeout for experiment checkpoint syncing to cloud (ray-project#30855)

ray-project#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, in the case that the sync operation (default is with pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing.

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: tmynn <[email protected]>