[tune/release] Wait for final experiment checkpoint sync to finish #31131

Merged
7 commits merged into ray-project:master on Dec 22, 2022

Conversation

@justinvyu justinvyu (Contributor) commented Dec 15, 2022

This PR includes fixes to deflake the tune_cloud_gcp_k8s_durable_upload release test: (1) waiting for the final experiment checkpoint sync to finish and (2) fixing the forced checkpointing frequency logic.

Why are these changes needed?

Problem causing the tune_cloud_gcp_k8s_durable_upload release test to be flaky:

Since ed5b9e5, sync commands run in daemon threads, so the final sync launched at the end of the Tune experiment (triggered by the interrupt) is not waited on before the process exits.
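For intuition, here is a minimal Python sketch of the failure mode and the fix; the `BackgroundSyncer` class and its `wait()` method are illustrative stand-ins, not the actual Tune syncer API:

```python
import threading
import time


class BackgroundSyncer:
    """Hypothetical stand-in for a syncer that launches sync commands in daemon threads."""

    def __init__(self):
        self._thread = None

    def sync_up(self, duration: float = 2.0):
        # Daemon threads do not keep the process alive: a sync launched right
        # before shutdown can be killed mid-transfer unless it is waited on.
        self._thread = threading.Thread(target=time.sleep, args=(duration,), daemon=True)
        self._thread.start()

    def wait(self, timeout: float = None):
        # Block until the in-flight sync finishes (or the timeout expires).
        if self._thread is not None:
            self._thread.join(timeout=timeout)


if __name__ == "__main__":
    syncer = BackgroundSyncer()
    syncer.sync_up()          # final sync launched at experiment end
    syncer.wait(timeout=60)   # without this wait, the process may exit first
```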

The other fix in this PR: the forced checkpoint frequency (based on num_to_keep) currently has an off-by-one error, so experiment checkpointing is not forced as frequently as it should be.
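As a rough illustration of where such an off-by-one can hide (this is a hypothetical helper, not the actual Tune code), the difference is between forcing an experiment checkpoint only after more than num_to_keep trial checkpoints have accumulated versus as soon as num_to_keep have:

```python
def should_force_experiment_checkpoint(checkpoints_since_last_force: int, num_to_keep: int) -> bool:
    # Buggy variant: `checkpoints_since_last_force > num_to_keep` waits one
    # trial checkpoint too long before forcing an experiment checkpoint.
    # Fixed variant: force as soon as `num_to_keep` checkpoints have accumulated,
    # so the persisted experiment state keeps pace with checkpoint deletion.
    return checkpoints_since_last_force >= num_to_keep
```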

TODO

  • Should the final experiment sync have a shorter timeout?

Related issue number

#30353

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@justinvyu justinvyu marked this pull request as ready for review December 16, 2022 01:45
@krfricke krfricke (Contributor) left a comment

Thanks!

Have you tested that both fixes separately resolve the issue?

@justinvyu justinvyu (Contributor, Author) commented Dec 16, 2022

> Thanks!
>
> Have you tested that both fixes separately resolve the issue?

Yes, I tested both fixes in isolation and they resolved the issue.

The forced-checkpoint-frequency fix deflaking this test seems to depend on num_to_keep being small. If num_to_keep were bumped up so forced checkpoints happened less often, the test would still rely on the other fix to make sure the final experiment sync goes through, as the example and replay sketch below show.

Ex: With num_to_keep = 3

  • Checkpoints 0, 2, 4 come in, triggering the first forced experiment checkpoint + sync.
  • Checkpoint 6 comes in, then the experiment gets interrupted without checkpointing one last time.
  • The persisted experiment state is at iteration 4, while the committed checkpoints are at 2, 4, 6.
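A tiny replay of this scenario (the counters and variable names are made up for illustration, not the actual Tune bookkeeping) shows why the final sync still matters:

```python
num_to_keep = 3
incoming_checkpoints = [0, 2, 4, 6]

persisted_experiment_state = None  # last iteration captured by a forced experiment checkpoint
since_last_force = 0

for ckpt in incoming_checkpoints:
    since_last_force += 1
    if since_last_force >= num_to_keep:
        persisted_experiment_state = ckpt  # forced experiment checkpoint + sync
        since_last_force = 0

# The experiment is interrupted here, before another forced checkpoint runs.
print(persisted_experiment_state)  # 4 -> lags behind committed checkpoint 6,
# so the final experiment sync at shutdown must be awaited to close the gap.
```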

@krfricke krfricke (Contributor) commented

Awesome, I'm running the release test now and will merge once it passes.

https://buildkite.com/ray-project/release-tests-pr/builds/24195

@krfricke krfricke merged commit 9842429 into ray-project:master Dec 22, 2022
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
…31131)

This PR includes fixes to deflake the `tune_cloud_gcp_k8s_durable_upload` release test, including (1) including a wait for the final experiment checkpoint sync to finish and (2) fixing forced checkpointing frequency logic.

Signed-off-by: Justin Yu <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…ay-project#31131)

This PR includes fixes to deflake the `tune_cloud_gcp_k8s_durable_upload` release test, including (1) including a wait for the final experiment checkpoint sync to finish and (2) fixing forced checkpointing frequency logic.

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: tmynn <[email protected]>