-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune/release] Wait for final experiment checkpoint sync to finish #31131
Conversation
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
…y_tune_cloud_gcp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Have you tested that both fixes separately resolve the issue?
Yes, I tested both fixes in isolation and they resolved the issue. The forced checkpoint frequency deflaking this test seems to be dependent on the Ex: With
|
…y_tune_cloud_gcp Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
…y_tune_cloud_gcp
Awesome, I'm running the release test now and will merge once it passes. https://buildkite.com/ray-project/release-tests-pr/builds/24195 |
…31131) This PR includes fixes to deflake the `tune_cloud_gcp_k8s_durable_upload` release test, including (1) including a wait for the final experiment checkpoint sync to finish and (2) fixing forced checkpointing frequency logic. Signed-off-by: Justin Yu <[email protected]>
…ay-project#31131) This PR includes fixes to deflake the `tune_cloud_gcp_k8s_durable_upload` release test, including (1) including a wait for the final experiment checkpoint sync to finish and (2) fixing forced checkpointing frequency logic. Signed-off-by: Justin Yu <[email protected]> Signed-off-by: tmynn <[email protected]>
This PR includes fixes to deflake the
tune_cloud_gcp_k8s_durable_upload
release test, including (1) including a wait for the final experiment checkpoint sync to finish and (2) fixing forced checkpointing frequency logic.Why are these changes needed?
Problem causing the
tune_cloud_gcp_k8s_durable_upload
release test to be flaky:Sync commands are now run as daemon threads from ed5b9e5, and they don't wait to be finished when launched at the end of the Tune experiment (triggered by the interrupt).
Another fix this PR includes is: the forced checkpoint frequency (based on
num_to_keep
) has a one off error right now, so it's not forcing experiment checkpointing as frequently as it should.TODO
Related issue number
#30353
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.