-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[release] tune_cloud_gcp_k8s_durable_upload is flaky #30353
Comments
dropping release blocker. The test is flaky and back to green. Tracking it as normal flaky release tests. |
@xwjiang2010 is this still failing? |
Still flaky, I am looking into this. |
Recent preset dashboard for this test: The first 3 out of 5 of these failures (12/7-12/9) were due to rllib import errors. 12/13, 12/14 are both the same issue described below, and we also see same issue in some PR release test runs (flaky example here). Issue 2: Experiment checkpoint is not fresh enough compared to committed trial checkpoints found in the cloud.
See logs:
Contents of cloud trial directories of what was committed before experiment interrupt:
Experiment state pulled from the cloud:
Note, this is a different issue than the failing test in the original issue description (I've updated the description with the trace/issue seen when that test failed ~1 month ago). I have not seen the first issue pop up recently. |
Looks like this is resolved |
For this failed run:
https://buildkite.com/ray-project/release-tests-branch/builds/1198#018481e3-f290-48ea-bf97-bf172f448c97
Issue 1: Trial is hanging on iteration 1 -> no checkpoints ever occurred after the first one.
The text was updated successfully, but these errors were encountered: