[release] tune_cloud_gcp_k8s_durable_upload is flaky #30353

Closed
xwjiang2010 opened this issue Nov 16, 2022 · 5 comments
xwjiang2010 commented Nov 16, 2022

For this failed run:
https://buildkite.com/ray-project/release-tests-branch/builds/1198#018481e3-f290-48ea-bf97-bf172f448c97

```
Traceback (most recent call last):
  File "workloads/run_cloud_test.py", line 1305, in <module>
    raise err
  File "workloads/run_cloud_test.py", line 1287, in <module>
    remote_tune_script,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 462, in _get
    raise err
types.RayTaskError(AssertionError): ray::_run_test() (pid=1004, ip=172.18.1.99)
  File "workloads/run_cloud_test.py", line 1239, in _run_test
  File "workloads/run_cloud_test.py", line 1175, in test_durable_upload
  File "workloads/run_cloud_test.py", line 386, in run_resume_flow
  File "workloads/run_cloud_test.py", line 1117, in between_experiments
  File "workloads/run_cloud_test.py", line 787, in assert_checkpoint_count
AssertionError: Trial 417a5_00003 was not on the driver, but did not observe the expected amount of checkpoints (1 != 2, skipped=0, max_additional=1).
```

[Screenshot attached]

Issue 1: The trial is hanging on iteration 1, so no checkpoints ever occurred after the first one.

@xwjiang2010 self-assigned this Nov 16, 2022
@xwjiang2010 added the release-blocker and r2.2-failure labels Nov 16, 2022
@matthewdeng added the P0 label Nov 18, 2022
@xwjiang2010 added the air and flaky-tracker labels and removed the P0, release-blocker, and r2.2-failure labels Nov 18, 2022
@xwjiang2010 (author) commented:

Dropping release blocker. The test is flaky but back to green. Tracking it as a normal flaky release test.

@matthewdeng commented:

@xwjiang2010 is this still failing?

@justinvyu self-assigned this Dec 14, 2022
@justinvyu commented:

Recent preset dashboard for this test:

[Screenshot of the dashboard]

The first 3 of these 5 failures (12/7-12/9) were due to rllib import errors.

The 12/13 and 12/14 failures are both the same issue described below, and we also see the same issue in some PR release test runs (flaky example here).

Issue 2: Experiment checkpoint is not fresh enough compared to committed trial checkpoints found in the cloud.

- The reason this started being flaky is that ed5b9e5 changed sync commands to run as daemon threads in the background.
  - In the release test, the experiment gets interrupted, then Tune tries to sync the experiment state one last time: it waits for the previous sync to finish, launches a new one, then exits. (Note that it doesn't explicitly wait for the final launched sync operation to actually finish.)
  - Syncs used to run on non-daemon threads, which would actually finish, because the Python process doesn't exit until all non-daemon threads finish executing. After the change, the final upload doesn't finish running, so the experiment directory ends up out of sync. (See the threading sketch after this list.)
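
A minimal standalone illustration of the daemon-thread behavior described above (plain Python `threading`, not Tune's actual syncer code; the function name is made up for the example):

```python
import threading
import time


def final_experiment_sync():
    # Stands in for the last experiment-state upload that Tune launches at shutdown.
    time.sleep(2)
    print("final sync finished")  # never printed when the thread is a daemon


# daemon=True mirrors the behavior after ed5b9e5: the interpreter exits without
# waiting for this thread, so the upload can be cut off mid-flight.
# With daemon=False (the old behavior), Python waits for the thread at exit,
# and the final upload completes before the process terminates.
threading.Thread(target=final_experiment_sync, daemon=True).start()

print("main thread exiting without waiting for the final sync")
```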

See logs:

```
Traceback (most recent call last):
  File "workloads/run_cloud_test.py", line 1305, in <module>
    raise err
  File "workloads/run_cloud_test.py", line 1287, in <module>
    remote_tune_script,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 462, in _get
    raise err
types.RayTaskError(AssertionError): ray::_run_test() (pid=1329, ip=172.18.1.35)
  File "workloads/run_cloud_test.py", line 1239, in _run_test
  File "workloads/run_cloud_test.py", line 1175, in test_durable_upload
  File "workloads/run_cloud_test.py", line 386, in run_resume_flow
  File "workloads/run_cloud_test.py", line 1127, in between_experiments
  File "workloads/run_cloud_test.py", line 787, in assert_checkpoint_count
AssertionError: Trial 1dd06_00002 was not on the driver, but did not observe the expected amount of checkpoints (1 != 2, skipped=1, max_additional=2).
```

```
(_run_test pid=1329) Skipping unobserved checkpoint: /tmp/tune_cloud_testtbig_ncc/test_1671048088/cloud_durable_upload/fn_trainable_1dd06_00002_2_score_multiplied=81_2022-12-14_12-01-40/checkpoint_000014 as 14 > 13
(_run_test pid=1329) Skipping unobserved checkpoint: /tmp/tune_cloud_testtbig_ncc/test_1671048088/cloud_durable_upload/fn_trainable_1dd06_00003_3_score_multiplied=22_2022-12-14_12-01-40/checkpoint_000014 as 14 > 13
```

Contents of the cloud trial directories (what was committed before the experiment was interrupted):

Trial 0: `checkpoint_000012/` and `checkpoint_000014/`
Trial 1: `checkpoint_000012/` and `checkpoint_000014/`
Trial 2: `checkpoint_000012/` and `checkpoint_000014/`
Trial 3: `checkpoint_000012/` and `checkpoint_000014/`

Experiment state pulled from the cloud:

Trial 0:
internal_iter : 14

Trial 1:
internal_iter : 14

Trial 2:
internal_iter : 13 <- Lagging by 1

Trial 3:
internal_iter : 13 <- Lagging by 1
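
To make the failure concrete, here is a rough sketch of the kind of check involved. It reuses the `assert_checkpoint_count` name from the traceback, but the signature and tolerance logic are assumptions for illustration, not the actual `run_cloud_test.py` code:

```python
def assert_checkpoint_count(
    cloud_checkpoint_iters, last_synced_iter, expected=2, max_additional=2
):
    # Checkpoints newer than what the (stale) experiment state knows about are
    # treated as "unobserved" and skipped, mirroring the
    # "Skipping unobserved checkpoint ... as 14 > 13" log lines above.
    observed = [i for i in cloud_checkpoint_iters if i <= last_synced_iter]
    skipped = len(cloud_checkpoint_iters) - len(observed)
    assert expected <= len(observed) <= expected + max_additional, (
        f"did not observe the expected amount of checkpoints "
        f"({len(observed)} != {expected}, skipped={skipped}, "
        f"max_additional={max_additional})"
    )


# Trials 0 and 1: experiment state is fresh (iter 14), so both checkpoints count.
assert_checkpoint_count([12, 14], last_synced_iter=14)

# Trials 2 and 3: experiment state lags at iter 13, so checkpoint_000014 is
# skipped and only one checkpoint is observed -> AssertionError
# (1 != 2, skipped=1, max_additional=2), matching the failure above.
assert_checkpoint_count([12, 14], last_synced_iter=13)
```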

Note that this is a different issue from the failing test in the original issue description (I've updated the description with the trace/issue seen when that test failed ~1 month ago). I have not seen the first issue pop up recently.

@krfricke added the triage label Jan 18, 2023
@krfricke commented:

Looks like this is resolved.
