[release] tune_cloud_gcp_k8s_durable_upload is flaky #30353

Closed
xwjiang2010 opened this issue Nov 16, 2022 · 5 comments
xwjiang2010 commented Nov 16, 2022

For this failed run:
https://buildkite.com/ray-project/release-tests-branch/builds/1198#018481e3-f290-48ea-bf97-bf172f448c97

```
Traceback (most recent call last):
  File "workloads/run_cloud_test.py", line 1305, in <module>
    raise err
  File "workloads/run_cloud_test.py", line 1287, in <module>
    remote_tune_script,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 462, in _get
    raise err
types.RayTaskError(AssertionError): ray::_run_test() (pid=1004, ip=172.18.1.99)
  File "workloads/run_cloud_test.py", line 1239, in _run_test
  File "workloads/run_cloud_test.py", line 1175, in test_durable_upload
  File "workloads/run_cloud_test.py", line 386, in run_resume_flow
  File "workloads/run_cloud_test.py", line 1117, in between_experiments
  File "workloads/run_cloud_test.py", line 787, in assert_checkpoint_count
AssertionError: Trial 417a5_00003 was not on the driver, but did not observe the expected amount of checkpoints (1 != 2, skipped=0, max_additional=1).
```

[Screenshot attached]

Issue 1: The trial is hanging on iteration 1, so no checkpoints ever occurred after the first one.

@xwjiang2010 self-assigned this Nov 16, 2022
@xwjiang2010 added the release-blocker and r2.2-failure labels Nov 16, 2022
@matthewdeng added the P0 label Nov 18, 2022
@xwjiang2010 added the air and flaky-tracker labels and removed the P0, release-blocker, and r2.2-failure labels Nov 18, 2022
@xwjiang2010 (author) commented:

Dropping release blocker. The test is flaky but back to green. Tracking it as a normal flaky release test.

@matthewdeng commented:

@xwjiang2010 is this still failing?

@justinvyu self-assigned this Dec 14, 2022
@justinvyu commented:

Recent preset dashboard for this test:

[Screenshot of the dashboard]

The first 3 of these 5 failures (12/7-12/9) were due to rllib import errors.

The 12/13 and 12/14 failures are both the same issue described below, and we also see the same issue in some PR release test runs (flaky example here).

Issue 2: Experiment checkpoint is not fresh enough compared to committed trial checkpoints found in the cloud.

- The reason this started being flaky is that ed5b9e5 changed sync commands to run as daemon threads in the background.
  - In the release test, the experiment gets interrupted, then Tune tries to sync the experiment state one last time: it waits for the previous sync to finish, launches a new one, then exits. (Note that it doesn't explicitly wait for the final launched sync operation to actually finish.)
  - Syncs used to run on non-daemon threads, which would actually finish, because the Python process doesn't exit until all non-daemon threads finish executing. After the change, the final upload doesn't finish running, so the experiment directory ends up out of sync. (See the threading sketch after this list.)
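
A minimal standalone illustration of the daemon-thread behavior described above (plain Python `threading`, not Tune's actual syncer code; the function name is made up for the example):

```python
import threading
import time


def final_experiment_sync():
    # Stands in for the last experiment-state upload that Tune launches at shutdown.
    time.sleep(2)
    print("final sync finished")  # never printed when the thread is a daemon


# daemon=True mirrors the behavior after ed5b9e5: the interpreter exits without
# waiting for this thread, so the upload can be cut off mid-flight.
# With daemon=False (the old behavior), Python waits for the thread at exit,
# and the final upload completes before the process terminates.
threading.Thread(target=final_experiment_sync, daemon=True).start()

print("main thread exiting without waiting for the final sync")
```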

See logs:

```
Traceback (most recent call last):
  File "workloads/run_cloud_test.py", line 1305, in <module>
    raise err
  File "workloads/run_cloud_test.py", line 1287, in <module>
    remote_tune_script,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 462, in _get
    raise err
types.RayTaskError(AssertionError): ray::_run_test() (pid=1329, ip=172.18.1.35)
  File "workloads/run_cloud_test.py", line 1239, in _run_test
  File "workloads/run_cloud_test.py", line 1175, in test_durable_upload
  File "workloads/run_cloud_test.py", line 386, in run_resume_flow
  File "workloads/run_cloud_test.py", line 1127, in between_experiments
  File "workloads/run_cloud_test.py", line 787, in assert_checkpoint_count
AssertionError: Trial 1dd06_00002 was not on the driver, but did not observe the expected amount of checkpoints (1 != 2, skipped=1, max_additional=2).
```

```
(_run_test pid=1329) Skipping unobserved checkpoint: /tmp/tune_cloud_testtbig_ncc/test_1671048088/cloud_durable_upload/fn_trainable_1dd06_00002_2_score_multiplied=81_2022-12-14_12-01-40/checkpoint_000014 as 14 > 13
(_run_test pid=1329) Skipping unobserved checkpoint: /tmp/tune_cloud_testtbig_ncc/test_1671048088/cloud_durable_upload/fn_trainable_1dd06_00003_3_score_multiplied=22_2022-12-14_12-01-40/checkpoint_000014 as 14 > 13
```

Contents of the cloud trial directories (what was committed before the experiment was interrupted):

Trial 0: `checkpoint_000012/` and `checkpoint_000014/`
Trial 1: `checkpoint_000012/` and `checkpoint_000014/`
Trial 2: `checkpoint_000012/` and `checkpoint_000014/`
Trial 3: `checkpoint_000012/` and `checkpoint_000014/`

Experiment state pulled from the cloud:

Trial 0:
internal_iter : 14

Trial 1:
internal_iter : 14

Trial 2:
internal_iter : 13 <- Lagging by 1

Trial 3:
internal_iter : 13 <- Lagging by 1
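
To make the failure concrete, here is a rough sketch of the kind of check involved. It reuses the `assert_checkpoint_count` name from the traceback, but the signature and tolerance logic are assumptions for illustration, not the actual `run_cloud_test.py` code:

```python
def assert_checkpoint_count(
    cloud_checkpoint_iters, last_synced_iter, expected=2, max_additional=2
):
    # Checkpoints newer than what the (stale) experiment state knows about are
    # treated as "unobserved" and skipped, mirroring the
    # "Skipping unobserved checkpoint ... as 14 > 13" log lines above.
    observed = [i for i in cloud_checkpoint_iters if i <= last_synced_iter]
    skipped = len(cloud_checkpoint_iters) - len(observed)
    assert expected <= len(observed) <= expected + max_additional, (
        f"did not observe the expected amount of checkpoints "
        f"({len(observed)} != {expected}, skipped={skipped}, "
        f"max_additional={max_additional})"
    )


# Trials 0 and 1: experiment state is fresh (iter 14), so both checkpoints count.
assert_checkpoint_count([12, 14], last_synced_iter=14)

# Trials 2 and 3: experiment state lags at iter 13, so checkpoint_000014 is
# skipped and only one checkpoint is observed -> AssertionError
# (1 != 2, skipped=1, max_additional=2), matching the failure above.
assert_checkpoint_count([12, 14], last_synced_iter=13)
```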

Note that this is a different issue from the failing test in the original issue description (I've updated the description with the trace/issue seen when that test failed ~1 month ago). I have not seen the first issue pop up recently.

@krfricke added the triage label Jan 18, 2023
@krfricke commented:

Looks like this is resolved.
