[data] release test failure : pipelined_ingestion_1500_gb #33846
Comments
I was able to reproduce it with the full-scale version (915 files) of this test: https://console.anyscale-staging.com/o/anyscale-internal/workspaces/expwrk_c29icr8mng8ts8u1d2dagg8tt6/ses_kf2dra2s6xzdti2in5svlx3ldp?command-history-section=command_history I think the issue is likely that the consumers were dead:
Taking
Ray was able to bring the consumers back, but they were then out of sync with the rest of the consumers, so the DatasetPipeline got stuck and then timed out.
Reproduced in bulk mode, which also saw high memory usage, and a node got killed.
One hypothesis is that the new dataset iterator doesn't do eager object GC.
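To make the hypothesis concrete, here is a minimal pure-Python sketch of the difference between an iterator that keeps references to already-consumed blocks and one that drops them eagerly. It is not Ray code: `Block`, `iterate_holding_refs`, `iterate_eager_release`, and `live_counts` are hypothetical names made up for illustration, and the immediate reclamation shown assumes CPython's reference counting.

```python
import weakref


class Block:
    """Stand-in for a materialized data block (hypothetical, not a Ray type)."""

    def __init__(self, index):
        self.index = index


def iterate_holding_refs(blocks):
    """Yield block contents while the source list keeps every block alive."""
    for block in blocks:
        yield block.index


def iterate_eager_release(blocks):
    """Yield block contents, dropping each block's last reference first."""
    while blocks:
        block = blocks.pop(0)
        index = block.index
        del block  # release the block before handing control to the consumer
        yield index


def live_counts(iterate, num_blocks=5):
    """Record how many blocks are still alive after each consumed item."""
    live = weakref.WeakSet()  # tracks blocks without keeping them alive
    blocks = []
    for i in range(num_blocks):
        b = Block(i)
        live.add(b)
        blocks.append(b)
    del b  # the list is now the only strong reference holder
    counts = []
    for _ in iterate(blocks):
        counts.append(len(live))
    return counts


if __name__ == "__main__":
    print("holding refs: ", live_counts(iterate_holding_refs))
    print("eager release:", live_counts(iterate_eager_release))
```

Under these assumptions, the first variant keeps all blocks alive until iteration finishes, while the second lets the count of live blocks shrink as the consumer advances. If the hypothesis holds, the real iterator would behave more like the first variant, which would be consistent with the high memory usage seen above.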
This fix had a successful run: https://buildkite.com/ray-project/release-tests-pr/builds/33701
@jianoaix I see this is closed. Can we open a cherry-pick, since this is a release blocker?
Thanks!
What happened + What you expected to happen
Looks like it timed out:
[ERROR 2023-03-28 19:02:48,645] run_release_test.py: 164 Command timed out after 9600.080104424998 seconds.
Traceback (most recent call last):
  File "ray_release/scripts/run_release_test.py", line 160, in main
    no_terminate=no_terminate,
  File "/tmp/release-a8qe7WenMw/release/ray_release/glue.py", line 488, in run_release_test
    raise pipeline_exception
  File "/tmp/release-a8qe7WenMw/release/ray_release/glue.py", line 372, in run_release_test
    raise e
  File "/tmp/release-a8qe7WenMw/release/ray_release/glue.py", line 364, in run_release_test
    raise_on_timeout=not is_long_running,
  File "/tmp/release-a8qe7WenMw/release/ray_release/command_runner/anyscale_job_runner.py", line 269, in run_command
    job_status_code, error, raise_on_timeout=raise_on_timeout
  File "/tmp/release-a8qe7WenMw/release/ray_release/command_runner/anyscale_job_runner.py", line 174, in _handle_command_output
    f"Command timed out after {workload_time_taken} seconds."
ray_release.exception.TestCommandTimeout: Command timed out after 9600.080104424998 seconds.
Versions / Dependencies
master
Reproduction script
https://buildkite.com/ray-project/release-tests-branch/builds/1493#01872a7e-b392-4a66-98ea-21091dc3636f
Issue Severity
None