
[release][CI] dataset_shuffle_random_shuffle_1tb failed #29294

Closed

rickyyx opened this issue Oct 13, 2022 · 17 comments
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P0 (Issues that should be fixed in short order)
rickyyx (Contributor) commented Oct 13, 2022

What happened + What you expected to happen

Build

Cluster


Traceback (most recent call last):
  File "dataset/sort.py", line 165, in <module>
    raise exc
  File "dataset/sort.py", line 117, in <module>
    ds = ds.random_shuffle()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 857, in random_shuffle
    return Dataset(plan, self._epoch, self._lazy)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 219, in __init__
    self._plan.execute(allow_clear_input_blocks=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 310, in execute
    blocks, clear_input_blocks, self._run_by_consumer
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 765, in __call__
    blocks, clear_input_blocks, self.block_udf, self.ray_remote_args
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/stage_impl.py", line 118, in do_shuffle
    reduce_ray_remote_args=remote_args,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/shuffle.py", line 117, in execute
    new_metadata = reduce_bar.fetch_until_complete(list(new_metadata))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/progress_bar.py", line 75, in fetch_until_complete
    for ref, result in zip(done, ray.get(done)):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2289, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::reduce() (pid=812, ip=172.31.116.182)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object a634bb3b3ca530f0ffffffffffffffffffffffff0200000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

Fetch for object a634bb3b3ca530f0ffffffffffffffffffffffff0200000002000000 timed out because no locations were found for the object. This may indicate a system-level bug.
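
(As the message above suggests, the creation site of the missing ObjectRef can be recorded by setting RAY_record_ref_creation_sites=1 before Ray starts. A minimal sketch of how that could look for a local repro attempt; the tiny dataset is only an illustrative stand-in for the 1TB release workload:)

    # Sketch: record ObjectRef creation sites, per the error message above.
    # On a cluster, the env var must also be set before `ray start` on every node.
    import os
    os.environ["RAY_record_ref_creation_sites"] = "1"

    import ray
    ray.init()

    # Illustrative, scaled-down stand-in for the dataset/sort.py shuffle workload.
    ds = ray.data.range(1_000_000).random_shuffle()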

But it looks like this might be the root cause:

worker.py:1839 -- The node with node id: cfa8ba8a1da67a2ca0d324cfa1f25379459d4e4595534bc9339d572f and address: 172.31.127.55 and node name: 172.31.127.55 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a        (1) raylet crashes unexpectedly (OOM, preempted node, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.

Shuffle Map:  58%|█████▊    | 583/1000 [13:12<16:19,  2.35s/it]  2022-10-12 13:13:42,685        WARNING worker.py:1839 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 80725d1404687f199f85f783b5726797df932a6b02000000 Worker ID: f5ad619ac9eb377ba1940e4aeee99e325b05a9974c0e9294bac3cb2c Node ID: 333d58f0360b87d85caac78f08855930c3209c04eb5f819d2ee63610 Worker IP address: 172.31.119.129 Worker port: 10005 Worker PID: 1049 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(raylet, ip=172.31.127.55) [2022-10-12 13:13:42,754 C 79 132] (raylet) node_manager.cc:173: This node has beem marked as dead.
(raylet, ip=172.31.127.55) *** StackTrace Information ***
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49bd1a) [0x563db3153d1a] ray::operator<<()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49d7f2) [0x563db31557f2] ray::SpdLogMessage::Flush()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49db07) [0x563db3155b07] ray::RayLog::~RayLog()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x242464) [0x563db2efa464] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x375bc4) [0x563db302dbc4] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3cabd0) [0x563db3082bd0] ray::rpc::GcsRpcClient::ReportHeartbeat()::{lambda()#2}::operator()()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x373a32) [0x563db302ba32] ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x2290b5) [0x563db2ee10b5] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47fb46) [0x563db3137b46] EventTracker::RecordExecution()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x42030e) [0x563db30d830e] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x420786) [0x563db30d8786] boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9ada0b) [0x563db3665a0b] boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9af1d1) [0x563db36671d1] boost::asio::detail::scheduler::run()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9af400) [0x563db3667400] boost::asio::io_context::run()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9fe110) [0x563db36b6110] execute_native_thread_routine
(raylet, ip=172.31.127.55) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7ff78c649609] start_thread
(raylet, ip=172.31.127.55) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff78c218133] __clone
(raylet, ip=172.31.127.55) 

Versions / Dependencies

Release 2.1 master

Reproduction script

NA

Issue Severity

No response

rickyyx added the bug, release-blocker, P0, triage, and r2.1-failure labels on Oct 13, 2022
rickyyx added this to the Core Nightly/CI Regressions milestone on Oct 13, 2022
scv119 (Contributor) commented Oct 13, 2022

For this test, it seems the raylet gets overloaded and node failures are not uncommon; however, the test could previously tolerate these node failures (most likely mitigated by lineage reconstruction).
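
(For context, a minimal sketch of the lineage-reconstruction behavior being referred to: when an object is lost along with a dead node, Ray can re-execute the task that produced it, bounded by the task's max_retries. The task names below are illustrative, not the release test's code.)

    # Sketch: lineage reconstruction. If a node dies and the output of
    # `make_block` is lost, Ray re-runs the task to rebuild the object
    # (up to max_retries), so the downstream consumer can still fetch it.
    import ray

    ray.init()

    @ray.remote(max_retries=3)
    def make_block(i):
        return [i] * 1000

    @ray.remote
    def consume(block):
        return len(block)

    refs = [make_block.remote(i) for i in range(10)]
    print(ray.get([consume.remote(r) for r in refs]))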

scv119 (Contributor) commented Oct 14, 2022

@clarkzinzow I'm a bit overloaded right now (5 release blockers in total). Mind taking this one?

scv119 assigned clarkzinzow and unassigned scv119 on Oct 14, 2022
rickyyx (Contributor, Author) commented Oct 17, 2022

Might be due to the agent memory leak (#29199).

clarkzinzow (Contributor) commented:

I've been looking into this, but taking a step back, it looks like this release test has not had a good success rate since August: https://b534fd88.us1a.app.preset.io/superset/dashboard/19/?native_filters_key=V5NMHssrZzj86978-asD6zeCWfwL9FF61U2d4tZ_PE3wDDt6-2hTI6MMw_UcRjIK

clarkzinzow (Contributor) commented Oct 18, 2022

The most recent failure looks to be due to a failure to connect to the GCS, which makes it somewhat more likely that the agent memory leak caused the head node to get OOM-killed: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_GhqMXjBcR8nutFSsTCc8Z7yp?command-history-section=command_history

    [2022-10-17 16:29:03,538 C 243 243] (raylet) gcs_rpc_client.h:537:  Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
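
(The fatal check above is the raylet giving up after failing to reach the GCS for gcs_rpc_server_reconnect_timeout_s, 60 seconds here. If more headroom were wanted purely for debugging, that value is an internal system config; a hedged sketch, noting that _system_config is an unsupported internal knob and only takes effect when the head node starts:)

    # Hedged sketch: raise the GCS reconnect timeout named in the check above.
    # `_system_config` is internal/unsupported and must be supplied when the
    # cluster (head node) is started.
    import ray

    ray.init(_system_config={"gcs_rpc_server_reconnect_timeout_s": 300})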

rkooo567 (Contributor) commented:

I think it is unlikely to be related to the agent leak. The agent leak is not that fast: the test fails within 8 minutes, and the agent leaks roughly 1 GB/h, so only on the order of ~130 MB would have leaked by then.

clarkzinzow (Contributor) commented:

@rkooo567 agreed!

c21 (Contributor) commented Oct 19, 2022

This test looks flaky; we got a successful run on the release branch later: https://buildkite.com/ray-project/release-tests-branch/builds/1123#0183d941-4949-4929-b3c8-233d0ca816e1. Removed as release blocker.

c21 removed the release-blocker, P0, and r2.1-failure labels on Oct 19, 2022
rickyyx (Contributor, Author) commented Oct 25, 2022

Should we declare this unstable? cc @scv119

jjyao (Collaborator) commented Oct 27, 2022

@scv119 I think it should be P0 for core in 2.2.

clarng (Contributor) commented Oct 31, 2022

Should we try the OOM killer if the node is timing out due to OOM / freezing?

scv119 (Contributor) commented Nov 1, 2022

Can the data team take over this test? @c21 @matthewdeng
After some investigation, I think this test regression is highly correlated with our infra memory capacity change: we now allocate only 90% of host memory to the Ray container, which increases the likelihood of Ray running into OOM issues.
As evidence, a commit (74f28f9) that was previously stable also fails in my recent tests: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_trJuhJfC4WH38x5ziRuzsxCz?command-history-section=command_history

c21 (Contributor) commented Nov 1, 2022

Hi @scv119 - yes, I think the data team should own it. We should work with core to come up with action items.

jjyao (Collaborator) commented Nov 1, 2022

Feels like we should try the OOM killer or use a bigger machine?
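
(A hedged sketch of what enabling the memory monitor / OOM killer might look like; the env var names below are my assumption for this Ray version and should be verified against the docs, and on a cluster they would need to be set before `ray start` on every node:)

    # Assumed env var names (verify for the Ray version in use): enable Ray's
    # memory monitor so its OOM killer preempts tasks before the OS OOM killer
    # takes down the raylet or worker processes.
    import os
    os.environ["RAY_memory_monitor_refresh_ms"] = "250"  # 0 disables the monitor
    os.environ["RAY_memory_usage_threshold"] = "0.9"     # act at 90% of node memory

    import ray
    ray.init()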

clarng added the P0 label on Nov 4, 2022
clarng added the core and data labels and removed the triage label on Nov 4, 2022
clarng (Contributor) commented Nov 7, 2022

@c21 do you mind taking the action item? It seems Clark already has another P0.

clarng removed the core label on Nov 7, 2022
c21 (Contributor) commented Nov 8, 2022

@clarng - sure, I will take a look later.

clarng assigned clarng and unassigned c21 and clarkzinzow on Nov 14, 2022