
[release][CI] dataset_shuffle_random_shuffle_1tb failed #29294

Closed

rickyyx opened this issue Oct 13, 2022 · 17 comments
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P0 (Issues that should be fixed in short order)
rickyyx (Contributor) commented Oct 13, 2022

What happened + What you expected to happen

Build

Cluster


Traceback (most recent call last):
  File "dataset/sort.py", line 165, in <module>
    raise exc
  File "dataset/sort.py", line 117, in <module>
    ds = ds.random_shuffle()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 857, in random_shuffle
    return Dataset(plan, self._epoch, self._lazy)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 219, in __init__
    self._plan.execute(allow_clear_input_blocks=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 310, in execute
    blocks, clear_input_blocks, self._run_by_consumer
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 765, in __call__
    blocks, clear_input_blocks, self.block_udf, self.ray_remote_args
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/stage_impl.py", line 118, in do_shuffle
    reduce_ray_remote_args=remote_args,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/shuffle.py", line 117, in execute
    new_metadata = reduce_bar.fetch_until_complete(list(new_metadata))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/progress_bar.py", line 75, in fetch_until_complete
    for ref, result in zip(done, ray.get(done)):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2289, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::reduce() (pid=812, ip=172.31.116.182)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object a634bb3b3ca530f0ffffffffffffffffffffffff0200000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

Fetch for object a634bb3b3ca530f0ffffffffffffffffffffffff0200000002000000 timed out because no locations were found for the object. This may indicate a system-level bug.
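
(As the message above suggests, the creation site of the missing ObjectRef can be recorded by setting RAY_record_ref_creation_sites=1 before Ray starts. A minimal sketch of how that could look for a local repro attempt; the tiny dataset is only an illustrative stand-in for the 1TB release workload:)

    # Sketch: record ObjectRef creation sites, per the error message above.
    # On a cluster, the env var must also be set before `ray start` on every node.
    import os
    os.environ["RAY_record_ref_creation_sites"] = "1"

    import ray
    ray.init()

    # Illustrative, scaled-down stand-in for the dataset/sort.py shuffle workload.
    ds = ray.data.range(1_000_000).random_shuffle()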

But it looks like this might be the root cause:

worker.py:1839 -- The node with node id: cfa8ba8a1da67a2ca0d324cfa1f25379459d4e4595534bc9339d572f and address: 172.31.127.55 and node name: 172.31.127.55 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a        (1) raylet crashes unexpectedly (OOM, preempted node, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.

Shuffle Map:  58%|█████▊    | 583/1000 [13:12<16:19,  2.35s/it]  2022-10-12 13:13:42,685        WARNING worker.py:1839 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 80725d1404687f199f85f783b5726797df932a6b02000000 Worker ID: f5ad619ac9eb377ba1940e4aeee99e325b05a9974c0e9294bac3cb2c Node ID: 333d58f0360b87d85caac78f08855930c3209c04eb5f819d2ee63610 Worker IP address: 172.31.119.129 Worker port: 10005 Worker PID: 1049 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(raylet, ip=172.31.127.55) [2022-10-12 13:13:42,754 C 79 132] (raylet) node_manager.cc:173: This node has beem marked as dead.
(raylet, ip=172.31.127.55) *** StackTrace Information ***
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49bd1a) [0x563db3153d1a] ray::operator<<()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49d7f2) [0x563db31557f2] ray::SpdLogMessage::Flush()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49db07) [0x563db3155b07] ray::RayLog::~RayLog()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x242464) [0x563db2efa464] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x375bc4) [0x563db302dbc4] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3cabd0) [0x563db3082bd0] ray::rpc::GcsRpcClient::ReportHeartbeat()::{lambda()#2}::operator()()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x373a32) [0x563db302ba32] ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x2290b5) [0x563db2ee10b5] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47fb46) [0x563db3137b46] EventTracker::RecordExecution()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x42030e) [0x563db30d830e] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x420786) [0x563db30d8786] boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9ada0b) [0x563db3665a0b] boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9af1d1) [0x563db36671d1] boost::asio::detail::scheduler::run()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9af400) [0x563db3667400] boost::asio::io_context::run()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9fe110) [0x563db36b6110] execute_native_thread_routine
(raylet, ip=172.31.127.55) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7ff78c649609] start_thread
(raylet, ip=172.31.127.55) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff78c218133] __clone
(raylet, ip=172.31.127.55) 

Versions / Dependencies

Release 2.1 master

Reproduction script

NA

Issue Severity

No response

rickyyx added the bug, release-blocker, P0, triage, and r2.1-failure labels on Oct 13, 2022
rickyyx added this to the Core Nightly/CI Regressions milestone on Oct 13, 2022
scv119 (Contributor) commented Oct 13, 2022

For this test, it seems the raylet gets overloaded and node failures are not uncommon; however, the test could previously tolerate these node failures (most likely mitigated by lineage reconstruction).
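
(For context, a minimal sketch of the lineage-reconstruction behavior being referred to: when an object is lost along with a dead node, Ray can re-execute the task that produced it, bounded by the task's max_retries. The task names below are illustrative, not the release test's code.)

    # Sketch: lineage reconstruction. If a node dies and the output of
    # `make_block` is lost, Ray re-runs the task to rebuild the object
    # (up to max_retries), so the downstream consumer can still fetch it.
    import ray

    ray.init()

    @ray.remote(max_retries=3)
    def make_block(i):
        return [i] * 1000

    @ray.remote
    def consume(block):
        return len(block)

    refs = [make_block.remote(i) for i in range(10)]
    print(ray.get([consume.remote(r) for r in refs]))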

scv119 (Contributor) commented Oct 14, 2022

@clarkzinzow I'm a bit overloaded right now (5 release blockers in total). Mind taking this one?

scv119 assigned clarkzinzow and unassigned scv119 on Oct 14, 2022
rickyyx (Contributor, Author) commented Oct 17, 2022

Might be due to the agent memory leak (#29199).

clarkzinzow (Contributor) commented:

I've been looking into this, but taking a step back, it looks like this release test has not had a good success rate since August: https://b534fd88.us1a.app.preset.io/superset/dashboard/19/?native_filters_key=V5NMHssrZzj86978-asD6zeCWfwL9FF61U2d4tZ_PE3wDDt6-2hTI6MMw_UcRjIK

clarkzinzow (Contributor) commented Oct 18, 2022

The most recent failure looks to be due to a failure to connect to the GCS, which makes it somewhat more likely that the agent memory leak caused the head node to get OOM-killed: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_GhqMXjBcR8nutFSsTCc8Z7yp?command-history-section=command_history

    [2022-10-17 16:29:03,538 C 243 243] (raylet) gcs_rpc_client.h:537:  Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
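
(The fatal check above is the raylet giving up after failing to reach the GCS for gcs_rpc_server_reconnect_timeout_s, 60 seconds here. If more headroom were wanted purely for debugging, that value is an internal system config; a hedged sketch, noting that _system_config is an unsupported internal knob and only takes effect when the head node starts:)

    # Hedged sketch: raise the GCS reconnect timeout named in the check above.
    # `_system_config` is internal/unsupported and must be supplied when the
    # cluster (head node) is started.
    import ray

    ray.init(_system_config={"gcs_rpc_server_reconnect_timeout_s": 300})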

rkooo567 (Contributor) commented:

I think it is unlikely to be related to the agent leak. The agent leak is not that fast: the test fails within 8 minutes, and the agent leaks roughly 1 GB/h, so only on the order of ~130 MB would have leaked by then.

clarkzinzow (Contributor) commented:

@rkooo567 agreed!

c21 (Contributor) commented Oct 19, 2022

This test looks flaky; we got a successful run on the release branch later: https://buildkite.com/ray-project/release-tests-branch/builds/1123#0183d941-4949-4929-b3c8-233d0ca816e1. Removed as release blocker.

c21 removed the release-blocker, P0, and r2.1-failure labels on Oct 19, 2022
rickyyx (Contributor, Author) commented Oct 25, 2022

Should we declare this unstable? cc @scv119

jjyao (Collaborator) commented Oct 27, 2022

@scv119 I think it should be P0 for core in 2.2.

clarng (Contributor) commented Oct 31, 2022

Should we try the OOM killer if the node is timing out due to OOM / freezing?

scv119 (Contributor) commented Nov 1, 2022

Can the data team take over this test? @c21 @matthewdeng
After some investigation, I think this test regression is highly correlated with our infra memory capacity change: we now allocate only 90% of host memory to the Ray container, which increases the likelihood of Ray running into OOM issues.
As evidence, a commit (74f28f9) that was previously stable also fails in my recent tests: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_trJuhJfC4WH38x5ziRuzsxCz?command-history-section=command_history

c21 (Contributor) commented Nov 1, 2022

Hi @scv119 - yes, I think the data team should own it. We should work with core to come up with action items.

jjyao (Collaborator) commented Nov 1, 2022

Feels like we should try the OOM killer or use a bigger machine?
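
(A hedged sketch of what enabling the memory monitor / OOM killer might look like; the env var names below are my assumption for this Ray version and should be verified against the docs, and on a cluster they would need to be set before `ray start` on every node:)

    # Assumed env var names (verify for the Ray version in use): enable Ray's
    # memory monitor so its OOM killer preempts tasks before the OS OOM killer
    # takes down the raylet or worker processes.
    import os
    os.environ["RAY_memory_monitor_refresh_ms"] = "250"  # 0 disables the monitor
    os.environ["RAY_memory_usage_threshold"] = "0.9"     # act at 90% of node memory

    import ray
    ray.init()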

clarng added the P0 label on Nov 4, 2022
clarng added the core and data labels and removed the triage label on Nov 4, 2022
clarng (Contributor) commented Nov 7, 2022

@c21 do you mind taking the action item? It seems Clark already has another P0.

clarng removed the core label on Nov 7, 2022
c21 (Contributor) commented Nov 8, 2022

@clarng - sure, I will take a look later.

clarng assigned clarng and unassigned c21 and clarkzinzow on Nov 14, 2022