Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core][Nightly] datasets_ingest_train_infer failing #26966

Closed
scv119 opened this issue Jul 25, 2022 · 15 comments · Fixed by #27034
Closed

[Core][Nightly] datasets_ingest_train_infer failing #26966

scv119 opened this issue Jul 25, 2022 · 15 comments · Fixed by #27034
Assignees
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@scv119
Copy link
Contributor

scv119 commented Jul 25, 2022

What happened + What you expected to happen

failing test: https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_rFHQEVVAA3KZgi4apbnTAkcd?command-history-section=command_history

Step 1: Dropping nulls, creating new_col, updating feature_1

Map_Batches:  99%|█████████████████████████████████████████████████████████████████████████████████████▏| 396/400 [00:19<00:00, 78.36it/s](pid=gcs_server) E0725 08:34:14.326870515     253 chttp2_transport.cc:2846]   keepalive_ping_end state error: 0 (expect: 1)
(raylet, ip=10.0.0.128) [2022-07-25 08:35:14,805 C 157 212] (raylet) gcs_rpc_client.h:537:  Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
(raylet, ip=10.0.0.128) *** StackTrace Information ***
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47af5a) [0x55e3772d0f5a] ray::operator<<()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47ca52) [0x55e3772d2a52] ray::SpdLogMessage::Flush()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47cd67) [0x55e3772d2d67] ray::RayLog::~RayLog()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x353253) [0x55e3771a9253] ray::rpc::GcsRpcClient::CheckChannelStatus()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3536c3) [0x55e3771a96c3] boost::asio::detail::wait_handler<>::do_complete()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9896eb) [0x55e3777df6eb] boost::asio::detail::scheduler::do_run_one()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98aeb1) [0x55e3777e0eb1] boost::asio::detail::scheduler::run()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98b0e0) [0x55e3777e10e0] boost::asio::io_context::run()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9d9fa0) [0x55e37782ffa0] execute_native_thread_routine
(raylet, ip=10.0.0.128) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f108a3366db] start_thread
(raylet, ip=10.0.0.128) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f108951861f] __clone
(raylet, ip=10.0.0.128) 
Raylet is terminated: ip=10.0.0.128, id=ab515a6f7fd0794fab61dc4ac46742175cdff63997306a6de835bfb6. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 291.934 us, total = 291.934 us
    [state-dump]        NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 489.663 us, total = 489.663 us
    [state-dump]        NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 1.014 s, total = 1.014 s
    [state-dump] DebugString() time ms: 1
    [state-dump] 
    [state-dump] 
    [2022-07-25 08:35:14,805 C 157 212] (raylet) gcs_rpc_client.h:537:  Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
    *** StackTrace Information ***
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47af5a) [0x55e3772d0f5a] ray::operator<<()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47ca52) [0x55e3772d2a52] ray::SpdLogMessage::Flush()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47cd67) [0x55e3772d2d67] ray::RayLog::~RayLog()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x353253) [0x55e3771a9253] ray::rpc::GcsRpcClient::CheckChannelStatus()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3536c3) [0x55e3771a96c3] boost::asio::detail::wait_handler<>::do_complete()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9896eb) [0x55e3777df6eb] boost::asio::detail::scheduler::do_run_one()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98aeb1) [0x55e3777e0eb1] boost::asio::detail::scheduler::run()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98b0e0) [0x55e3777e10e0] boost::asio::io_context::run()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9d9fa0) [0x55e37782ffa0] execute_native_thread_routine
    /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f108a3366db] start_thread
    /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f108951861f] __clone
    

(pid=gcs_server) [2022-07-25 08:35:34,426 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
Raylet is terminated: ip=10.0.0.150, id=717b4ae7fe19f51c9d16947065f5a1144e034fa386e3da99a3fbda50. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 431.974 us, total = 431.974 us
    [state-dump]        JobInfoGcsService.grpc_client.AddJob - 1 total (0 active), CPU time: mean = 41.452 us, total = 41.452 us
    [state-dump] DebugString() time ms: 1
    [state-dump] 
    [state-dump] 
    [2022-07-25 08:35:36,203 W 273 337] (raylet) node_manager.cc:158: Last heartbeat was sent 8987 ms ago. There might be resource pressure on this node. If heartbeat keeps lagging, this node can be marked as dead mistakenly.
    [2022-07-25 08:35:40,492 C 273 339] (raylet) gcs_rpc_client.h:537:  Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
    *** StackTrace Information ***
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47af5a) [0x55da47c05f5a] ray::operator<<()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47ca52) [0x55da47c07a52] ray::SpdLogMessage::Flush()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47cd67) [0x55da47c07d67] ray::RayLog::~RayLog()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x353253) [0x55da47ade253] ray::rpc::GcsRpcClient::CheckChannelStatus()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3536c3) [0x55da47ade6c3] boost::asio::detail::wait_handler<>::do_complete()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9896eb) [0x55da481146eb] boost::asio::detail::scheduler::do_run_one()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98aeb1) [0x55da48115eb1] boost::asio::detail::scheduler::run()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98b0e0) [0x55da481160e0] boost::asio::io_context::run()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9d9fa0) [0x55da48164fa0] execute_native_thread_routine
    /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fbe6cbd56db] start_thread
    /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fbe6bdb761f] __clone
    

(pid=gcs_server) [2022-07-25 08:35:41,696 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:41,709 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(raylet) [2022-07-25 08:35:40,492 C 273 339] (raylet) gcs_rpc_client.h:537:  Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
(raylet) *** StackTrace Information ***
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47af5a) [0x55da47c05f5a] ray::operator<<()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47ca52) [0x55da47c07a52] ray::SpdLogMessage::Flush()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47cd67) [0x55da47c07d67] ray::RayLog::~RayLog()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x353253) [0x55da47ade253] ray::rpc::GcsRpcClient::CheckChannelStatus()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3536c3) [0x55da47ade6c3] boost::asio::detail::wait_handler<>::do_complete()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9896eb) [0x55da481146eb] boost::asio::detail::scheduler::do_run_one()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98aeb1) [0x55da48115eb1] boost::asio::detail::scheduler::run()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98b0e0) [0x55da481160e0] boost::asio::io_context::run()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9d9fa0) [0x55da48164fa0] execute_native_thread_routine
(raylet) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fbe6cbd56db] start_thread
(raylet) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fbe6bdb761f] __clone
(raylet) 
(pid=gcs_server) [2022-07-25 08:35:42,584 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:42,584 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:43,587 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:43,588 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:44,590 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:44,590 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:45,593 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:45,593 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:46,595 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:46,595 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:47,598 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:47,599 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:48,601 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
(pid=gcs_server) [2022-07-25 08:35:48,601 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
Traceback (most recent call last):
  File "ray_sgd_training.py", line 524, in <module>
    read_dataset(data_path)
  File "ray_sgd_training.py", line 47, in preprocess_train_data
    return self._preprocess(ds, False)
  File "ray_sgd_training.py", line 75, in _preprocess
    ds = ds.map_batches(batch_transformer, batch_format="pandas")
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 499, in map_batches
    return Dataset(plan, self._epoch, self._lazy)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 207, in __init__
    self._plan.execute(allow_clear_input_blocks=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 309, in execute
    blocks, clear_input_blocks, self._run_by_consumer
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 672, in __call__
    fn_constructor_kwargs=self.fn_constructor_kwargs,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/compute.py", line 113, in _apply
    results = map_bar.fetch_until_complete(refs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/progress_bar.py", line 75, in fetch_until_complete
    for ref, result in zip(done, ray.get(done)):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2233, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 668, in get_objects
    object_refs, self.current_task_id, timeout_ms
  File "python/ray/_raylet.pyx", line 1219, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 179, in ray._raylet.check_status
ray.exceptions.RaySystemError: System error: Connection reset by peer

Versions / Dependencies

latest ray

Reproduction script

run nightly test

Issue Severity

High: It blocks me from completing my task.

@scv119 scv119 added bug Something that is supposed to be working; but isn't release-blocker P0 Issue that blocks the release P0 Issues that should be fixed in short order triage Needs triage (eg: priority, bug/not-bug, and owning component) Ray 2.0 labels Jul 25, 2022
@scv119 scv119 added this to the Core Nightly/CI Regressions milestone Jul 25, 2022
@scv119
Copy link
Contributor Author

scv119 commented Jul 25, 2022

git log --oneline 8fe4399..bf97a69
bf97a69 [Dashboard] Actor Table UI Optimize (#26785)
4d6cbb0 [Java]More efficient getAllNodeInfo() (#26872)
abde2a5 [tune] Fix current best trial progress string for metric=0 (#26943)
4a1ad3e [Workflow] Support "retry_exceptions" of Ray tasks (#26913)
a012033 [ci] pin werkzeug (#26950)
15b711a [State Observability] Warn if callsite is disabled when ray list objects + raise exception on missing output (#26880)
1ac2a87 [docs] Editing pass over Dataset docs (#26935)
d01a80e [core] runtime context resource ids getter (#26907)
acbab51 [Nightly] fix microbenchmark scripts (#26947)
0c16619 [core] Make ray able to connect to redis without pip redis. (#25875)
8d7b865 [air/tuner/docs] Update docs for Tuner() API 2a: Tune examples (non-docs) (#26931)
803c094 [air/tuner/docs] Update docs for Tuner() API 2b: Tune examples (ipynb) (#26884)
008eecf [docs] Update the AIR data ingest guide (#26909)
e19cf16 [Datasets] Use sampling to estimate in-memory data size for Parquet data source (#26868)

8d7b865 is passing https://console.anyscale.com/o/anyscale-internal/configurations/app-config-details/bld_pkgTMjHxMk6ZDnwDDg3Athfa

@scv119
Copy link
Contributor Author

scv119 commented Jul 26, 2022

@scv119
Copy link
Contributor Author

scv119 commented Jul 26, 2022

@scv119 scv119 self-assigned this Jul 26, 2022
@scv119
Copy link
Contributor Author

scv119 commented Jul 26, 2022

@scv119
Copy link
Contributor Author

scv119 commented Jul 27, 2022

@scv119
Copy link
Contributor Author

scv119 commented Jul 27, 2022

I think i'm 95% confident e19cf16 is causing the regression. most likely because we have too many objects.

@rkooo567
Copy link
Contributor

rkooo567 commented Jul 27, 2022

I think after e19cf16 is addressed, we should still work on improving the robustness of GCS in the same situation? (as a P1 issue?)

@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks and removed P0 Issues that should be fixed in short order labels Jul 27, 2022
@rkooo567 rkooo567 reopened this Jul 27, 2022
@rkooo567
Copy link
Contributor

Reopen for the further investigation on GCS disconnection. TL;DR Raylets should never be disconnected like that although there are some application loads e19cf16 caused. We should investigate the root cause and make that part more robust.

@rkooo567
Copy link
Contributor

Btw cc @iycheng I think this error

(raylet) [2022-07-25 08:35:40,492 C 273 339] (raylet) gcs_rpc_client.h:537:  Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds

means our gcs death detecting logic probably has a bug. I think this check failure happens when GCS is actually alive. Looking at code, some potential issues are

  1. if states oscillate between GRPC_CHANNEL_CONNECTING and others, but if it is always GRPC_CHANNEL_CONNECTING when the periodic CheckChannelStatus is called, it can think connection wasn't established for a long time although it did.
  2. Simply GCS is too overloaded and couldn't make a connection. I think we should tweak some of keepalive settings & idle connection timeout settings from client & server to reduce the load.
  3. Some RPCs have the gRPC error (it seems like when there are gRPC error, we just think GCS is dead).

Additionally, I think we should also make gcs_is_down_ = false when the state is GRPC_CHANNEL_READY?

case GRPC_CHANNEL_READY:

@rkooo567 rkooo567 assigned fishbone and unassigned scv119 Jul 28, 2022
@rkooo567
Copy link
Contributor

We talked in person, and we agreed we need some tweaks in detection logic + grpc config. Yi will look at it after rc1 deadline.

@scv119 scv119 removed the release-blocker P0 Issue that blocks the release label Jul 29, 2022
@scv119
Copy link
Contributor Author

scv119 commented Jul 29, 2022

@rkooo567 now it's no longer release-blocker right?

@scv119 scv119 closed this as completed Sep 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks triage Needs triage (eg: priority, bug/not-bug, and owning component)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants