[Core] ObjectStore fail to pull object, possibly because node info is missing #32046
Comments
@stephanie-wang @mwtian @rkooo567 Could you help analyze this issue?
@MissiontoMars are you using Redis-based pubsub, or the OSS one?
Sorry for the late reply. No additional options were configured when launching the Ray cluster, so I guess it is GCS pubsub. The problem may be related to the high CPU load of the dashboard agent; I found that it was mostly dealing with metrics, so I disabled it to reduce CPU usage. Since then, the problem has barely arisen.
@MissiontoMars the agent CPU usage should have been fixed in recent versions (2.1 onward).
One question: when this happens, does it hang forever, or does it eventually resolve? I wonder if it is data loss or a slowdown.
It hangs forever.
About to merge soon.
Why are these changes needed? #32046 indicates that the pubsub might lose data, especially when the subscriber is under load. After examining the protocol, it seems one bug is that the publisher fails to handle publish failures: when we push a message from the mailbox, we delete the message being sent regardless of RPC failures. This PR addresses the problem by adding a monotonically increasing sequence_id to each message and only deleting messages once the subscriber has acknowledged receiving them. The sequence_id is generated per publisher, regardless of channel. This means that if multiple channels exist for the same publisher, each channel might not see contiguous sequences. This also assumes the invariant that a subscriber object will only subscribe to one publisher. We also rely on the pubsub protocol's guarantee that at most one push request will be in flight. This also handles GCS failover: we track the publisher_id on both the publisher and the subscriber. When the GCS fails over, the publisher_id changes, so both the publisher and the subscriber discard the state from before the failover.
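For illustration, here is a minimal sketch of the ack-based mailbox described above. The names (PublisherMailbox, HandleAck, etc.) are hypothetical and not Ray's actual classes; it only shows the core idea that messages are stamped with a monotonically increasing sequence_id and retained until acknowledged, so a failed push RPC can be retried instead of the message being silently dropped:

```cpp
// Minimal sketch of an ack-based publisher mailbox (hypothetical names,
// not Ray's actual implementation).
#include <cstdint>
#include <deque>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct Message {
  int64_t sequence_id;
  std::string payload;
};

class PublisherMailbox {
 public:
  // Enqueue a message, stamping it with the next sequence_id.
  void Publish(std::string payload) {
    mailbox_.push_back({++next_sequence_id_, std::move(payload)});
  }

  // Messages that still need to be (re)sent: everything not yet acknowledged.
  std::vector<Message> PendingMessages() const {
    return {mailbox_.begin(), mailbox_.end()};
  }

  // Called when the subscriber acknowledges everything up to ack_sequence_id.
  // Only now is it safe to delete buffered messages; a failed push RPC leaves
  // the mailbox untouched, so the messages are retried on the next push.
  void HandleAck(int64_t ack_sequence_id) {
    while (!mailbox_.empty() && mailbox_.front().sequence_id <= ack_sequence_id) {
      mailbox_.pop_front();
    }
  }

 private:
  int64_t next_sequence_id_ = 0;
  std::deque<Message> mailbox_;  // Retained until acknowledged.
};

int main() {
  PublisherMailbox mailbox;
  mailbox.Publish("node_info:nodeB");
  mailbox.Publish("node_info:nodeC");

  // Suppose the first push RPC fails: nothing is deleted, so both messages
  // remain pending. A later successful push is acked up to sequence_id 2.
  mailbox.HandleAck(2);

  std::cout << "pending after ack: " << mailbox.PendingMessages().size() << "\n";
  return 0;
}
```

Under this scheme, a GCS failover additionally changes the publisher_id, which lets both sides discard any state from before the failover rather than acking against a stale sequence.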
What happened + What you expected to happen
As mentioned here: https://discuss.ray.io/t/raylet-object-manager-cc-couldnt-send-pull-request-from/9027, we are running into problems pulling remote objects in our production environment.
Our production Ray cluster: 240 worker nodes and 1400 actors in total, Ray version 2.0.0 (without any modifications to Ray core).
NOTE: To describe the problem conveniently, the following uses nodeA to represent d9969738fb6ac4cb998e1b12a4d8acfea969cd2ab45a7cc6c7fda954, and nodeB to represent fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3.
After diving into some code (raylet and GCS) and logs, we found that it is possible that the GCS node-info pub message went missing.
ray/src/ray/object_manager/object_manager.cc, line 293 (commit 2947e23)
The `Couldn't send pull request from` log means that the RPC client from nodeA to nodeB is null.
ray/src/ray/object_manager/object_manager.cc, line 704 (commit 2947e23)
It seems that nodeA cannot get the connection info of nodeB.
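To make the failure mode concrete, here is a hypothetical sketch of this step (illustrative names and signatures, not Ray's actual API): the object manager looks up an RPC client for the remote node, and if the node's address was never learned, the lookup yields null and the pull request is dropped with exactly this kind of log line:

```cpp
// Hypothetical sketch of the null-RPC-client failure path (not Ray's code).
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

struct RpcClient {
  std::string remote_address;
};

class ObjectManager {
 public:
  // Called when the node's address becomes known (via GCS notification).
  void AddNodeAddress(const std::string &node_id, const std::string &address) {
    clients_[node_id] = std::make_shared<RpcClient>(RpcClient{address});
  }

  void SendPullRequest(const std::string &object_id, const std::string &node_id) {
    auto it = clients_.find(node_id);
    if (it == clients_.end() || it->second == nullptr) {
      // Corresponds to the log at object_manager.cc:293 in the report.
      std::cerr << "Couldn't send pull request from " << node_id
                << " for object " << object_id << ": no RPC client\n";
      return;
    }
    std::cout << "pull " << object_id << " from "
              << it->second->remote_address << "\n";
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<RpcClient>> clients_;
};

int main() {
  ObjectManager om;  // nodeA's view: it never learned nodeB's address.
  om.SendPullRequest("obj1", "nodeB");
  return 0;
}
```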
Continuing the trace:
ray/src/ray/object_manager/ownership_based_object_directory.cc, line 451 (commit 2947e23)
ray/src/ray/gcs/gcs_client/accessor.cc, line 536 (commit 2947e23)
nodeB does not exist in the local node cache.
Then, checking raylet.out on nodeA, there is no log like `Received notification for node id` for nodeB.
ray/src/ray/gcs/gcs_client/accessor.cc, line 608 (commit 2947e23)
According to the GCS log, nodeB was registered normally.
So clearly, nodeA did not receive the node info of nodeB from the GCS. Meanwhile, we checked other nodes, such as nodeC, and the node info of nodeB was received there.
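A minimal sketch, again with hypothetical names, of the subscriber-side handler the missing log line points at: each node-added message from GCS pubsub should reach a handler like this, log `Received notification for node id ...`, and populate the local node cache. If the pubsub message is lost, the handler never runs for nodeB on nodeA, which matches everything observed above:

```cpp
// Hypothetical sketch of the notification handler that feeds the node cache
// (illustrative names; cf. accessor.cc:536 and :608 in the report).
#include <iostream>
#include <string>
#include <unordered_map>

struct GcsNodeInfo {
  std::string node_id;
  std::string address;
};

class NodeInfoAccessor {
 public:
  // Subscribed callback for GCS node-change notifications; if the pubsub
  // message is lost, this never runs and the cache stays empty for that node.
  void HandleNotification(const GcsNodeInfo &node_info) {
    std::cout << "Received notification for node id " << node_info.node_id << "\n";
    node_cache_[node_info.node_id] = node_info;
  }

  // Cache lookup used by the object directory; returns nullptr on a miss.
  const GcsNodeInfo *Get(const std::string &node_id) const {
    auto it = node_cache_.find(node_id);
    return it == node_cache_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<std::string, GcsNodeInfo> node_cache_;
};

int main() {
  NodeInfoAccessor accessor;
  // On nodeC the notification for nodeB arrived; on nodeA it never did,
  // so the equivalent Get("nodeB") there returns nullptr.
  accessor.HandleNotification({"nodeB", "10.0.0.2"});
  std::cout << (accessor.Get("nodeB") ? "found" : "missing") << "\n";
  return 0;
}
```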
Versions / Dependencies
Version: ray 2.0.0
Reproduction script
None
Issue Severity
None