-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix node manager failure when ClientTable has a disconnected entry. #2905
Conversation
Test FAILed. |
Test PASSed. |
So just to make sure I understand, previously we would optimistically update resource tracking before we actually verified that the client was connected? And now we would update the resource tracking only after connection check returns True? Is it possible to add a test for this? |
src/ray/raylet/node_manager.cc
Outdated
TcpConnect(socket, client_info.node_manager_address, client_info.node_manager_port); | ||
// ClientTable is Log<ActorID, ActorTableData>, which has multiple entries for | ||
// a disconnected client. The first one has is_insertion=true and the second | ||
// one has is_insertion=false. We must make sure this is not a close client. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about this comment. ClientAdded
should never be called if is_insertion=true
. The case is_insertion=true
should be handled by ClientRemoved
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right ClientAdded
would never be called with is_insertion=false
. The comments is misleading.
src/ray/raylet/node_manager.cc
Outdated
// a disconnected client. The first one has is_insertion=true and the second | ||
// one has is_insertion=false. We must make sure this is not a close client. | ||
if (!status.ok()) { | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what circumstances would status
not be ok? I agree that failing to connect to the remote client shouldn't be a fatal error, but won't things go wrong later if we don't succeed at connecting to it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we return here and don't connect, when will the connection be established?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClientTableDataT
may come from a disconnected client. The connection should never be established.
src/ray/raylet/node_manager.cc
Outdated
|
||
ResourceSet resources_total(client_data.resources_total_label, | ||
client_data.resources_total_capacity); | ||
this->cluster_resource_map_.emplace(client_id, SchedulingResources(resources_total)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid using this
(I realize this is copied from above).
@robertnishihara @richardliaw By my understanding, when a raylet disconnected, it will write a message to ClientTable with Another way to solve this could be cleaning up both entries for the disconnected client. |
Sorry to mis-express the understanding. For a entry with The problem is that, after a raylet client disconnected, it only append an entry with |
Test PASSed. |
This seems related to #2852 |
Test PASSed. |
Test PASSed. |
@robertnishihara Is this PR OK for you? Is |
@guoyuhong I agree it feels fragile. I removed it. I think this is pretty good now. Would you be able to add a test in a follow up PR? E.g., the test should go in
What do you think about something like that? |
src/ray/raylet/node_manager.cc
Outdated
// The first one has is_insertion=true and the second one has is_insertion=false. | ||
// When a new raylet starts, ClientAdded will be called with the disconnected client's | ||
// first entry, which will cause IOError and "Connection refused". | ||
if (status.code() == StatusCode::IOError) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably preferable to just check !status.ok()
like you were doing before
src/ray/raylet/node_manager.cc
Outdated
RAY_CHECK(status.message() == "Connection refused"); | ||
RAY_LOG(WARNING) << "ClientAdded: " << status.message() | ||
<< " (conde=" << static_cast<int>(status.code()) << ")" | ||
<< " which may be caused by a disconnected client."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RAY_LOG(WARNING) << "Failed to connect to client " << client_id
<< " in ClientAdded. TcpConnect returned status: "
<< status.ToString() << ". This may be caused by "
<< "trying to connect to a node manager that has failed.";
src/ray/raylet/node_manager.cc
Outdated
// Establish a new NodeManager connection to this GCS client. | ||
auto client_info = gcs_client_->client_table().GetClient(client_id); | ||
RAY_LOG(DEBUG) << "[ClientAdded] CONNECTING TO: " | ||
RAY_LOG(DEBUG) << "[ClientAdded] TRY TO CONNECTING TO: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RAY_LOG(DEBUG) << "[ClientAdded] Trying to connect to client " << client_id
<< " at " << client_info.node_manager_address << ":"
<< client_info.node_manager_port;
@robertnishihara Yes. I will add the test in later PR. |
Test PASSed. |
What do these changes do?
ClientTable is Log<ActorID, ActorTableData>, which has multiple entries for a disconnected client. The first one has is_insertion=true and the second one has is_insertion=false. We must make sure this is not a closed client.
When a new raylet starts, the ClientAdded will be called with the disconnected client data. However, since the client was closed, the connection will fail. The following message shows the example log:
Client
8fcca9d6b3122effeb2424faa315713b79829ca9
and397783a6eddf33c74007a4cbabb156cbcfd0220d
has 2 entries.Related issue number
#2878