Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix node manager failure when ClientTable has a disconnected entry. #2905

Merged
merged 7 commits into from
Sep 22, 2018
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 23 additions & 7 deletions src/ray/raylet/node_manager.cc
Original file line number Diff line number Diff line change
Expand Up @@ -334,21 +334,37 @@ void NodeManager::ClientAdded(const ClientTableDataT &client_data) {
return;
}

ResourceSet resources_total(client_data.resources_total_label,
client_data.resources_total_capacity);
this->cluster_resource_map_.emplace(client_id, SchedulingResources(resources_total));

// Establish a new NodeManager connection to this GCS client.
auto client_info = gcs_client_->client_table().GetClient(client_id);
RAY_LOG(DEBUG) << "[ClientAdded] CONNECTING TO: "
RAY_LOG(DEBUG) << "[ClientAdded] TRY TO CONNECTING TO: "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RAY_LOG(DEBUG) << "[ClientAdded] Trying to connect to client " << client_id
               << " at " << client_info.node_manager_address << ":"
               << client_info.node_manager_port;

<< " " << client_info.node_manager_address << " "
<< client_info.node_manager_port;

boost::asio::ip::tcp::socket socket(io_service_);
RAY_CHECK_OK(TcpConnect(socket, client_info.node_manager_address,
client_info.node_manager_port));
auto status =
TcpConnect(socket, client_info.node_manager_address, client_info.node_manager_port);
// ClientTable is Log<ActorID, ActorTableData>, a disconnected client has 2 entries.
// The first one has is_insertion=true and the second one has is_insertion=false.
// When a new raylet starts, ClientAdded will be called with the disconnected client's
// first entry, which will cause IOError and "Connection refused".
if (status.code() == StatusCode::IOError) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably preferable to just check !status.ok() like you were doing before

// Do we need to check this? If the error message changes this will fail.
RAY_CHECK(status.message() == "Connection refused");
RAY_LOG(WARNING) << "ClientAdded: " << status.message()
<< " (conde=" << static_cast<int>(status.code()) << ")"
<< " which may be caused by a disconnected client.";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RAY_LOG(WARNING) << "Failed to connect to client " << client_id 
                 << " in ClientAdded. TcpConnect returned status: "
                 << status.ToString() << ". This may be caused by "
                 << "trying to connect to a node manager that has failed.";


return;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what circumstances would status not be ok? I agree that failing to connect to the remote client shouldn't be a fatal error, but won't things go wrong later if we don't succeed at connecting to it here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we return here and don't connect, when will the connection be established?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClientTableDataT may come from a disconnected client. The connection should never be established.

}

// The client is connected.
RAY_CHECK(status.ok());
auto server_conn = TcpServerConnection(std::move(socket));
remote_server_connections_.emplace(client_id, std::move(server_conn));

ResourceSet resources_total(client_data.resources_total_label,
client_data.resources_total_capacity);
cluster_resource_map_.emplace(client_id, SchedulingResources(resources_total));
}

void NodeManager::ClientRemoved(const ClientTableDataT &client_data) {
Expand Down