Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix node manager failure when ClientTable has a disconnected entry. #2905

Merged
merged 7 commits into from
Sep 22, 2018
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 15 additions & 7 deletions src/ray/raylet/node_manager.cc
Original file line number Diff line number Diff line change
Expand Up @@ -334,21 +334,29 @@ void NodeManager::ClientAdded(const ClientTableDataT &client_data) {
return;
}

ResourceSet resources_total(client_data.resources_total_label,
client_data.resources_total_capacity);
this->cluster_resource_map_.emplace(client_id, SchedulingResources(resources_total));

// Establish a new NodeManager connection to this GCS client.
auto client_info = gcs_client_->client_table().GetClient(client_id);
RAY_LOG(DEBUG) << "[ClientAdded] CONNECTING TO: "
RAY_LOG(DEBUG) << "[ClientAdded] TRY TO CONNECTING TO: "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RAY_LOG(DEBUG) << "[ClientAdded] Trying to connect to client " << client_id
               << " at " << client_info.node_manager_address << ":"
               << client_info.node_manager_port;

<< " " << client_info.node_manager_address << " "
<< client_info.node_manager_port;

boost::asio::ip::tcp::socket socket(io_service_);
RAY_CHECK_OK(TcpConnect(socket, client_info.node_manager_address,
client_info.node_manager_port));
auto status =
TcpConnect(socket, client_info.node_manager_address, client_info.node_manager_port);
// ClientTable is Log<ActorID, ActorTableData>, which has multiple entries for
// a disconnected client. The first one has is_insertion=true and the second
// one has is_insertion=false. We must make sure this is not a close client.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about this comment. ClientAdded should never be called if is_insertion=true. The case is_insertion=true should be handled by ClientRemoved, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right ClientAdded would never be called with is_insertion=false. The comments is misleading.

if (!status.ok()) {
return;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what circumstances would status not be ok? I agree that failing to connect to the remote client shouldn't be a fatal error, but won't things go wrong later if we don't succeed at connecting to it here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we return here and don't connect, when will the connection be established?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClientTableDataT may come from a disconnected client. The connection should never be established.

}

// The client is connected.
auto server_conn = TcpServerConnection(std::move(socket));
remote_server_connections_.emplace(client_id, std::move(server_conn));

ResourceSet resources_total(client_data.resources_total_label,
client_data.resources_total_capacity);
this->cluster_resource_map_.emplace(client_id, SchedulingResources(resources_total));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid using this (I realize this is copied from above).

}

void NodeManager::ClientRemoved(const ClientTableDataT &client_data) {
Expand Down