Fix node manager failure when ClientTable has a disconnected entry. #2905

Merged
merged 7 commits into from
Sep 22, 2018


guoyuhong
Contributor

@guoyuhong guoyuhong commented Sep 18, 2018

What do these changes do?

ClientTable is a Log<ActorID, ActorTableData>, so a disconnected client has multiple entries in it: the first has is_insertion=true and the second has is_insertion=false. Before connecting to a client in ClientAdded, we must make sure it has not already closed.

When a new raylet starts, ClientAdded is called with the data of the disconnected client. Because that client has already closed, the connection attempt fails. The following log shows an example:

I0918 17:16:41.168637 2961466240 tables.cc:335] ClientTable::HandleNotification with client: 0e92265aedc9194e66469a0743558f32e77e86f5 is_new=1 is_insertion=1
I0918 17:16:41.170995 2961466240 node_manager.cc:317] [ClientAdded] received callback from client id 0e92265aedc9194e66469a0743558f32e77e86f5
I0918 17:16:41.171082 2961466240 node_manager.cc:328] a new client: 0e92265aedc9194e66469a0743558f32e77e86f5
I0918 17:16:41.171145 2961466240 node_manager.cc:343] [ClientAdded] CONNECTING TO:  30.50.100.140 61838
I0918 17:16:41.171489 2961466240 tables.cc:335] ClientTable::HandleNotification with client: 8fcca9d6b3122effeb2424faa315713b79829ca9 is_new=1 is_insertion=1
I0918 17:16:41.171558 2961466240 node_manager.cc:317] [ClientAdded] received callback from client id 8fcca9d6b3122effeb2424faa315713b79829ca9
I0918 17:16:41.171686 2961466240 node_manager.cc:328] a new client: 8fcca9d6b3122effeb2424faa315713b79829ca9
I0918 17:16:41.171814 2961466240 node_manager.cc:343] [ClientAdded] CONNECTING TO:  30.50.100.140 61881
I0918 17:16:41.172071 2961466240 tables.cc:335] ClientTable::HandleNotification with client: 397783a6eddf33c74007a4cbabb156cbcfd0220d is_new=1 is_insertion=1
I0918 17:16:41.172159 2961466240 node_manager.cc:317] [ClientAdded] received callback from client id 397783a6eddf33c74007a4cbabb156cbcfd0220d
I0918 17:16:41.172209 2961466240 node_manager.cc:328] a new client: 397783a6eddf33c74007a4cbabb156cbcfd0220d
I0918 17:16:41.172261 2961466240 node_manager.cc:343] [ClientAdded] CONNECTING TO:  30.50.100.140 61923
I0918 17:16:41.172420 2961466240 tables.cc:335] ClientTable::HandleNotification with client: 8fcca9d6b3122effeb2424faa315713b79829ca9 is_new=1 is_insertion=0
I0918 17:16:41.172549 2961466240 node_manager.cc:361] [ClientRemoved] received callback from client id 8fcca9d6b3122effeb2424faa315713b79829ca9
I0918 17:16:41.172675 2961466240 tables.cc:335] ClientTable::HandleNotification with client: 397783a6eddf33c74007a4cbabb156cbcfd0220d is_new=1 is_insertion=0
I0918 17:16:41.172762 2961466240 node_manager.cc:361] [ClientRemoved] received callback from client id 397783a6eddf33c74007a4cbabb156cbcfd0220d
I0918 17:16:41.172847 2961466240 tables.cc:335] ClientTable::HandleNotification with client: e3c2435282712dca6d33f715aa7e3a4847a33f50 is_new=1 is_insertion=1
I0918 17:16:41.172936 2961466240 node_manager.cc:317] [ClientAdded] received callback from client id e3c2435282712dca6d33f715aa7e3a4847a33f50

Clients 8fcca9d6b3122effeb2424faa315713b79829ca9 and 397783a6eddf33c74007a4cbabb156cbcfd0220d each have two entries.
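The callback order in the log above can be reproduced with a small sketch. It assumes a simplified append-only log in which every entry is replayed, in order, to a late subscriber. The names ClientLogEntry, ReplayClientLog, and ReplayClientLogExample are hypothetical and are not Ray's API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one ClientTable log entry.
struct ClientLogEntry {
  std::string client_id;
  bool is_insertion;  // true: the client joined; false: the client left
};

// Replays every entry in append order, as a late subscriber would see it,
// and returns the name of the callback fired for each entry.
std::vector<std::string> ReplayClientLog(const std::vector<ClientLogEntry> &log) {
  std::vector<std::string> calls;
  for (const auto &entry : log) {
    const std::string callback = entry.is_insertion ? "ClientAdded" : "ClientRemoved";
    calls.push_back(callback + "(" + entry.client_id + ")");
  }
  return calls;
}

// Usage example: client "b" joined and left before the new raylet started.
inline void ReplayClientLogExample() {
  const std::vector<ClientLogEntry> log = {{"a", true}, {"b", true}, {"b", false}};
  const std::vector<std::string> calls = ReplayClientLog(log);
  // The stale insertion for "b" is still delivered before its removal.
  assert(calls[1] == "ClientAdded(b)");
  assert(calls[2] == "ClientRemoved(b)");
}
```

Because nothing is ever deleted from the log, the insertion entry of a departed client reaches the new subscriber first, which is the point at which the connection attempt is made.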

Related issue number

#2878

@guoyuhong
Contributor Author

@richardliaw

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8277/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8278/

@richardliaw
Contributor

So just to make sure I understand, previously we would optimistically update resource tracking before we actually verified that the client was connected? And now we would update the resource tracking only after connection check returns True?

Is it possible to add a test for this?

TcpConnect(socket, client_info.node_manager_address, client_info.node_manager_port);
// ClientTable is Log<ActorID, ActorTableData>, which has multiple entries for
// a disconnected client. The first one has is_insertion=true and the second
// one has is_insertion=false. We must make sure this is not a closed client.
Collaborator

I'm confused about this comment. ClientAdded should never be called if is_insertion=false. The case is_insertion=false should be handled by ClientRemoved, right?

Contributor Author

Yes, you are right: ClientAdded would never be called with is_insertion=false. The comment is misleading.

// a disconnected client. The first one has is_insertion=true and the second
// one has is_insertion=false. We must make sure this is not a closed client.
if (!status.ok()) {
return;
Collaborator

Under what circumstances would status not be ok? I agree that failing to connect to the remote client shouldn't be a fatal error, but won't things go wrong later if we don't succeed at connecting to it here?

Collaborator

If we return here and don't connect, when will the connection be established?

Contributor Author

The ClientTableDataT may come from a disconnected client. In that case the connection should never be established.
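The behavior discussed in this thread (skip the client when the connection fails, and only then track its resources) might look roughly like the sketch below. It is a simplified illustration, not the code in this PR: NodeManagerSketch, ClientInfoSketch, and the injected connect function standing in for TcpConnect are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the data a ClientTable entry carries.
struct ClientInfoSketch {
  std::string client_id;
  std::string address;
  int port;
  double cpus;  // stand-in for the client's total resources
};

class NodeManagerSketch {
 public:
  // Stand-in for TcpConnect: returns true when the connection succeeds.
  using ConnectFn = std::function<bool(const std::string &, int)>;

  explicit NodeManagerSketch(ConnectFn connect) : connect_(std::move(connect)) {}

  // Returns true if the client was registered, false if it was skipped.
  bool ClientAdded(const ClientInfoSketch &info) {
    if (!connect_(info.address, info.port)) {
      // Probably a stale insertion entry for a client that already exited:
      // skip it rather than abort, and do not track its resources.
      return false;
    }
    cluster_resources_.emplace(info.client_id, info.cpus);
    return true;
  }

  void ClientRemoved(const std::string &client_id) {
    cluster_resources_.erase(client_id);  // no-op if the client was skipped
  }

  std::size_t NumTrackedClients() const { return cluster_resources_.size(); }

 private:
  ConnectFn connect_;
  std::unordered_map<std::string, double> cluster_resources_;
};

// Usage example: pretend only port 61838 still has a listener.
inline void ClientAddedExample() {
  NodeManagerSketch manager(
      [](const std::string &, int port) { return port == 61838; });
  const bool live = manager.ClientAdded({"live", "30.50.100.140", 61838, 4.0});
  const bool dead = manager.ClientAdded({"dead", "30.50.100.140", 61881, 4.0});
  manager.ClientRemoved("dead");  // the later removal entry changes nothing
  assert(live && !dead);
  assert(manager.NumTrackedClients() == 1);
}
```

Registering resources only after a successful connection keeps the scheduler from counting capacity on a node that can never be reached.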


ResourceSet resources_total(client_data.resources_total_label,
client_data.resources_total_capacity);
this->cluster_resource_map_.emplace(client_id, SchedulingResources(resources_total));
Collaborator

Let's avoid using this-> here (I realize it is copied from above).

@guoyuhong
Contributor Author

@robertnishihara @richardliaw My understanding is that when a raylet disconnects, it writes an entry to ClientTable with is_insertion=false, and the other raylets process it with ClientRemoved. But when a new raylet connects, ClientAdded is called several times, once for each ClientTable entry. For a disconnected client that has two entries in ClientTable, the first call has is_insertion=true and the second call has is_insertion=false.

Another way to solve this could be cleaning up both entries for the disconnected client.
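As a rough illustration of that alternative, the sketch below compacts a log by dropping every entry of any client that has a removal entry. It assumes client IDs are never reused after a client leaves. RawLogEntry, CompactClientLog, and CompactClientLogExample are hypothetical names, and this is not how the ClientTable is implemented.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified stand-in for one ClientTable log entry.
struct RawLogEntry {
  std::string client_id;
  bool is_insertion;  // true: the client joined; false: the client left
};

// Returns a copy of the log without any entry (insertion or removal) of a
// client that has left. Assumes a client ID is never reused after removal.
std::vector<RawLogEntry> CompactClientLog(const std::vector<RawLogEntry> &log) {
  // First pass: collect the IDs of all clients that have a removal entry.
  std::unordered_set<std::string> removed;
  for (const auto &entry : log) {
    if (!entry.is_insertion) {
      removed.insert(entry.client_id);
    }
  }
  // Second pass: keep only entries of clients that never left.
  std::vector<RawLogEntry> compacted;
  for (const auto &entry : log) {
    if (removed.count(entry.client_id) == 0) {
      compacted.push_back(entry);
    }
  }
  return compacted;
}

// Usage example: only the still-connected client "a" survives compaction.
inline void CompactClientLogExample() {
  const std::vector<RawLogEntry> log = {{"a", true}, {"b", true}, {"b", false}};
  const std::vector<RawLogEntry> compacted = CompactClientLog(log);
  assert(compacted.size() == 1);
  assert(compacted[0].client_id == "a");
}
```

With a compacted log, a new subscriber would never see the stale insertion, at the cost of rewriting shared state whenever a client leaves.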

@guoyuhong
Contributor Author

Sorry, I misstated my understanding. An entry with is_insertion=false is processed by ClientRemoved.

The problem is that after a raylet disconnects, it only appends an entry with is_insertion=false; the previous entry with is_insertion=true is not removed. Therefore, when a new raylet connects, it receives the first entry, which says the closed client is still alive, and the connection to the closed client fails.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8284/

@zhijunfu
Contributor

This seems related to #2852

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8319/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8323/

@guoyuhong
Contributor Author

@robertnishihara Does this PR look OK to you? Is RAY_CHECK(status.message() == "Connection refused") strange, and fragile if the message changes?

@robertnishihara
Collaborator

@guoyuhong I agree it feels fragile. I removed it. I think this is pretty good now.

Would you be able to add a test in a follow-up PR?

E.g., the test should go in test/multi_node_test.py and should essentially do the following.

  • Start one raylet with ray start --head
  • Start another with ray start --redis-address=...
  • Make sure that both raylets are up (e.g., by checking ray.global_state.client_table())
  • Kill the second raylet
  • Start another raylet with ray start --redis-address=...
  • Run some tasks and make sure they get scheduled on two distinct raylets

What do you think about something like that?

// The first one has is_insertion=true and the second one has is_insertion=false.
// When a new raylet starts, ClientAdded will be called with the disconnected client's
// first entry, which will cause IOError and "Connection refused".
if (status.code() == StatusCode::IOError) {
Collaborator

It's probably preferable to just check !status.ok() like you were doing before
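To make the trade-off concrete, here is a small sketch comparing a check on the error text with a check on ok(). StatusSketch and StatusCodeSketch are hypothetical, minimal stand-ins and not Ray's Status class. The "Operation timed out" text is only an illustrative example of a different failure message.

```cpp
#include <cassert>
#include <string>
#include <utility>

enum class StatusCodeSketch { OK, IOError };

// Hypothetical, minimal stand-in for a status object with a code and message.
class StatusSketch {
 public:
  static StatusSketch OK() { return StatusSketch(StatusCodeSketch::OK, ""); }
  static StatusSketch IOError(std::string message) {
    return StatusSketch(StatusCodeSketch::IOError, std::move(message));
  }
  bool ok() const { return code_ == StatusCodeSketch::OK; }
  const std::string &message() const { return message_; }

 private:
  StatusSketch(StatusCodeSketch code, std::string message)
      : code_(code), message_(std::move(message)) {}
  StatusCodeSketch code_;
  std::string message_;
};

// Fragile: only recognizes one exact error text and misses other failures.
bool ShouldSkipClientByMessage(const StatusSketch &status) {
  return status.message() == "Connection refused";
}

// Robust: any failed connection attempt means the client is skipped.
bool ShouldSkipClient(const StatusSketch &status) { return !status.ok(); }

// Usage example: a timeout is a failure too, but its text differs.
inline void StatusCheckExample() {
  const StatusSketch timeout = StatusSketch::IOError("Operation timed out");
  assert(!ShouldSkipClientByMessage(timeout));  // the text match misses it
  assert(ShouldSkipClient(timeout));            // the ok() check catches it
}
```

Checking ok() also avoids coupling the caller to the wording of an error string that the caller does not control.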

RAY_CHECK(status.message() == "Connection refused");
RAY_LOG(WARNING) << "ClientAdded: " << status.message()
<< " (code=" << static_cast<int>(status.code()) << ")"
<< " which may be caused by a disconnected client.";
Collaborator

RAY_LOG(WARNING) << "Failed to connect to client " << client_id 
                 << " in ClientAdded. TcpConnect returned status: "
                 << status.ToString() << ". This may be caused by "
                 << "trying to connect to a node manager that has failed.";

// Establish a new NodeManager connection to this GCS client.
auto client_info = gcs_client_->client_table().GetClient(client_id);
RAY_LOG(DEBUG) << "[ClientAdded] CONNECTING TO: "
RAY_LOG(DEBUG) << "[ClientAdded] TRY TO CONNECTING TO: "
Collaborator

RAY_LOG(DEBUG) << "[ClientAdded] Trying to connect to client " << client_id
               << " at " << client_info.node_manager_address << ":"
               << client_info.node_manager_port;

@guoyuhong
Contributor Author

@robertnishihara Yes. I will add the test in a later PR.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8327/

5 participants