Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Revert "Deflakey test advanced 9"" #35091

Closed

Conversation

fishbone
Copy link
Contributor

@fishbone fishbone commented May 5, 2023

Reverts #35090

async_wait doesn't work in the same way in linux/mac/windows. Using another method to monitor the connection.

Only updates are in the commit

The new method listens to two events:

  • EOF
  • Connection failure.

And this should cover all platforms

@fishbone fishbone force-pushed the revert-35090-revert-34883-deflakey-test-advanced-9 branch from 4a83335 to bcdff25 Compare May 5, 2023 22:57
@fishbone fishbone marked this pull request as ready for review May 5, 2023 22:58
@fishbone fishbone requested a review from a team as a code owner May 5, 2023 22:58
@fishbone fishbone force-pushed the revert-35090-revert-34883-deflakey-test-advanced-9 branch from bcdff25 to 66ce57d Compare May 5, 2023 22:59
@fishbone fishbone force-pushed the revert-35090-revert-34883-deflakey-test-advanced-9 branch from 66ce57d to d17ce3b Compare May 5, 2023 23:00
@fishbone
Copy link
Contributor Author

fishbone commented May 6, 2023

Hmmm. linux not working in this way.. :(

@fishbone fishbone marked this pull request as draft May 6, 2023 05:45
Signed-off-by: Yi Cheng <[email protected]>
@fishbone fishbone marked this pull request as ready for review May 6, 2023 22:06
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have some questions/concerns regarding reading sockets from this new method + original client.

Alternatively, what about we just handle the failure here?

read_message_ = error_data;

So if it is failed & call DisconnectClient, we call Close() inside client_connection.cc and invoke a callback (which reports the worker failure). We can also make the additional boolean like "is_disconnected" to make this idempotent?

KillChildProcs();

// Disconnect here before KillChildProcs to make the Raylet async wait shorter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment valid?

Disconnect here before KillChildProcs

The code looks like we Disconnect "after" KillChildProcs?

KillChildProcs();
// Disconnect here after KillChildProcs to make the Raylet async wait shorter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as below

@@ -138,6 +141,11 @@ class ServerConnection : public std::enable_shared_from_this<ServerConnection> {
std::function<void(const ray::Status &)> handler;
};

enum struct ConnectionStatus { RUNNING = 0, TERMINATING, TERMINATED };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
enum struct ConnectionStatus { RUNNING = 0, TERMINATING, TERMINATED };
enum class ConnectionStatus { RUNNING = 0, TERMINATING = 1, TERMINATED = 2 };

just personal preference haha..

@@ -125,6 +126,8 @@ class ServerConnection : public std::enable_shared_from_this<ServerConnection> {

std::string DebugString() const;

void AsyncWaitTerminated(std::function<void()> callback);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a docstring?

@@ -125,6 +126,8 @@ class ServerConnection : public std::enable_shared_from_this<ServerConnection> {

std::string DebugString() const;

void AsyncWaitTerminated(std::function<void()> callback);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void AsyncWaitTerminated(std::function<void()> callback);
void CloseAndAsyncWaitTerminated(std::function<void()> callback);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we are not closing it actively I believe. It's waiting until bad thing happened passivately. Maybe we shouldn't include close there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I thought we close it because we have Close() method inside this Wait method (after async_wait).

status_ = ConnectionStatus::TERMINATING;
}

if (status_ == ConnectionStatus::TERMINATING) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't know if I understand this part correctly. I have a couple questions here;

  1. Isn't it possible this code can corrupt the existing message handler? For example, it seems to be possible if the connection wasn't closed, this can read the socket that's supposed to be read by the raylet message handler, which can corrupt the data passed to the message handler. I feel like the behavior could be some ungrateful failure if this happens.
  2. If status == TERMINATING, we already guaranteed that the socket will be closed (because we close the connection after async_wait). IN this case, do we need any additional logic? Isn't this sufficient to just make the method idempotent by adding if (status==TERMINATING) return;?

if (status_ == ConnectionStatus::RUNNING) {
socket_.async_wait(local_stream_socket::wait_type::wait_error,
[this, callback](auto) {
if (status_ != ConnectionStatus::TERMINATED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Isn't it possible this is returned "before" the connection is actually closed? It seems like this can return when it is simply ready to read the socket.

@@ -183,6 +183,41 @@ void ServerConnection::ReadBufferAsync(
});
}
}
void ServerConnection::AsyncWaitTerminated(std::function<void()> callback) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we write unit tests for this method? Seems like it is worth testing it (the logic is pretty complicated).

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have some questions/concerns regarding reading sockets from this new method + original client.

Alternatively, what about we just handle the failure here?

read_message_ = error_data;

So if it is failed & call DisconnectClient, we call Close() inside client_connection.cc and invoke a callback (which reports the worker failure). We can also make the additional boolean like "is_disconnected" to make this idempotent?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 10, 2023
@fishbone
Copy link
Contributor Author

Have some questions/concerns regarding reading sockets from this new method + original client.

Alternatively, what about we just handle the failure here?

read_message_ = error_data;

So if it is failed & call DisconnectClient, we call Close() inside client_connection.cc and invoke a callback (which reports the worker failure). We can also make the additional boolean like "is_disconnected" to make this idempotent?

The issue here is that, once we received DisconnectClient, the draining part is not handled any more. And the draining is similar as this one.

Maybe I can change that logic there and make the change simpler and easier to understand. Let me give it a try.

@rkooo567
Copy link
Contributor

rkooo567 commented May 10, 2023

Sgtm! I scheduled a short meeting tomorrow to talk more about "the draining part is not handled any more. And the draining is similar as this one.

"

@fishbone
Copy link
Contributor Author

Offline synced. I'm going to close this one and open two things:

  • refactoring process by using boost process
  • make a short term fix by exchanging the lines in core worker.

@fishbone fishbone closed this May 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants