Revert "Revert "Deflakey test advanced 9"" #35091

fishbone · 2023-05-05T20:12:51Z

Reverts #35090

async_wait doesn't behave the same way on Linux, macOS, and Windows, so this PR uses another method to monitor the connection.

The only updates are in the latest commit.

The new method listens to two events:

EOF
Connection failure.

Together, these should cover all platforms.
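
A minimal sketch of telling the two events apart, assuming a POSIX stream socket rather than the Boost.Asio socket this PR uses. `ProbeResult` and `ProbeConnection` are hypothetical names, not part of Ray. The probe peeks instead of reading, so it does not consume bytes that a regular message reader still expects.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical names for illustration; not part of Ray's ServerConnection.
enum class ProbeResult { kAlive, kEof, kFailed };

// Peeks at the socket without blocking and without consuming any bytes.
inline ProbeResult ProbeConnection(int fd) {
  assert(fd >= 0);
  char byte = 0;
  const ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
  if (n > 0) {
    return ProbeResult::kAlive;  // Data is pending; it stays in the buffer.
  }
  if (n == 0) {
    return ProbeResult::kEof;  // The peer closed its end in an orderly way.
  }
  if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) {
    return ProbeResult::kAlive;  // Nothing to read right now.
  }
  return ProbeResult::kFailed;  // For example ECONNRESET.
}
```

Note that pending data masks EOF in this sketch: if the peer writes and then closes, the probe reports `kAlive` until the buffered bytes have been read.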

This reverts commit 6de9403.

fishbone · 2023-05-06T05:45:33Z

Hmm, this approach doesn't work on Linux. :(

Signed-off-by: Yi Cheng <[email protected]>

rkooo567

I have some questions and concerns about reading the socket from both this new method and the original client.

Alternatively, what if we just handle the failure here?

ray/src/ray/common/client_connection.cc

Line 515 in c70331d

read_message_ = error_data;

So if the read fails and we call DisconnectClient, we call Close() inside client_connection.cc and invoke a callback (which reports the worker failure). We could also add a boolean like "is_disconnected" to make this idempotent?
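
A minimal sketch of the idempotency idea, assuming the disconnect path can be wrapped behind a single flag. `DisconnectOnce` is a hypothetical stand-in, not Ray's `ServerConnection`, and the callback stands in for "close the socket and report the worker failure".

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical wrapper for illustration; not Ray's ServerConnection.
class DisconnectOnce {
 public:
  explicit DisconnectOnce(std::function<void()> on_disconnect)
      : on_disconnect_(std::move(on_disconnect)) {
    assert(on_disconnect_ && "a disconnect callback is required");
  }

  // Returns true only for the call that actually performed the disconnect.
  bool Disconnect() {
    // exchange() returns the previous value, so only the first caller sees false.
    if (is_disconnected_.exchange(true)) {
      return false;
    }
    on_disconnect_();
    return true;
  }

  bool is_disconnected() const { return is_disconnected_.load(); }

 private:
  std::function<void()> on_disconnect_;
  std::atomic<bool> is_disconnected_{false};
};
```

Calling `Disconnect()` from both the read-failure path and the `DisconnectClient` path would then run the callback once, whichever path gets there first.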

rkooo567 · 2023-05-10T06:21:26Z

src/ray/core_worker/core_worker.cc

  KillChildProcs();

+  // Disconnect here before KillChildProcs to make the Raylet async wait shorter.


Is this comment valid?

Disconnect here before KillChildProcs

It looks like the code calls Disconnect "after" KillChildProcs?

rkooo567 · 2023-05-10T06:21:34Z

src/ray/core_worker/core_worker.cc

          KillChildProcs();
+          // Disconnect here after KillChildProcs to make the Raylet async wait shorter.


same comment as below

rkooo567 · 2023-05-10T06:22:16Z

src/ray/common/client_connection.h

@@ -138,6 +141,11 @@ class ServerConnection : public std::enable_shared_from_this<ServerConnection> {
    std::function<void(const ray::Status &)> handler;
  };

+  enum struct ConnectionStatus { RUNNING = 0, TERMINATING, TERMINATED };


Suggested change

enum struct ConnectionStatus { RUNNING = 0, TERMINATING, TERMINATED };

enum class ConnectionStatus { RUNNING = 0, TERMINATING = 1, TERMINATED = 2 };

just personal preference haha..
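
For reference, a small sketch showing that the two spellings declare the same kind of type: `enum struct` and `enum class` are both scoped enumerations, and unlisted enumerators simply continue from the previous value. The type names `ImplicitStatus`, `ExplicitStatus`, and the helper `ToInt` are made up for this comparison.

```cpp
#include <type_traits>

// Names are hypothetical; the enumerators mirror the ones in the diff.
enum struct ImplicitStatus { RUNNING = 0, TERMINATING, TERMINATED };
enum class ExplicitStatus { RUNNING = 0, TERMINATING = 1, TERMINATED = 2 };

template <typename E>
constexpr int ToInt(E e) {
  return static_cast<int>(e);
}

// Both declarations yield the same numeric values.
static_assert(ToInt(ImplicitStatus::RUNNING) == ToInt(ExplicitStatus::RUNNING), "");
static_assert(ToInt(ImplicitStatus::TERMINATING) == ToInt(ExplicitStatus::TERMINATING), "");
static_assert(ToInt(ImplicitStatus::TERMINATED) == ToInt(ExplicitStatus::TERMINATED), "");

// A scoped enumeration without an explicit base uses int as its underlying type.
static_assert(std::is_same<std::underlying_type<ImplicitStatus>::type, int>::value, "");
static_assert(std::is_same<std::underlying_type<ExplicitStatus>::type, int>::value, "");
```

So the suggestion changes readability only, not behavior.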

rkooo567 · 2023-05-10T06:23:00Z

src/ray/common/client_connection.h

@@ -125,6 +126,8 @@ class ServerConnection : public std::enable_shared_from_this<ServerConnection> {

  std::string DebugString() const;

+  void AsyncWaitTerminated(std::function<void()> callback);


Can you add a docstring?

rkooo567 · 2023-05-10T06:23:21Z

src/ray/common/client_connection.h

@@ -125,6 +126,8 @@ class ServerConnection : public std::enable_shared_from_this<ServerConnection> {

  std::string DebugString() const;

+  void AsyncWaitTerminated(std::function<void()> callback);


Suggested change

void AsyncWaitTerminated(std::function<void()> callback);

void CloseAndAsyncWaitTerminated(std::function<void()> callback);

Actually, I don't believe we are actively closing it. It waits passively until something bad happens. Maybe we shouldn't include "Close" in the name?

I see. I thought we closed it because there is a Close() call inside this wait method (after async_wait).

rkooo567 · 2023-05-10T06:49:51Z

src/ray/common/client_connection.cc

+    status_ = ConnectionStatus::TERMINATING;
+  }
+
+  if (status_ == ConnectionStatus::TERMINATING) {


Hmm, I'm not sure I understand this part correctly. I have a couple of questions:

Can this code corrupt the existing message handler? If the connection hasn't been closed, this could read bytes from the socket that the raylet message handler is supposed to read, which would corrupt the data passed to that handler. I feel like the result could be an ungraceful failure.

If status == TERMINATING, we have already guaranteed that the socket will be closed (because we close the connection after async_wait). In this case, do we need any additional logic? Wouldn't it be sufficient to make the method idempotent by adding if (status==TERMINATING) return;?
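
A minimal sketch of the early-return idea from the second question, modeling only the state machine with no real socket. `TerminationTracker` and `OnSocketClosed` are hypothetical; `ConnectionStatus` mirrors the enum in the diff.

```cpp
#include <cassert>
#include <functional>
#include <utility>

enum class ConnectionStatus { RUNNING = 0, TERMINATING = 1, TERMINATED = 2 };

// Hypothetical model of the state transitions only; not Ray's implementation.
class TerminationTracker {
 public:
  // Starts waiting for termination. Repeated calls are no-ops and return false.
  bool AsyncWaitTerminated(std::function<void()> callback) {
    assert(callback && "callback must be callable");
    if (status_ != ConnectionStatus::RUNNING) {
      return false;  // A wait is already pending or already finished.
    }
    status_ = ConnectionStatus::TERMINATING;
    callback_ = std::move(callback);
    return true;
  }

  // Stands in for the completion handler that runs once the socket is closed.
  void OnSocketClosed() {
    if (status_ == ConnectionStatus::TERMINATED) {
      return;
    }
    status_ = ConnectionStatus::TERMINATED;
    if (callback_) {
      auto cb = std::move(callback_);
      callback_ = nullptr;
      cb();
    }
  }

  ConnectionStatus status() const { return status_; }

 private:
  ConnectionStatus status_ = ConnectionStatus::RUNNING;
  std::function<void()> callback_;
};
```

One trade-off of the plain early return: the callback passed to a second call is dropped, so callers that need a notification would have to be queued instead.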

rkooo567 · 2023-05-10T06:57:05Z

src/ray/common/client_connection.cc

+  if (status_ == ConnectionStatus::RUNNING) {
+    socket_.async_wait(local_stream_socket::wait_type::wait_error,
+                       [this, callback](auto) {
+                         if (status_ != ConnectionStatus::TERMINATED) {


Q: Couldn't this handler fire "before" the connection is actually closed? It seems like it can return as soon as the socket is merely ready to read.
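
The concern can be illustrated with plain POSIX `poll`, used here as an analogy and not as a description of how Boost.Asio's `async_wait` behaves on any platform. The helper names are hypothetical. The sketch shows a socket that polls as readable while the peer is still open, so readiness alone cannot be treated as termination.

```cpp
#include <cassert>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: returns the poll() revents for fd, or 0 on timeout/error.
inline short PollReadable(int fd, int timeout_ms) {
  assert(fd >= 0);
  pollfd p{};
  p.fd = fd;
  p.events = POLLIN;
  const int rc = poll(&p, 1, timeout_ms);
  return rc > 0 ? p.revents : 0;
}

// Only a hang-up or error flag hints at termination; POLLIN alone does not.
inline bool LooksTerminated(short revents) {
  return (revents & (POLLHUP | POLLERR | POLLNVAL)) != 0;
}

// Demonstration: the peer is still open, yet the socket polls as readable.
inline bool ReadableWhilePeerOpen() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
    return false;
  }
  const char byte = 'x';
  const bool wrote = write(fds[1], &byte, 1) == 1;
  const short revents = PollReadable(fds[0], 100);
  close(fds[0]);
  close(fds[1]);
  return wrote && (revents & POLLIN) != 0 && !LooksTerminated(revents);
}
```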

rkooo567 · 2023-05-10T06:57:16Z

src/ray/common/client_connection.cc

@@ -183,6 +183,41 @@ void ServerConnection::ReadBufferAsync(
        });
  }
 }
+void ServerConnection::AsyncWaitTerminated(std::function<void()> callback) {


Can we write unit tests for this method? Seems like it is worth testing it (the logic is pretty complicated).

fishbone · 2023-05-10T07:16:14Z

I have some questions and concerns about reading the socket from both this new method and the original client.

Alternatively, what if we just handle the failure here?

ray/src/ray/common/client_connection.cc

Line 515 in c70331d

read_message_ = error_data;

So if the read fails and we call DisconnectClient, we call Close() inside client_connection.cc and invoke a callback (which reports the worker failure). We could also add a boolean like "is_disconnected" to make this idempotent?

The issue here is that once we have received DisconnectClient, the draining part is not handled any more. And the draining is similar to this one.

Maybe I can change the logic there to make this change simpler and easier to understand. Let me give it a try.

rkooo567 · 2023-05-10T09:02:14Z

Sgtm! I scheduled a short meeting tomorrow to talk more about "the draining part is not handled any more. And the draining is similar as this one."

fishbone · 2023-05-11T00:29:50Z

Synced offline. I'm going to close this one and open two things:

Refactor the process handling by using Boost.Process.
Make a short-term fix by exchanging the lines in core worker.

Revert "Revert "Deflakey test advanced 9 (#34883)" (#35090)"

972cbf9

This reverts commit 6de9403.

fishbone force-pushed the revert-35090-revert-34883-deflakey-test-advanced-9 branch from 4a83335 to bcdff25 Compare May 5, 2023 22:57

fishbone assigned jjyao and rkooo567 May 5, 2023

fishbone marked this pull request as ready for review May 5, 2023 22:58

fishbone requested a review from a team as a code owner May 5, 2023 22:58

fishbone force-pushed the revert-35090-revert-34883-deflakey-test-advanced-9 branch from bcdff25 to 66ce57d Compare May 5, 2023 22:59

fix

d17ce3b

fishbone force-pushed the revert-35090-revert-34883-deflakey-test-advanced-9 branch from 66ce57d to d17ce3b Compare May 5, 2023 23:00

fishbone mentioned this pull request May 5, 2023

[core][nightly] many_nodes_actor_test_on_v2.aws failed #34635

Closed

fishbone unassigned jjyao and rkooo567 May 6, 2023

fishbone marked this pull request as draft May 6, 2023 05:45

fix

e84f6f5

Signed-off-by: Yi Cheng <[email protected]>

fishbone marked this pull request as ready for review May 6, 2023 22:06

fishbone assigned scv119, jjyao and rkooo567 May 6, 2023

rkooo567 reviewed May 10, 2023

rkooo567 added the @author-action-required label May 10, 2023

fishbone closed this May 11, 2023
