
[core] Fix GCS FD usage increase regression. #35624

Merged
merged 23 commits on May 24, 2023

Conversation

fishbone
Contributor

@fishbone fishbone commented May 22, 2023

Why are these changes needed?

After the GCS client was moved to C++, FD usage increased by one: previously it was 2, and after that change it became 3.

In this fix, we reuse the channel to make sure there are only 2 connections between GCS and CoreWorker. We still create 3 channels, but we create them with the same arguments and depend on gRPC to reuse the TCP connections it has already created.

Why it was previously 2 has not been figured out. Maybe gRPC has some hidden machinery that can reuse the connection in some way.
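The reuse strategy can be illustrated with a small sketch (illustrative Python only, not the actual Ray/gRPC code; `create_channel` and `_subchannel_pool` are made-up names). gRPC's C core keeps a pool of subchannels, and channels created against the same target with identical channel arguments can share a subchannel, i.e. a single TCP connection:

```python
# Toy model of subchannel sharing: a "channel" object is cheap, but the
# underlying connection lives in a pool keyed by (target, channel args).
# Creating three channels with identical arguments yields one shared connection.
_subchannel_pool = {}

def create_channel(target, channel_args):
    key = (target, tuple(sorted(channel_args.items())))
    conn = _subchannel_pool.get(key)
    if conn is None:
        conn = object()  # stand-in for a real TCP connection
        _subchannel_pool[key] = conn
    return {"target": target, "connection": conn}

args = {"grpc.keepalive_time_ms": 60000}
channels = [create_channel("gcs:6379", args) for _ in range(3)]
# All three channels share one underlying "connection".
assert len({id(c["connection"]) for c in channels}) == 1
```

This is why the fix insists on identical channel arguments: any mismatch produces a different pool key and therefore an extra TCP connection (and FD).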

Related issue number

#34635

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Yi Cheng <[email protected]>
@fishbone fishbone requested a review from a team as a code owner May 22, 2023 21:13
@scv119
Contributor

scv119 commented May 22, 2023

Looks like a low-risk fix.

Worth having @rkooo567 or @jjyao take a second look!

@fishbone
Contributor Author

[INFO 2023-05-22 21:51:58,023] log.py: 31  Got the following metadata:
  name:    many_nodes_actor_test_on_v2.aws
  status:  finished
  runtime: 1271.33
  stable:  True

  buildkite_url:
  wheels_url:    https://yic-data.s3.us-west-2.amazonaws.com/ray-3.0.0.gcs0-cp38-cp38-linux_x86_64.whl
  cluster_url:   https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_nmhigs1zrxknuelzk5bj4muh5b
  job_url:   https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_ccem5ljmc32c3gftyemjheawc5

[INFO 2023-05-22 21:51:58,023] log.py: 41  Observed the following results:

  many_nodes_actor_tests_10000 = {'actor_launch_time': 3.002324881999982, 'actor_ready_time': 34.008151804999954, 'total_time': 37.01047668699994, 'num_actors': 10000, 'success': '1', 'throughput': 270.1937639055737}
  many_nodes_actor_tests_20000 = {'actor_launch_time': 5.144363610000028, 'actor_ready_time': 156.69886434700004, 'total_time': 161.84322795700007, 'num_actors': 20000, 'success': '1', 'throughput': 123.57637852671708}
  perf_metrics = [{'perf_metric_name': 'many_nodes_actor_tests_10000', 'perf_metric_value': 270.1937639055737, 'perf_metric_type': 'THROUGHPUT'}, {'perf_metric_name': 'many_nodes_actor_tests_20000', 'perf_metric_value': 123.57637852671708, 'perf_metric_type': 'THROUGHPUT'}, {'perf_metric_name': 'dashboard_p50_latency_ms', 'perf_metric_value': 5.607, 'perf_metric_type': 'LATENCY'}, {'perf_metric_name': 'dashboard_p95_latency_ms', 'perf_metric_value': 247.633, 'perf_metric_type': 'LATENCY'}, {'perf_metric_name': 'dashboard_p99_latency_ms', 'perf_metric_value': 663.018, 'perf_metric_type': 'LATENCY'}]
  _dashboard_test_success = True
  _dashboard_memory_usage_mb = 711.831552

@pcmoritz
Contributor

Thanks a ton for fixing this :)

arguments_.c_channel_args().num_args == arguments.c_channel_args().num_args) {
return channel_;
} else {
RAY_LOG(WARNING) << "Generate a new GCS channel: " << address << ":" << port
Collaborator

When can this happen? When GCS restarts?

Contributor Author

In testing

Contributor

Can you add a comment (that this should only happen in testing)? It might also be better to use ERROR instead of WARNING and write "This shouldn't happen unless it is testing" in the log message.

Contributor Author

Hi Sang, I agree with you on this, but it doesn't impact correctness. So if some case falls into this path and we print an error, it doesn't look good.

Actually, I think this might happen on the driver side if you ray.init to a different GCS. Let me follow Chen's comment and just update the global client.

Contributor

Yeah, that also works. I believe that if we have 2 clients at the same time, this could cause a correctness issue, but I guess that shouldn't happen.

Contributor Author

@rkooo567 I think one case is:

```python
ray.init(GCS_1)
# do something
ray.shutdown()

ray.init(GCS_2)
# do something
ray.shutdown()
```

My concern is that ERROR could push the error to the driver, which doesn't look good.
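The scenario above suggests a simple policy: keep a single process-wide channel and replace it when the requested GCS address changes, without treating the change as an error. A minimal sketch (hypothetical Python, not the actual C++ `GetDefaultChannel` implementation):

```python
# Hypothetical sketch: one global channel slot, replaced when the requested
# address/port differs from the cached one (e.g. ray.init() against a new GCS
# after ray.shutdown()). Nothing is logged for this expected case.
_cached = None  # (address, port, channel)

def get_default_channel(address, port):
    global _cached
    if _cached is not None and _cached[0] == address and _cached[1] == port:
        return _cached[2]
    channel = object()  # stand-in for building a real gRPC channel
    _cached = (address, port, channel)
    return channel
```

Repeated calls with the same address return the same channel; a new address simply swaps the cached entry instead of warning.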

@rkooo567
Contributor

I think previously, we had 1 for python & 1 for core worker?

python/ray/tests/test_advanced_9.py (resolved review threads)
```diff
- auto arguments = PythonGrpcChannelArguments();
- channel_ = rpc::BuildChannel(options_.gcs_address_, options_.gcs_port_, arguments);
+ channel_ =
+     rpc::GcsRpcClient::GetDefaultChannel(options_.gcs_address_, options_.gcs_port_);
```
Contributor

Are we planning to cherry-pick this, btw? There's a bit of concern about changing these settings. It looks like after this, all Python clients' timeout will go from 60 to 30 seconds. Should we increase the default grpc_client_keepalive_time_ms to 60 seconds?

Contributor Author

Hmmm, I feel that if the core worker's GCS client times out, that's also considered bad, and it won't make progress. Given that, to keep things alive we need both of them to be alive, if I understand it correctly. So it should be OK, I think.
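For reference, the keepalive behavior discussed here maps onto standard gRPC channel-argument keys (the key names below are the standard gRPC ones; the values are illustrative, not necessarily Ray's defaults):

```python
# Standard gRPC channel-argument keys governing client keepalive. Since the
# Python and C++ clients now share one channel, they also share one setting,
# which is why the 60 s vs 30 s difference above matters.
keepalive_args = {
    "grpc.keepalive_time_ms": 60000,           # send a keepalive ping every 60 s
    "grpc.keepalive_timeout_ms": 20000,        # fail if no ack within 20 s
    "grpc.keepalive_permit_without_calls": 1,  # ping even with no active RPCs
}
```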

src/ray/rpc/gcs_server/gcs_rpc_client.cc (outdated, resolved)

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 23, 2023
@fishbone
Contributor Author

Created a ticket to track the follow-up as P1: #35684

@fishbone
Contributor Author

> I think previously, we had 1 for python & 1 for core worker?

@rkooo567 this is also what I thought (#35546). But after following Chen's suggestion, surprisingly, there were somehow only 2 sockets created before #33769 and 3 after it. So I'm sure the extra client comes from that PR: we don't reuse the channel there (Python reuses it).

Previously I thought TCP connections shouldn't be reused automatically if the channels' arguments are not the same, but somehow it was only 2. I still can't fully explain it. But with this PR we can be sure we only have one GCS channel from CoreWorker -> GCS.

```diff
  p = psutil.Process(gcs_server_pid)
- print(">>", p.num_fds())
- return p.num_fds()
+ print(">>", len(p.connections()))
```
Contributor Author

Use connections() instead of FDs for a better measurement of the sockets in GCS.
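When psutil isn't available, the socket count can be approximated directly from /proc (a hedged, Linux-only sketch; this is not what the test in this PR uses, and `num_socket_fds` is an illustrative name):

```python
import os

def num_socket_fds(pid="self"):
    # Linux-only: each open FD of a process appears under /proc/<pid>/fd, and
    # socket FDs resolve to symlinks of the form "socket:[<inode>]".
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, fd)).startswith("socket:"):
                count += 1
        except OSError:
            pass  # the FD may have been closed between listdir and readlink
    return count
```

Counting only socket FDs, as the connections() change does, avoids noise from log files, pipes, and other non-socket FDs the GCS server holds.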

@fishbone
Contributor Author

The bad thing here is that working this way somehow makes the tests run slower, and a lot of tests failed. This is pretty bad and hard to fix (some because they are badly written, and some because they run out of time).

One thing we can do is not ask them to share the channel, which gives us the perf back. The regression doesn't come from the lock.

gRPC internally will try to reuse the socket if the arguments are the same, and we can make use of that. But in the test there is somehow one more FD created.

I don't have the bandwidth to figure out what's going on, so I basically just increased the expected FD count in the test by one.

@fishbone fishbone added the do-not-merge Do not merge this PR! label May 24, 2023
@fishbone
Contributor Author

@fishbone fishbone merged commit f78626a into ray-project:master May 24, 2023
@fishbone
Contributor Author

cc @scv119 @pcmoritz @rkooo567 for awareness

@fishbone
Contributor Author

> One thing we can do is not ask them to share the channel, which gives us the perf back. The regression doesn't come from the lock.

> @iycheng, I don't fully understand some of the comments here. Could we sit down together with @scv119, review the PRs, and see how we can work together to ship this reliably and smoothly? I will set up a meeting tomorrow since @scv119 is OOF today.

Sorry I missed your comment. Feel free to set up a meeting to discuss this.

@fishbone fishbone removed the do-not-merge Do not merge this PR! label May 24, 2023
fishbone added a commit to fishbone/ray that referenced this pull request May 24, 2023
ArturNiederfahrenhorst pushed a commit that referenced this pull request May 25, 2023
scv119 pushed a commit to scv119/ray that referenced this pull request Jun 16, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
5 participants