
[core] Fix GCS FD usage increase regression. #35624

Merged
merged 23 commits on May 24, 2023

Conversation

fishbone
Contributor

@fishbone fishbone commented May 22, 2023

Why are these changes needed?

After the GCS client was moved to C++, FD usage increased by one: previously it was 2, and after that change it became 3.

In this fix, we reuse the channel to make sure there are only 2 connections between GCS and CoreWorker. We still create 3 channels, but we create them with the same arguments and depend on gRPC to reuse the TCP connections it has already created.

Why it was previously 2 has not been figured out. Maybe gRPC has some hidden machinery that can reuse the connection in some way.
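The reuse strategy can be illustrated with a small sketch (illustrative Python only, not the actual Ray/gRPC code; `create_channel` and `_subchannel_pool` are made-up names). gRPC's C core keeps a pool of subchannels, and channels created against the same target with identical channel arguments can share a subchannel, i.e. a single TCP connection:

```python
# Toy model of subchannel sharing: a "channel" object is cheap, but the
# underlying connection lives in a pool keyed by (target, channel args).
# Creating three channels with identical arguments yields one shared connection.
_subchannel_pool = {}

def create_channel(target, channel_args):
    key = (target, tuple(sorted(channel_args.items())))
    conn = _subchannel_pool.get(key)
    if conn is None:
        conn = object()  # stand-in for a real TCP connection
        _subchannel_pool[key] = conn
    return {"target": target, "connection": conn}

args = {"grpc.keepalive_time_ms": 60000}
channels = [create_channel("gcs:6379", args) for _ in range(3)]
# All three channels share one underlying "connection".
assert len({id(c["connection"]) for c in channels}) == 1
```

This is why the fix insists on identical channel arguments: any mismatch produces a different pool key and therefore an extra TCP connection (and FD).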

Related issue number

#34635

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Yi Cheng <[email protected]>
@fishbone fishbone requested a review from a team as a code owner May 22, 2023 21:13
@scv119
Contributor

scv119 commented May 22, 2023

Looks like a low-risk fix.

Worth having @rkooo567 or @jjyao take a second look!

@fishbone
Contributor Author

[INFO 2023-05-22 21:51:58,023] log.py: 31  Got the following metadata:
  name:    many_nodes_actor_test_on_v2.aws
  status:  finished
  runtime: 1271.33
  stable:  True

  buildkite_url:
  wheels_url:    https://yic-data.s3.us-west-2.amazonaws.com/ray-3.0.0.gcs0-cp38-cp38-linux_x86_64.whl
  cluster_url:   https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_nmhigs1zrxknuelzk5bj4muh5b
  job_url:   https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_ccem5ljmc32c3gftyemjheawc5

[INFO 2023-05-22 21:51:58,023] log.py: 41  Observed the following results:

  many_nodes_actor_tests_10000 = {'actor_launch_time': 3.002324881999982, 'actor_ready_time': 34.008151804999954, 'total_time': 37.01047668699994, 'num_actors': 10000, 'success': '1', 'throughput': 270.1937639055737}
  many_nodes_actor_tests_20000 = {'actor_launch_time': 5.144363610000028, 'actor_ready_time': 156.69886434700004, 'total_time': 161.84322795700007, 'num_actors': 20000, 'success': '1', 'throughput': 123.57637852671708}
  perf_metrics = [{'perf_metric_name': 'many_nodes_actor_tests_10000', 'perf_metric_value': 270.1937639055737, 'perf_metric_type': 'THROUGHPUT'}, {'perf_metric_name': 'many_nodes_actor_tests_20000', 'perf_metric_value': 123.57637852671708, 'perf_metric_type': 'THROUGHPUT'}, {'perf_metric_name': 'dashboard_p50_latency_ms', 'perf_metric_value': 5.607, 'perf_metric_type': 'LATENCY'}, {'perf_metric_name': 'dashboard_p95_latency_ms', 'perf_metric_value': 247.633, 'perf_metric_type': 'LATENCY'}, {'perf_metric_name': 'dashboard_p99_latency_ms', 'perf_metric_value': 663.018, 'perf_metric_type': 'LATENCY'}]
  _dashboard_test_success = True
  _dashboard_memory_usage_mb = 711.831552

@pcmoritz
Contributor

Thanks a ton for fixing this :)

arguments_.c_channel_args().num_args == arguments.c_channel_args().num_args) {
return channel_;
} else {
RAY_LOG(WARNING) << "Generate a new GCS channel: " << address << ":" << port
Collaborator

When can this happen? When GCS restarts?

Contributor Author

In testing

Contributor

Can you add a comment (that this should only happen in testing)? It might also be better to use ERROR instead of WARNING and write "This shouldn't happen unless it is testing" in the log message.

Contributor Author

Hi Sang, I agree with you on this, but it doesn't impact correctness. So if some case falls into this path and we print an error, it doesn't look good.

Actually, I think this might happen on the driver side if you ray.init to a different GCS. Let me follow Chen's comment and just update the global client.

Contributor

Yeah, that also works. I believe that if we have 2 clients at the same time, this could cause a correctness issue, but I guess that shouldn't happen.

Contributor Author

@rkooo567 I think one case is:

```python
ray.init(GCS_1)
# do something
ray.shutdown()

ray.init(GCS_2)
# do something
ray.shutdown()
```

My concern is that ERROR could push the error to the driver, which doesn't look good.
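The scenario above suggests a simple policy: keep a single process-wide channel and replace it when the requested GCS address changes, without treating the change as an error. A minimal sketch (hypothetical Python, not the actual C++ `GetDefaultChannel` implementation):

```python
# Hypothetical sketch: one global channel slot, replaced when the requested
# address/port differs from the cached one (e.g. ray.init() against a new GCS
# after ray.shutdown()). Nothing is logged for this expected case.
_cached = None  # (address, port, channel)

def get_default_channel(address, port):
    global _cached
    if _cached is not None and _cached[0] == address and _cached[1] == port:
        return _cached[2]
    channel = object()  # stand-in for building a real gRPC channel
    _cached = (address, port, channel)
    return channel
```

Repeated calls with the same address return the same channel; a new address simply swaps the cached entry instead of warning.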

@rkooo567
Contributor

I think previously, we had 1 for python & 1 for core worker?

python/ray/tests/test_advanced_9.py (resolved review threads)
```diff
- auto arguments = PythonGrpcChannelArguments();
- channel_ = rpc::BuildChannel(options_.gcs_address_, options_.gcs_port_, arguments);
+ channel_ =
+     rpc::GcsRpcClient::GetDefaultChannel(options_.gcs_address_, options_.gcs_port_);
```
Contributor

Are we planning to cherry-pick this, btw? There's a bit of concern about changing these settings. It looks like after this, all Python clients' timeout will go from 60 to 30 seconds. Should we increase the default grpc_client_keepalive_time_ms to 60 seconds?

Contributor Author

Hmmm, I feel that if the core worker's GCS client times out, that's also considered bad, and it won't make progress. Given that, to keep things alive we need both of them to be alive, if I understand it correctly. So it should be OK, I think.
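For reference, the keepalive behavior discussed here maps onto standard gRPC channel-argument keys (the key names below are the standard gRPC ones; the values are illustrative, not necessarily Ray's defaults):

```python
# Standard gRPC channel-argument keys governing client keepalive. Since the
# Python and C++ clients now share one channel, they also share one setting,
# which is why the 60 s vs 30 s difference above matters.
keepalive_args = {
    "grpc.keepalive_time_ms": 60000,           # send a keepalive ping every 60 s
    "grpc.keepalive_timeout_ms": 20000,        # fail if no ack within 20 s
    "grpc.keepalive_permit_without_calls": 1,  # ping even with no active RPCs
}
```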

src/ray/rpc/gcs_server/gcs_rpc_client.cc (outdated, resolved)

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 23, 2023
@fishbone
Contributor Author

Created a ticket to track the follow-up as P1: #35684

@fishbone
Contributor Author

> I think previously, we had 1 for python & 1 for core worker?

@rkooo567 this is also what I thought (#35546). But after following Chen's suggestion, surprisingly, there were somehow only 2 sockets created before #33769 and 3 after it. So I'm sure the extra client comes from that PR: we don't reuse the channel there (Python reuses it).

Previously I thought TCP connections shouldn't be reused automatically if the channels' arguments are not the same, but somehow it was only 2. I still can't fully explain it. But with this PR we can be sure we only have one GCS channel from CoreWorker -> GCS.

```diff
  p = psutil.Process(gcs_server_pid)
- print(">>", p.num_fds())
- return p.num_fds()
+ print(">>", len(p.connections()))
```
Contributor Author

Use connections() instead of FDs for a better measurement of the sockets in GCS.
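When psutil isn't available, the socket count can be approximated directly from /proc (a hedged, Linux-only sketch; this is not what the test in this PR uses, and `num_socket_fds` is an illustrative name):

```python
import os

def num_socket_fds(pid="self"):
    # Linux-only: each open FD of a process appears under /proc/<pid>/fd, and
    # socket FDs resolve to symlinks of the form "socket:[<inode>]".
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, fd)).startswith("socket:"):
                count += 1
        except OSError:
            pass  # the FD may have been closed between listdir and readlink
    return count
```

Counting only socket FDs, as the connections() change does, avoids noise from log files, pipes, and other non-socket FDs the GCS server holds.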

@fishbone
Contributor Author

The bad thing here is that working this way somehow makes the tests run slower, and a lot of tests failed. This is pretty bad and hard to fix (some because they are badly written, and some because they run out of time).

One thing we can do is not ask them to share the channel, which gives us the perf back. The regression doesn't come from the lock.

gRPC internally will try to reuse the socket if the arguments are the same, and we can make use of that. But in the test there is somehow one more FD created.

I don't have the bandwidth to figure out what's going on, so I basically just increased the expected FD count in the test by one.

@fishbone fishbone added the do-not-merge Do not merge this PR! label May 24, 2023
@fishbone
Contributor Author

@fishbone fishbone merged commit f78626a into ray-project:master May 24, 2023
@fishbone
Contributor Author

cc @scv119 @pcmoritz @rkooo567 for awareness

@fishbone
Contributor Author

> One thing we can do is not ask them to share the channel, which gives us the perf back. The regression doesn't come from the lock.

> @iycheng, I don't fully understand some of the comments here. Could we sit down together with @scv119, review the PRs, and see how we can work together to ship this reliably and smoothly? I will set up a meeting tomorrow since @scv119 is OOF today.

Sorry I missed your comment. Feel free to set up a meeting to discuss this.

@fishbone fishbone removed the do-not-merge Do not merge this PR! label May 24, 2023
fishbone added a commit to fishbone/ray that referenced this pull request May 24, 2023
ArturNiederfahrenhorst pushed a commit that referenced this pull request May 25, 2023
scv119 pushed a commit to scv119/ray that referenced this pull request Jun 16, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
5 participants