[core] Fix GCS FD usage increase regression. #35624
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -146,8 +146,8 @@ std::pair<std::string, int> GcsClient::GetGcsServerAddress() const { | |
PythonGcsClient::PythonGcsClient(const GcsClientOptions &options) : options_(options) {} | ||
|
||
Status PythonGcsClient::Connect() { | ||
- auto arguments = PythonGrpcChannelArguments(); | ||
- channel_ = rpc::BuildChannel(options_.gcs_address_, options_.gcs_port_, arguments); | ||
+ channel_ = | ||
+ rpc::GcsRpcClient::GetDefaultChannel(options_.gcs_address_, options_.gcs_port_); | ||
Are we planning to cherry-pick this, by the way? I'm a bit concerned that this changes the channel settings. It looks like after this change the timeout for all Python clients goes from 60 to 30 seconds. Should we increase the default grpc_client_keepalive_time_ms to 60 seconds?
Hmm, I think that if the core worker's GCS client times out, that is also bad and the worker won't make progress. So for things to stay alive, both clients need to be alive, if I understand it correctly. Given that, I think this should be OK.
||
kv_stub_ = rpc::InternalKVGcsService::NewStub(channel_); | ||
runtime_env_stub_ = rpc::RuntimeEnvGcsService::NewStub(channel_); | ||
node_info_stub_ = rpc::NodeInfoGcsService::NewStub(channel_); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
// Copyright 2023 The Ray Authors. | ||
// | ||
// Licensed under the Apache License, Version 2.0 (the "License"); | ||
// you may not use this file except in compliance with the License. | ||
// You may obtain a copy of the License at | ||
// | ||
// http://www.apache.org/licenses/LICENSE-2.0 | ||
// | ||
// Unless required by applicable law or agreed to in writing, software | ||
// distributed under the License is distributed on an "AS IS" BASIS, | ||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
// See the License for the specific language governing permissions and | ||
// limitations under the License. | ||
|
||
#include "ray/rpc/gcs_server/gcs_rpc_client.h" | ||
|
||
namespace ray { | ||
namespace rpc { | ||
grpc::ChannelArguments GetGcsRpcClientArguments() { | ||
grpc::ChannelArguments arguments = CreateDefaultChannelArguments(); | ||
arguments.SetInt(GRPC_ARG_MAX_RECONNECT_BACKOFF_MS, | ||
::RayConfig::instance().gcs_grpc_max_reconnect_backoff_ms()); | ||
arguments.SetInt(GRPC_ARG_MIN_RECONNECT_BACKOFF_MS, | ||
::RayConfig::instance().gcs_grpc_min_reconnect_backoff_ms()); | ||
arguments.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, | ||
::RayConfig::instance().gcs_grpc_initial_reconnect_backoff_ms()); | ||
return arguments; | ||
} | ||
|
||
std::shared_ptr<grpc::Channel> GcsRpcClient::GetDefaultChannel(const std::string &address, | ||
int port) { | ||
static std::mutex mu_; | ||
static std::shared_ptr<grpc::Channel> channel_; | ||
static std::string address_; | ||
static int port_ = 0; | ||
|
||
// Don't reuse the channel if an HTTP proxy or TLS is enabled. | ||
// TODO: Reuse the channel even when TLS is enabled. | ||
// Right now, if we do this, python/ray/serve/tests/test_grpc.py | ||
// will fail. | ||
if (::RayConfig::instance().grpc_enable_http_proxy() || | ||
::RayConfig::instance().USE_TLS()) { | ||
return BuildChannel(address, port, GetGcsRpcClientArguments()); | ||
} | ||
|
||
std::lock_guard<std::mutex> guard(mu_); | ||
if (channel_ == nullptr || (address_ != address || port_ != port)) { | ||
address_ = address; | ||
port_ = port; | ||
|
||
// This condition shouldn't happen in most cases. It could only happen when | ||
// ray driver wanted to talk with different GCS. | ||
// - This mostly happens in testing, where the test main process is the driver. | ||
// It calls ray.init and then ray.shutdown and later ray.init with a different | ||
// GCS address. | ||
// - Potentially it can also happen in the user's driver where there are two | ||
// ray clusters and the user ray.init and ray.shutdown and then tries to | ||
// connect to a different GCS. | ||
if (channel_ != nullptr) { | ||
RAY_LOG(WARNING) << "Generate a new GCS channel: " << address << ":" << port | ||
<< ". Potentially it will increase GCS socket numbers." | ||
<< " This could only happen in testing or when the same driver" | ||
<< " tries to connect to different GCS clusters."; | ||
} | ||
channel_ = BuildChannel(address, port, GetGcsRpcClientArguments()); | ||
} | ||
|
||
return channel_; | ||
} | ||
|
||
} // namespace rpc | ||
} // namespace ray |
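The caching logic in `GetDefaultChannel` can be tried out without gRPC. The following is a minimal sketch, not the PR's code: `FakeChannel`, `BuildFakeChannel`, `BuildCount`, and `GetCachedChannel` are hypothetical names standing in for `grpc::Channel`, `BuildChannel`, and `GcsRpcClient::GetDefaultChannel`, and the proxy/TLS bypass is left out. It keeps one process-wide channel behind a mutex and rebuilds it only when the requested address or port changes.

```cpp
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for grpc::Channel so the sketch builds without gRPC.
struct FakeChannel {
  std::string address;
  int port;
};

// Number of channels built so far (for illustration only).
int &BuildCount() {
  static int count = 0;
  return count;
}

// Hypothetical stand-in for BuildChannel(): every call creates a new channel.
std::shared_ptr<FakeChannel> BuildFakeChannel(const std::string &address, int port) {
  ++BuildCount();
  return std::make_shared<FakeChannel>(FakeChannel{address, port});
}

// Returns the cached channel, rebuilding it only when the target changes.
std::shared_ptr<FakeChannel> GetCachedChannel(const std::string &address, int port) {
  static std::mutex mu;
  static std::shared_ptr<FakeChannel> channel;

  std::lock_guard<std::mutex> guard(mu);
  if (channel == nullptr || channel->address != address || channel->port != port) {
    channel = BuildFakeChannel(address, port);
  }
  return channel;
}
```

Because the function returns a `std::shared_ptr`, callers that already hold the previous channel keep it alive after the cache is replaced. In this sketch, that is why connecting to a second address adds a channel rather than swapping the old one out from under its users.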
Use connections instead of fds for better measurement of the sockets in GCS.