[core] Fix GCS FD usage increase regression. #35624
Conversation
Thanks a ton for fixing this :)
    arguments_.c_channel_args().num_args == arguments.c_channel_args().num_args) {
  return channel_;
} else {
  RAY_LOG(WARNING) << "Generate a new GCS channel: " << address << ":" << port
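For readers less familiar with the C++ side, here is a minimal Python sketch of the caching idea this hunk discusses: return the cached channel when the target and channel arguments match, and only build (and warn about) a new one otherwise. The function and variable names and the use of grpc.insecure_channel are illustrative assumptions, not the actual Ray implementation.

```python
import logging
import grpc

_cached_channel = None
_cached_key = None  # (address, port, channel_args)


def get_default_channel(address, port, channel_args):
    """Return a cached gRPC channel, rebuilding it only if the target or args change."""
    global _cached_channel, _cached_key
    key = (address, port, tuple(channel_args))
    if _cached_channel is not None and _cached_key == key:
        return _cached_channel
    # Mirrors the RAY_LOG(WARNING) branch above: not expected outside of tests
    # or re-initialization against a different GCS.
    logging.warning("Generate a new GCS channel: %s:%s", address, port)
    _cached_channel = grpc.insecure_channel(f"{address}:{port}", options=list(channel_args))
    _cached_key = key
    return _cached_channel
```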
When can this happen? When GCS restarts?
In testing
Can you add a comment (that this shouldn't happen unless it is testing)? It might also be great if we use ERROR instead of WARNING and write "This shouldn't happen unless it is testing" in the log message.
Hi Sang, I agree with you about this. But this won't impact correctness, so if a case falls into this branch and we print an error, it doesn't look good.
Actually, I think if you ray.init to a different GCS this might happen on the driver side. Let me follow Chen's comment and just update the global client.
yeah, that also works. I believe if we have 2 clients at the same time, this could have a correctness issue, but I guess this shouldn't happen.
@rkooo567 I think one case:
ray.init(GCS_1)
# do something
ray.shutdown()
ray.init(GCS_2)
# do something
ray.shutdown()
My concern is that ERROR could push the error to the driver, which doesn't look good.
I think previously, we had 1 for python & 1 for core worker?
src/ray/gcs/gcs_client/gcs_client.cc
-  auto arguments = PythonGrpcChannelArguments();
-  channel_ = rpc::BuildChannel(options_.gcs_address_, options_.gcs_port_, arguments);
+  channel_ =
+      rpc::GcsRpcClient::GetDefaultChannel(options_.gcs_address_, options_.gcs_port_);
Are we planning to cherry-pick this btw? There's a bit of concern about changing this setting. It looks like after this, all python clients' timeout will go from 60 -> 30 seconds. Should we increase the default grpc_client_keepalive_time_ms to 60 seconds?
Hmmm, I feel that if the core worker's GCS client times out, that's also considered bad, and it won't make progress. Given this, to keep things working we need both of them to stay alive, if I understand it correctly. So it should be OK I think.
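As a concrete illustration of the keepalive discussion above: a gRPC client's keepalive interval is usually passed as a channel argument. The sketch below is a hedged example of what bumping the client keepalive to 60 seconds could look like at the raw gRPC level; the option names are standard gRPC channel arguments, while the address and the exact values are hypothetical and not taken from Ray's grpc_client_keepalive_time_ms plumbing.

```python
import grpc

# Hypothetical example: a channel whose client-side keepalive pings are sent
# every 60 seconds instead of 30, using standard gRPC channel arguments.
keepalive_options = [
    ("grpc.keepalive_time_ms", 60_000),          # ping the server every 60s
    ("grpc.keepalive_timeout_ms", 20_000),       # wait up to 20s for the ping ack
    ("grpc.keepalive_permit_without_calls", 1),  # keep pinging even when idle
]
channel = grpc.insecure_channel("127.0.0.1:6379", options=keepalive_options)
```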
Created a ticket to track the follow-up as P1. #35684
@rkooo567 this is also what I thought (#35546). But after following Chen's suggestion, surprisingly, there are only 2 sockets created before #33769 and 3 after it. So I'm sure the extra one comes from that PR, where we don't reuse the channel (the Python client reuses it). Previously I thought TCP connections shouldn't be reused automatically if the channel's arguments are not the same, but somehow it only increased by 2. I still can't fully explain it. But with this PR we can be sure we only have one GCS channel from core worker -> GCS.
p = psutil.Process(gcs_server_pid)
print(">>", p.num_fds())
return p.num_fds()
print(">>", len(p.connections()))
Use connections instead of fds for better measurement of the sockets in GCS.
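To make the measurement point concrete, here is a small hedged sketch of how one could count the GCS server's sockets with psutil, in the spirit of the test change above. The function name is hypothetical, gcs_server_pid is assumed to be obtained elsewhere, and the expected count depends on the Ray version.

```python
import psutil


def count_gcs_sockets(gcs_server_pid: int) -> int:
    """Count open socket connections of the GCS server process.

    Counting connections is a tighter measurement than num_fds(), which also
    includes pipes, log files, and other non-socket descriptors.
    """
    p = psutil.Process(gcs_server_pid)
    print(">> fds:", p.num_fds(), "connections:", len(p.connections()))
    return len(p.connections())
```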
The bad thing here is that working this way somehow makes the tests run slower, and a lot of tests failed. This is pretty bad and hard to fix (some because they're written badly, and some because they run out of time). One thing we can do is not ask them to share the channel, which gives us the perf back. The regression doesn't come from the lock. gRPC internally will try to reuse the socket if the arguments are the same, and we can make use of that. But in the test there is somehow one more FD created. I don't have the bandwidth to figure out what's going on, so I basically just bumped the expected count in the test by one.
Sorry I missed your comment. Feel free to set up a meeting to discuss this.
Why are these changes needed?
After the GCS client was moved to C++, the FD usage increased by one: previously it was 2 and after that change it was 3.
In the fix, we reuse the channel to make sure there are only 2 connections between GCS and CoreWorker. We still create 3 channels, but we use the same arguments to create them and depend on gRPC to reuse the TCP connections it has already created.
The reason why it was previously 2 hasn't been figured out. Maybe gRPC has some hidden mechanism that can reuse the connection in some way.
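As a hedged, self-contained illustration of the "same arguments, shared connection" idea described above (not the Ray implementation itself): gRPC's C core can share the underlying subchannel between channels created with identical targets and arguments, so several logical channels need not mean several sockets. Whether the connection is actually shared depends on the gRPC version and channel arguments, so treat this as a sketch to experiment with rather than a guarantee; the address and options are hypothetical.

```python
import grpc

TARGET = "127.0.0.1:6379"  # hypothetical GCS address
OPTIONS = [("grpc.max_receive_message_length", 256 * 1024 * 1024)]

# Three logical channels created with identical target and arguments.
# With gRPC's default (global) subchannel pool, these may map onto a single
# underlying TCP connection, which is the behavior this PR relies on.
channels = [grpc.insecure_channel(TARGET, options=OPTIONS) for _ in range(3)]

# Creating a channel with *different* arguments would force a separate
# subchannel (and typically a separate socket), which is the extra FD
# this PR avoids by always using the same arguments.
different = grpc.insecure_channel(TARGET, options=[("grpc.keepalive_time_ms", 30_000)])
```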
Related issue number
#34635
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I have added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.