[core][scalability] Offload RPC Finish to a thread pool. #30131
Conversation
Force-pushed from ae3e6be to af06b65.
Still benchmarking... For shuffle_20gb_with_state_api_1668017686, the avg latency dropped by 50%.
new - old: [benchmark comparison image]
I also plan to increase the number of threads for the client/server GCS gRPC to hardware concurrency / 4 by default. I'll have a benchmark and comparison later.
Hmm, CI failed. Let me check.
} else {
  const auto &destroyed_actor_iter = destroyed_actors_.find(actor_id);
  if (destroyed_actor_iter != destroyed_actors_.end()) {
    reply->unsafe_arena_set_allocated_actor_table_data(
This is unrelated to the PR, right? Is it to use Arena on actor-related RPCs?
No, it is related. Since we now send Finish on another thread, this data needs to stay alive until it has actually been sent.
Here Arena::Create with placement new creates a shared_ptr on the reply's arena, so the data is kept alive while it is shared across threads.
Otherwise, it's going to crash.
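For illustration, a minimal self-contained sketch of the lifetime issue being discussed (the Actor struct here is an illustrative stand-in for GcsActor, not a Ray type): once the reply is handed to another thread, whatever it borrows must stay alive until the task runs, and holding a shared_ptr is what guarantees that.

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <memory>

// Stand-in for GcsActor; the real reply borrows its ActorTableData.
struct Actor {
  int table_data = 0;
};

int main() {
  boost::asio::thread_pool pool{1};
  auto actor = std::make_shared<Actor>();
  // The shared_ptr copy captured by the lambda keeps `actor` alive until the
  // posted task has run, even after the original reference is dropped. The PR
  // gets the same effect by placing a std::shared_ptr<GcsActor> on the reply's
  // protobuf Arena, so the reference is released when the reply is destroyed.
  boost::asio::post(pool, [actor]() { actor->table_data += 1; });
  actor.reset();  // safe: the lambda still holds a reference
  pool.join();
}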
src/ray/rpc/server_call.cc
Outdated
namespace ray {
namespace rpc {

static std::unique_ptr<boost::asio::thread_pool> executor_;
Maybe we should join this?
Also, is this thread-safe?
I'm not sure, but I feel there is no need to join (it's also a little tricky to join, though we could).
Right now it only calls Finish, which sends an event to the completion queue. I think gRPC shutdown should handle this, but I can't be 100% confident about that.
The issue with joining is deciding when to join. This is a global thread pool shared among servers, so it should be joined only after everything using it has been destructed, and since it's a global variable it would also have to be the last one joined.
   // this server call might be deleted
-  SendReply(status);
+  boost::asio::post(GetServerCallExecutor(),
+                    [this, status]() { SendReply(status); });
Can you comment on whether SendReply is thread-safe?
No, SendReply is not thread-safe. But at any point in time, only one SendReply will be called per object.
Hmm, it seems SendReply is accessing shared state in this object; would that cause a thread-safety issue?
Yes, but I don't think that's the cause of the test failure.
It's because gRPC shutdown doesn't take care of in-flight requests.
The reads and writes are also an issue here, although I haven't seen anything go wrong with them yet.
I'm going to hold the PR for a while. Still targeting 2.2.
The ASAN test failed. Maybe it's related to the join issue mentioned by @rkooo567. Let me debug it.
src/ray/rpc/server_call.cc
Outdated
static std::unique_ptr<boost::asio::thread_pool> executor_;

boost::asio::thread_pool &GetServerCallExecutor() {
Use a Meyers singleton?
https://laristra.github.io/flecsi/src/developer-guide/patterns/meyers_singleton.html
boost::asio::thread_pool &GetServerCallExecutor() {
  static boost::asio::thread_pool thread_pool{
      ::RayConfig::instance().num_server_call_thread()};
  return thread_pool;
}
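For reference, a runnable version of this suggestion (std::thread::hardware_concurrency() stands in for the RayConfig knob to keep the sketch self-contained). Initialization of a function-local static is thread-safe since C++11, and the pool's destructor stops and joins its threads at exit, which also bears on the join question above.

#include <algorithm>
#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <thread>

// Meyers singleton: constructed on first use via a thread-safe magic static.
// In the PR the pool size would come from RayConfig::num_server_call_thread().
boost::asio::thread_pool &GetServerCallExecutor() {
  static boost::asio::thread_pool pool{
      std::max(1u, std::thread::hardware_concurrency())};
  return pool;
}

int main() {
  boost::asio::post(GetServerCallExecutor(), [] { /* e.g. SendReply */ });
  // join() waits for posted work; the destructor would stop and join at exit,
  // but queued work that hasn't started yet may be abandoned by stop().
  GetServerCallExecutor().join();
}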
if (registered_actor_iter != registered_actors_.end()) {
  reply->unsafe_arena_set_allocated_actor_table_data(
      registered_actor_iter->second->GetMutableActorTableData());
  ptr = google::protobuf::Arena::Create<std::shared_ptr<GcsActor>>(
Is the shared_ptr or the actual data being created on the Arena, though?
the shared_ptr
)" This reverts commit b79e5b0.
…#30131) Right now, GCS/Raylet only has one thread running there. That thread is likely to become a bottleneck when load increased. For request like kv, it's really cheap, but the RPC overhead is actually big compared with the cheap operations. This potentially can cost a lot of issues and we only have one thread in the GCS/Raylet which makes the things worse. Before moving to multi-threading GCS/Raylet, one thing we can do is to execute Finish in a dedicated thread pool. Finish did a lot of things, like serialize the message which might be expensive. And Finish is easily to be offloaded from the main thread, so we can get a lot of gains. Signed-off-by: Weichen Xu <[email protected]>
…-project#30131)" (ray-project#30737) This reverts commit b79e5b0. Signed-off-by: Weichen Xu <[email protected]>
…-project#30131)" (ray-project#30737) This reverts commit b79e5b0. Signed-off-by: tmynn <[email protected]>
Why are these changes needed?
Right now, GCS/Raylet has only one thread running, and that thread is likely to become a bottleneck as load increases.
For requests like KV operations, the operation itself is cheap, but the RPC overhead is large compared with it.
This can potentially cause a lot of issues, and having only one thread in GCS/Raylet makes things worse.
Before moving to a multi-threaded GCS/Raylet, one thing we can do is execute Finish in a dedicated thread pool.
Finish does a lot of work, such as serializing the message, which can be expensive. Finish is also easy to offload from the main thread, so we can gain a lot.
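A minimal sketch of this pattern (SerializeAndFinish is an illustrative stand-in for the ServerCall SendReply/gRPC Finish path, not a Ray API): the single event-loop thread does the cheap handler work and posts the expensive reply path to a shared executor.

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <iostream>
#include <string>
#include <utility>

// Stand-in for the expensive part: serializing the reply and calling
// gRPC Finish. In the PR this work lives in ServerCall's SendReply.
void SerializeAndFinish(std::string reply) {
  std::cout << "sent " << reply.size() << " bytes\n";
}

int main() {
  boost::asio::thread_pool executor{4};  // shared reply executor
  // Cheap handler work (e.g. a KV lookup) stays on the main thread...
  std::string reply = "kv-value";
  // ...while serialization and Finish are offloaded to the pool.
  boost::asio::post(executor, [r = std::move(reply)]() mutable {
    SerializeAndFinish(std::move(r));
  });
  executor.join();  // wait for the posted reply to be sent
}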
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.