
[core][scalability] Offload RPC Finish to a thread pool. #30131

Merged
34 commits merged into ray-project:master on Nov 24, 2022

Conversation


@fishbone fishbone commented Nov 9, 2022

Why are these changes needed?

Right now, GCS/Raylet only has one thread running, and that thread is likely to become a bottleneck as load increases.

For requests like KV operations, the work itself is very cheap, but the RPC overhead is large compared with the operation. This can potentially cause a lot of issues, and having only one thread in GCS/Raylet makes things worse.

Before moving GCS/Raylet to multi-threading, one thing we can do is execute Finish in a dedicated thread pool.

Finish does a lot of work, such as serializing the reply message, which can be expensive. Finish is also easy to offload from the main thread, so we can gain a lot. The sketch below illustrates the idea.
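A minimal sketch of the pattern (not the PR's exact code; GetServerCallExecutor and SendReply follow the names discussed in the review below, while FinishOffMainThread and the pool size of 4 are placeholders):

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>

// Shared executor for offloaded replies; in the PR the thread count comes
// from config rather than a hard-coded 4.
boost::asio::thread_pool &GetServerCallExecutor() {
  static boost::asio::thread_pool pool{4};
  return pool;
}

// Instead of running Finish (reply serialization + send) inline on the single
// GCS/Raylet thread, post it to the pool. The server call object and any data
// borrowed by the reply must stay alive until the reply has been sent.
template <typename ServerCall, typename Status>
void FinishOffMainThread(ServerCall *call, const Status &status) {
  boost::asio::post(GetServerCallExecutor(),
                    [call, status]() { call->SendReply(status); });
}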

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Yi Cheng <[email protected]>
@fishbone fishbone changed the title from "Offload to rpc" to "[core] Offload RPC Finish to a thread pool." on Nov 9, 2022

fishbone commented Nov 9, 2022

Still benchmarking...

For shuffle_20gb_with_state_api_1668017686, the avg latency dropped by 50%.

@fishbone fishbone marked this pull request as ready for review November 9, 2022 18:50
Signed-off-by: Yi Cheng <[email protected]>

fishbone commented Nov 9, 2022

state api stress test:
new vs old


fishbone commented Nov 9, 2022

new - old (columns: metric, old, new, new - old):

time 1662.035492441 1018.922204774 -643.1132876670001
_peak_memory 50.67 48.22 -2.450000000000003
max_list_tasks_0_latency_sec 0.027965373000029103 0.027747097000030863 -0.00021827599999824088
avg_list_tasks_0_latency_sec 0.015576935125000801 0.014478697500006632 -0.0010982376249941694
p99_list_tasks_0_latency_sec 0.027738109320026183 0.027237273980029213 -0.0005008353399969696
p95_list_tasks_0_latency_sec 0.0268290546000145 0.02519798190002262 -0.0016310726999918812
p50_list_tasks_0_latency_sec 0.014137645500014173 0.012979032499998766 -0.0011586130000154071
max_list_tasks_1_latency_sec 0.006740087999986599 0.006368699999995897 -0.00037138799999070216
avg_list_tasks_1_latency_sec 0.0060334214000022255 0.005611180799999715 -0.0004222406000025103
p99_list_tasks_1_latency_sec 0.006678153329988561 0.006307836239996618 -0.00037031708999194293
p95_list_tasks_1_latency_sec 0.006430414649996407 0.006064381199999502 -0.0003660334499969052
p50_list_tasks_1_latency_sec 0.005966009000005101 0.005520626500000958 -0.00044538250000414337
max_list_tasks_100_latency_sec 0.1745103259999894 0.14157383200000595 -0.03293649399998344
avg_list_tasks_100_latency_sec 0.057074867899996204 0.04820108139999775 -0.008873786499998454
p99_list_tasks_100_latency_sec 0.16294573099998816 0.13291759069000508 -0.030028140309983076
p95_list_tasks_100_latency_sec 0.11668735099998301 0.09829262545000138 -0.01839472554998163
p50_list_tasks_100_latency_sec 0.04422751449999396 0.03705438400000105 -0.007173130499992908
max_list_tasks_1000_latency_sec 0.5022654530000068 0.45098966399999085 -0.05127578900001595
avg_list_tasks_1000_latency_sec 0.27970335160000276 0.26352176380000286 -0.016181587799999897
p99_list_tasks_1000_latency_sec 0.4902454970600056 0.4461178757999912 -0.044127621260014405
p95_list_tasks_1000_latency_sec 0.44216567330000045 0.4266307229999924 -0.015534950300008066
p50_list_tasks_1000_latency_sec 0.26770244200000093 0.2265817244999937 -0.04112071750000723
max_list_tasks_10000_latency_sec 2.9180097739999837 2.2827197780000006 -0.6352899959999831
avg_list_tasks_10000_latency_sec 1.937738225800001 1.6470455091000047 -0.2906927166999964
p99_list_tasks_10000_latency_sec 2.915159749039989 2.273735351510004 -0.6414243975299851
p95_list_tasks_10000_latency_sec 2.903759649200012 2.2377976455500175 -0.6659620036499945
p50_list_tasks_10000_latency_sec 1.7550061604999883 1.5433558119999873 -0.21165034850000097
max_list_actors_0_latency_sec 36.43849458699998 29.32838794899999 -7.110106637999991
avg_list_actors_0_latency_sec 4.740787761124999 3.847403464749995 -0.8933842963750038
p99_list_actors_0_latency_sec 33.96591196098997 27.34950027151998 -6.616411689469992
p95_list_actors_0_latency_sec 24.07558145694997 19.43394956159997 -4.641631895349999
p50_list_actors_0_latency_sec 0.031247525999987147 0.026435643500008155 -0.004811882499978992
max_list_actors_1_latency_sec 0.013958451000007699 0.009105063999982121 -0.0048533870000255774
avg_list_actors_1_latency_sec 0.01236417840000854 0.007317010299999538 -0.005047168100009002
p99_list_actors_1_latency_sec 0.013946628960007956 0.008964937869982918 -0.004981691090025038
p95_list_actors_1_latency_sec 0.013899340800008986 0.008404433349986105 -0.00549490745002288
p50_list_actors_1_latency_sec 0.011935257499999352 0.007208334500006686 -0.0047269229999926665
max_list_actors_100_latency_sec 0.030689091999988705 0.03689054100004796 0.006201449000059256
avg_list_actors_100_latency_sec 0.02820618149999632 0.027369357000009132 -0.000836824499987187
p99_list_actors_100_latency_sec 0.030563218629990844 0.036775455480043885 0.006212236850053041
p95_list_actors_100_latency_sec 0.030059725149999394 0.0363151134000276 0.006255388250028204
p50_list_actors_100_latency_sec 0.027676283000005242 0.025401503499978162 -0.00227477950002708
max_list_actors_1000_latency_sec 0.3905264750000015 0.3222996469999657 -0.06822682800003577
avg_list_actors_1000_latency_sec 0.2837003983999978 0.24619234309999455 -0.03750805530000323
p99_list_actors_1000_latency_sec 0.384842696540004 0.3190747790599687 -0.0657679174800353
p95_list_actors_1000_latency_sec 0.362107582700014 0.3061753072999806 -0.05593227540003337
p50_list_actors_1000_latency_sec 0.265682733999995 0.23279061349998642 -0.03289212050000856
max_list_actors_5000_latency_sec 62.375020518999975 43.517119718000004 -18.85790080099997
avg_list_actors_5000_latency_sec 10.530231458000003 9.679457610400004 -0.8507738475999993
p99_list_actors_5000_latency_sec 58.89878002704998 42.22651914002 -16.672260887029978
p95_list_actors_5000_latency_sec 44.993818059249946 37.06411682809997 -7.929701231149977
p50_list_actors_5000_latency_sec 2.400242450999997 2.579894201500025 0.17965175050002813
max_list_objects_100_latency_sec 0.029857513999957064 0.027603704000000562 -0.002253809999956502
avg_list_objects_100_latency_sec 0.024752121800003125 0.02285384150000027 -0.0018982803000028546
p99_list_objects_100_latency_sec 0.029392055149962175 0.027193184839999846 -0.0021988703099623287
p95_list_objects_100_latency_sec 0.027530219749982616 0.02555110819999697 -0.001979111549985646
p50_list_objects_100_latency_sec 0.024155908500006262 0.022444910500013293 -0.0017109979999929692
max_list_objects_1000_latency_sec 0.19521447599998965 0.1923943980000331 -0.0028200779999565384
avg_list_objects_1000_latency_sec 0.12590626280000947 0.11367335539999886 -0.01223290740001061
p99_list_objects_1000_latency_sec 0.19082364293999093 0.18658159437002894 -0.0042420485699619925
p95_list_objects_1000_latency_sec 0.17326031069999595 0.1633303798500122 -0.009929930849983754
p50_list_objects_1000_latency_sec 0.1158020310000154 0.10167875049998543 -0.014123280500029978
max_list_objects_10000_latency_sec 1.1914719459999787 1.0252319720000287 -0.16623997399995005
avg_list_objects_10000_latency_sec 1.1425278373000025 1.0020041844000105 -0.14052365289999202
p99_list_objects_10000_latency_sec 1.1888790053199774 1.0244472553400288 -0.16443174997994858
p95_list_objects_10000_latency_sec 1.1785072425999714 1.0213083887000294 -0.15719885389994204
p50_list_objects_10000_latency_sec 1.1387959840000121 1.0048665379999875 -0.13392944600002465
max_list_objects_50000_latency_sec 6.090131982999992 5.233914557000048 -0.8562174259999438
avg_list_objects_50000_latency_sec 5.914877249299986 5.057253843100011 -0.8576234061999743
p99_list_objects_50000_latency_sec 6.088506692349992 5.229219880970045 -0.8592868113799472
p95_list_objects_50000_latency_sec 6.082005529749995 5.210441176850031 -0.8715643528999637
p50_list_objects_50000_latency_sec 5.900189081499946 5.050077599000019 -0.8501114824999263
max_get_log_latency_sec 6.365316481982518 6.003401012015956 -0.36191546996656143
avg_get_log_latency_sec 2.793692241997595 2.8489423350044567 0.05525009300686179
p99_get_log_latency_sec 6.271445683923003 5.924650139075586 -0.34679554484741626
p95_get_log_latency_sec 5.895962491684941 5.609646647314104 -0.28631584437083646
p50_get_log_latency_sec 1.671776579006746 2.0658573649974414 0.3940807859906954
avg_state_api_latency_sec 1.8372290429264182 1.622403640485395 -0.21482540244102322

Signed-off-by: Yi Cheng <[email protected]>

fishbone commented Nov 9, 2022

I also plan to increase the number of threads for the client/server GCS gRPC to hardware concurrency / 4 by default. I'll have a benchmark and comparison later.
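For reference, one way such a default could be computed (a hypothetical helper, not necessarily what the follow-up does):

#include <algorithm>
#include <thread>

// Default the gRPC thread count to hardware concurrency / 4, falling back to 1
// because std::thread::hardware_concurrency() may return 0 on some platforms.
unsigned int DefaultGrpcThreadCount() {
  return std::max(1u, std::thread::hardware_concurrency() / 4);
}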

@fishbone fishbone linked an issue Nov 9, 2022 that may be closed by this pull request

fishbone commented Nov 9, 2022

hmmm CI failed. let me check.

Signed-off-by: Yi Cheng <[email protected]>
} else {
  const auto &destroyed_actor_iter = destroyed_actors_.find(actor_id);
  if (destroyed_actor_iter != destroyed_actors_.end()) {
    reply->unsafe_arena_set_allocated_actor_table_data(
Contributor

This is unrelated to the PR right? Is it to use Arena on actor related RPCs?

Contributor Author

No, it is related. We now send Finish in another thread, which means this data needs to stay alive until the reply has actually been sent.
Here Arena::Create with placement new creates a shared_ptr on the arena, so the data is shared across threads (and kept alive).

Otherwise, it's going to crash.

namespace ray {
namespace rpc {

static std::unique_ptr<boost::asio::thread_pool> executor_;
Contributor

Maybe we should join this?

Contributor

also is this thread-safe?

Contributor Author

I'm not sure, but I don't think we need to join it (it's also a little tricky to join; we could do it).

Right now it only calls Finish, which sends an event to the completion queue. I think gRPC shutdown should handle this.

But I can't be 100% confident about this one.

I think the issue with joining is deciding when to join. This is a global thread pool shared among servers, so it should be joined only after everything that uses it has been destructed, and it has to be the last thing joined (it's a global variable).
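One possible shape for this (a hypothetical helper, not part of this PR) is an explicit drain call invoked from the shutdown path after the servers have stopped accepting work:

// Hypothetical shutdown hook: block until every posted SendReply has run, so
// no offloaded task touches a destroyed server call after gRPC shutdown.
// Note that a boost::asio::thread_pool cannot accept new work after join().
void DrainServerCallExecutor() {
  GetServerCallExecutor().join();
}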

// this server call might be deleted
SendReply(status);
boost::asio::post(GetServerCallExecutor(),
                  [this, status]() { SendReply(status); });
Contributor

Can you comment on whether SendReply is thread-safe?

Contributor Author

No, SendReply is not thread-safe. But at any given time, only one SendReply will be called per call object.

Contributor

hmm, it seems SendReply is accessing shared state in this object; would this cause a thread-safety issue?

Contributor Author

Yes, but I don't think that's the cause of the test failure.
It's because gRPC shutdown doesn't take care of in-flight requests.

But the read and write here are also a potential issue. I haven't seen anything go wrong with it yet.

I'm going to hold the PR for a while. Still targeting 2.2.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 10, 2022
@fishbone fishbone removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 10, 2022
Signed-off-by: Yi Cheng <[email protected]>
fishbone (Contributor Author) commented:

ASAN test failed. Maybe related to the join issue mentioned by @rkooo567. Let me debug it.

@fishbone fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 10, 2022

static std::unique_ptr<boost::asio::thread_pool> executor_;

boost::asio::thread_pool &GetServerCallExecutor() {
Contributor

boost::asio::thread_pool &GetServerCallExecutor() {
  static boost::asio::thread_pool thread_pool{
      ::RayConfig::instance().num_server_call_thread()};
  return thread_pool;
}

if (registered_actor_iter != registered_actors_.end()) {
  reply->unsafe_arena_set_allocated_actor_table_data(
      registered_actor_iter->second->GetMutableActorTableData());
  ptr = google::protobuf::Arena::Create<std::shared_ptr<GcsActor>>(
Contributor

is the shared_ptr or the actual data being created on the Arena though?

Contributor Author

the shared_ptr
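Putting the two comments together (a sketch using the names from the snippet above, not the exact PR code): the data itself stays owned by the GcsActor, and the arena-allocated shared_ptr only pins that GcsActor until the reply is destroyed.

// The reply borrows the message via unsafe_arena_set_allocated_*, so the
// GcsActor that owns it must outlive the reply (which is now sent from a pool
// thread). Creating a shared_ptr<GcsActor> copy on the reply's arena ties the
// actor's lifetime to the reply: when the reply and its arena are destroyed,
// the shared_ptr's destructor runs and releases the reference.
reply->unsafe_arena_set_allocated_actor_table_data(
    registered_actor_iter->second->GetMutableActorTableData());
auto *pinned = google::protobuf::Arena::Create<std::shared_ptr<GcsActor>>(
    reply->GetArena(), registered_actor_iter->second);
(void)pinned;  // not used directly; destroyed together with the arena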

@fishbone fishbone added the do-not-merge Do not merge this PR! label Nov 11, 2022
@fishbone fishbone merged commit b79e5b0 into ray-project:master Nov 24, 2022
stephanie-wang added a commit that referenced this pull request Nov 29, 2022
fishbone pushed a commit that referenced this pull request Nov 30, 2022
fishbone added a commit that referenced this pull request Nov 30, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Signed-off-by: Weichen Xu <[email protected]>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
Development

Successfully merging this pull request may close these issues.

[Core] GCS hanging even cpu utilization low
3 participants