
[core][scalability] Offload RPC Finish to a thread pool. #30131

Merged
34 commits merged into ray-project:master on Nov 24, 2022

Conversation


@fishbone fishbone commented Nov 9, 2022

Why are these changes needed?

Right now, GCS/Raylet only has one thread running, and that thread is likely to become a bottleneck as load increases.

For requests like KV operations, the work itself is very cheap, but the RPC overhead is large compared with the operation. This can potentially cause a lot of issues, and having only one thread in GCS/Raylet makes things worse.

Before moving GCS/Raylet to multi-threading, one thing we can do is execute Finish in a dedicated thread pool.

Finish does a lot of work, such as serializing the reply message, which can be expensive. Finish is also easy to offload from the main thread, so we can gain a lot. The sketch below illustrates the idea.
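A minimal sketch of the pattern (not the PR's exact code; GetServerCallExecutor and SendReply follow the names discussed in the review below, while FinishOffMainThread and the pool size of 4 are placeholders):

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>

// Shared executor for offloaded replies; in the PR the thread count comes
// from config rather than a hard-coded 4.
boost::asio::thread_pool &GetServerCallExecutor() {
  static boost::asio::thread_pool pool{4};
  return pool;
}

// Instead of running Finish (reply serialization + send) inline on the single
// GCS/Raylet thread, post it to the pool. The server call object and any data
// borrowed by the reply must stay alive until the reply has been sent.
template <typename ServerCall, typename Status>
void FinishOffMainThread(ServerCall *call, const Status &status) {
  boost::asio::post(GetServerCallExecutor(),
                    [call, status]() { call->SendReply(status); });
}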

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Yi Cheng <[email protected]>
@fishbone fishbone changed the title from "Offload to rpc" to "[core] Offload RPC Finish to a thread pool." on Nov 9, 2022

fishbone commented Nov 9, 2022

Still benchmarking...

For shuffle_20gb_with_state_api_1668017686, the avg latency dropped by 50%.

@fishbone fishbone marked this pull request as ready for review November 9, 2022 18:50
Signed-off-by: Yi Cheng <[email protected]>

fishbone commented Nov 9, 2022

state api stress test:
new vs old


fishbone commented Nov 9, 2022

new - old (columns: metric, old, new, new - old):

time 1662.035492441 1018.922204774 -643.1132876670001
_peak_memory 50.67 48.22 -2.450000000000003
max_list_tasks_0_latency_sec 0.027965373000029103 0.027747097000030863 -0.00021827599999824088
avg_list_tasks_0_latency_sec 0.015576935125000801 0.014478697500006632 -0.0010982376249941694
p99_list_tasks_0_latency_sec 0.027738109320026183 0.027237273980029213 -0.0005008353399969696
p95_list_tasks_0_latency_sec 0.0268290546000145 0.02519798190002262 -0.0016310726999918812
p50_list_tasks_0_latency_sec 0.014137645500014173 0.012979032499998766 -0.0011586130000154071
max_list_tasks_1_latency_sec 0.006740087999986599 0.006368699999995897 -0.00037138799999070216
avg_list_tasks_1_latency_sec 0.0060334214000022255 0.005611180799999715 -0.0004222406000025103
p99_list_tasks_1_latency_sec 0.006678153329988561 0.006307836239996618 -0.00037031708999194293
p95_list_tasks_1_latency_sec 0.006430414649996407 0.006064381199999502 -0.0003660334499969052
p50_list_tasks_1_latency_sec 0.005966009000005101 0.005520626500000958 -0.00044538250000414337
max_list_tasks_100_latency_sec 0.1745103259999894 0.14157383200000595 -0.03293649399998344
avg_list_tasks_100_latency_sec 0.057074867899996204 0.04820108139999775 -0.008873786499998454
p99_list_tasks_100_latency_sec 0.16294573099998816 0.13291759069000508 -0.030028140309983076
p95_list_tasks_100_latency_sec 0.11668735099998301 0.09829262545000138 -0.01839472554998163
p50_list_tasks_100_latency_sec 0.04422751449999396 0.03705438400000105 -0.007173130499992908
max_list_tasks_1000_latency_sec 0.5022654530000068 0.45098966399999085 -0.05127578900001595
avg_list_tasks_1000_latency_sec 0.27970335160000276 0.26352176380000286 -0.016181587799999897
p99_list_tasks_1000_latency_sec 0.4902454970600056 0.4461178757999912 -0.044127621260014405
p95_list_tasks_1000_latency_sec 0.44216567330000045 0.4266307229999924 -0.015534950300008066
p50_list_tasks_1000_latency_sec 0.26770244200000093 0.2265817244999937 -0.04112071750000723
max_list_tasks_10000_latency_sec 2.9180097739999837 2.2827197780000006 -0.6352899959999831
avg_list_tasks_10000_latency_sec 1.937738225800001 1.6470455091000047 -0.2906927166999964
p99_list_tasks_10000_latency_sec 2.915159749039989 2.273735351510004 -0.6414243975299851
p95_list_tasks_10000_latency_sec 2.903759649200012 2.2377976455500175 -0.6659620036499945
p50_list_tasks_10000_latency_sec 1.7550061604999883 1.5433558119999873 -0.21165034850000097
max_list_actors_0_latency_sec 36.43849458699998 29.32838794899999 -7.110106637999991
avg_list_actors_0_latency_sec 4.740787761124999 3.847403464749995 -0.8933842963750038
p99_list_actors_0_latency_sec 33.96591196098997 27.34950027151998 -6.616411689469992
p95_list_actors_0_latency_sec 24.07558145694997 19.43394956159997 -4.641631895349999
p50_list_actors_0_latency_sec 0.031247525999987147 0.026435643500008155 -0.004811882499978992
max_list_actors_1_latency_sec 0.013958451000007699 0.009105063999982121 -0.0048533870000255774
avg_list_actors_1_latency_sec 0.01236417840000854 0.007317010299999538 -0.005047168100009002
p99_list_actors_1_latency_sec 0.013946628960007956 0.008964937869982918 -0.004981691090025038
p95_list_actors_1_latency_sec 0.013899340800008986 0.008404433349986105 -0.00549490745002288
p50_list_actors_1_latency_sec 0.011935257499999352 0.007208334500006686 -0.0047269229999926665
max_list_actors_100_latency_sec 0.030689091999988705 0.03689054100004796 0.006201449000059256
avg_list_actors_100_latency_sec 0.02820618149999632 0.027369357000009132 -0.000836824499987187
p99_list_actors_100_latency_sec 0.030563218629990844 0.036775455480043885 0.006212236850053041
p95_list_actors_100_latency_sec 0.030059725149999394 0.0363151134000276 0.006255388250028204
p50_list_actors_100_latency_sec 0.027676283000005242 0.025401503499978162 -0.00227477950002708
max_list_actors_1000_latency_sec 0.3905264750000015 0.3222996469999657 -0.06822682800003577
avg_list_actors_1000_latency_sec 0.2837003983999978 0.24619234309999455 -0.03750805530000323
p99_list_actors_1000_latency_sec 0.384842696540004 0.3190747790599687 -0.0657679174800353
p95_list_actors_1000_latency_sec 0.362107582700014 0.3061753072999806 -0.05593227540003337
p50_list_actors_1000_latency_sec 0.265682733999995 0.23279061349998642 -0.03289212050000856
max_list_actors_5000_latency_sec 62.375020518999975 43.517119718000004 -18.85790080099997
avg_list_actors_5000_latency_sec 10.530231458000003 9.679457610400004 -0.8507738475999993
p99_list_actors_5000_latency_sec 58.89878002704998 42.22651914002 -16.672260887029978
p95_list_actors_5000_latency_sec 44.993818059249946 37.06411682809997 -7.929701231149977
p50_list_actors_5000_latency_sec 2.400242450999997 2.579894201500025 0.17965175050002813
max_list_objects_100_latency_sec 0.029857513999957064 0.027603704000000562 -0.002253809999956502
avg_list_objects_100_latency_sec 0.024752121800003125 0.02285384150000027 -0.0018982803000028546
p99_list_objects_100_latency_sec 0.029392055149962175 0.027193184839999846 -0.0021988703099623287
p95_list_objects_100_latency_sec 0.027530219749982616 0.02555110819999697 -0.001979111549985646
p50_list_objects_100_latency_sec 0.024155908500006262 0.022444910500013293 -0.0017109979999929692
max_list_objects_1000_latency_sec 0.19521447599998965 0.1923943980000331 -0.0028200779999565384
avg_list_objects_1000_latency_sec 0.12590626280000947 0.11367335539999886 -0.01223290740001061
p99_list_objects_1000_latency_sec 0.19082364293999093 0.18658159437002894 -0.0042420485699619925
p95_list_objects_1000_latency_sec 0.17326031069999595 0.1633303798500122 -0.009929930849983754
p50_list_objects_1000_latency_sec 0.1158020310000154 0.10167875049998543 -0.014123280500029978
max_list_objects_10000_latency_sec 1.1914719459999787 1.0252319720000287 -0.16623997399995005
avg_list_objects_10000_latency_sec 1.1425278373000025 1.0020041844000105 -0.14052365289999202
p99_list_objects_10000_latency_sec 1.1888790053199774 1.0244472553400288 -0.16443174997994858
p95_list_objects_10000_latency_sec 1.1785072425999714 1.0213083887000294 -0.15719885389994204
p50_list_objects_10000_latency_sec 1.1387959840000121 1.0048665379999875 -0.13392944600002465
max_list_objects_50000_latency_sec 6.090131982999992 5.233914557000048 -0.8562174259999438
avg_list_objects_50000_latency_sec 5.914877249299986 5.057253843100011 -0.8576234061999743
p99_list_objects_50000_latency_sec 6.088506692349992 5.229219880970045 -0.8592868113799472
p95_list_objects_50000_latency_sec 6.082005529749995 5.210441176850031 -0.8715643528999637
p50_list_objects_50000_latency_sec 5.900189081499946 5.050077599000019 -0.8501114824999263
max_get_log_latency_sec 6.365316481982518 6.003401012015956 -0.36191546996656143
avg_get_log_latency_sec 2.793692241997595 2.8489423350044567 0.05525009300686179
p99_get_log_latency_sec 6.271445683923003 5.924650139075586 -0.34679554484741626
p95_get_log_latency_sec 5.895962491684941 5.609646647314104 -0.28631584437083646
p50_get_log_latency_sec 1.671776579006746 2.0658573649974414 0.3940807859906954
avg_state_api_latency_sec 1.8372290429264182 1.622403640485395 -0.21482540244102322

Signed-off-by: Yi Cheng <[email protected]>

fishbone commented Nov 9, 2022

I also plan to increase the number of threads for the client/server GCS gRPC to hardware concurrency / 4 by default. I'll have a benchmark and comparison later.
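For reference, one way such a default could be computed (a hypothetical helper, not necessarily what the follow-up does):

#include <algorithm>
#include <thread>

// Default the gRPC thread count to hardware concurrency / 4, falling back to 1
// because std::thread::hardware_concurrency() may return 0 on some platforms.
unsigned int DefaultGrpcThreadCount() {
  return std::max(1u, std::thread::hardware_concurrency() / 4);
}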

@fishbone fishbone linked an issue Nov 9, 2022 that may be closed by this pull request

fishbone commented Nov 9, 2022

hmmm CI failed. let me check.

Signed-off-by: Yi Cheng <[email protected]>
} else {
  const auto &destroyed_actor_iter = destroyed_actors_.find(actor_id);
  if (destroyed_actor_iter != destroyed_actors_.end()) {
    reply->unsafe_arena_set_allocated_actor_table_data(
Contributor

This is unrelated to the PR right? Is it to use Arena on actor related RPCs?

Contributor Author

No, it is related. We now send Finish in another thread, which means this data needs to stay alive until the reply has actually been sent.
Here Arena::Create with placement new creates a shared_ptr on the arena, so the data is shared across threads (and kept alive).

Otherwise, it's going to crash.

namespace ray {
namespace rpc {

static std::unique_ptr<boost::asio::thread_pool> executor_;
Contributor

Maybe we should join this?

Contributor

also is this thread-safe?

Contributor Author

I'm not sure, but I don't think we need to join it (it's also a little tricky to join; we could do it).

Right now it only calls Finish, which sends an event to the completion queue. I think gRPC shutdown should handle this.

But I can't be 100% confident about this one.

I think the issue with joining is deciding when to join. This is a global thread pool shared among servers, so it should be joined only after everything that uses it has been destructed, and it has to be the last thing joined (it's a global variable).
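One possible shape for this (a hypothetical helper, not part of this PR) is an explicit drain call invoked from the shutdown path after the servers have stopped accepting work:

// Hypothetical shutdown hook: block until every posted SendReply has run, so
// no offloaded task touches a destroyed server call after gRPC shutdown.
// Note that a boost::asio::thread_pool cannot accept new work after join().
void DrainServerCallExecutor() {
  GetServerCallExecutor().join();
}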

// this server call might be deleted
SendReply(status);
boost::asio::post(GetServerCallExecutor(),
                  [this, status]() { SendReply(status); });
Contributor

Can you comment on whether SendReply is thread-safe?

Contributor Author

No, SendReply is not thread-safe. But at any given time, only one SendReply will be called per call object.

Contributor

hmm, it seems SendReply is accessing shared state in this object; would this cause a thread-safety issue?

Contributor Author

Yes, but I don't think that's the cause of the test failure.
It's because gRPC shutdown doesn't take care of in-flight requests.

But the read and write here are also a potential issue. I haven't seen anything go wrong with it yet.

I'm going to hold the PR for a while. Still targeting 2.2.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 10, 2022
@fishbone fishbone removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 10, 2022
Signed-off-by: Yi Cheng <[email protected]>
fishbone (Contributor Author) commented:

ASAN test failed. Maybe related to the join issue mentioned by @rkooo567. Let me debug it.

@fishbone fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 10, 2022

static std::unique_ptr<boost::asio::thread_pool> executor_;

boost::asio::thread_pool &GetServerCallExecutor() {
Contributor

boost::asio::thread_pool &GetServerCallExecutor() {
  static boost::asio::thread_pool thread_pool{
      ::RayConfig::instance().num_server_call_thread()};
  return thread_pool;
}

if (registered_actor_iter != registered_actors_.end()) {
  reply->unsafe_arena_set_allocated_actor_table_data(
      registered_actor_iter->second->GetMutableActorTableData());
  ptr = google::protobuf::Arena::Create<std::shared_ptr<GcsActor>>(
Contributor

is the shared_ptr or the actual data being created on the Arena though?

Contributor Author

the shared_ptr
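Putting the two comments together (a sketch using the names from the snippet above, not the exact PR code): the data itself stays owned by the GcsActor, and the arena-allocated shared_ptr only pins that GcsActor until the reply is destroyed.

// The reply borrows the message via unsafe_arena_set_allocated_*, so the
// GcsActor that owns it must outlive the reply (which is now sent from a pool
// thread). Creating a shared_ptr<GcsActor> copy on the reply's arena ties the
// actor's lifetime to the reply: when the reply and its arena are destroyed,
// the shared_ptr's destructor runs and releases the reference.
reply->unsafe_arena_set_allocated_actor_table_data(
    registered_actor_iter->second->GetMutableActorTableData());
auto *pinned = google::protobuf::Arena::Create<std::shared_ptr<GcsActor>>(
    reply->GetArena(), registered_actor_iter->second);
(void)pinned;  // not used directly; destroyed together with the arena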

@fishbone fishbone added the do-not-merge Do not merge this PR! label Nov 11, 2022
@fishbone fishbone merged commit b79e5b0 into ray-project:master Nov 24, 2022
stephanie-wang added a commit that referenced this pull request Nov 29, 2022
fishbone pushed a commit that referenced this pull request Nov 30, 2022
fishbone added a commit that referenced this pull request Nov 30, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Signed-off-by: Weichen Xu <[email protected]>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
Development

Successfully merging this pull request may close these issues.

[Core] GCS hanging even cpu utilization low
3 participants