
[core] Introduce pull based health check to GCS. #29442

Merged: 20 commits into master on Nov 2, 2022

Conversation


@fishbone fishbone commented Oct 18, 2022

Why are these changes needed?

This PR introduces a pull-based health check to GCS. It fixes the false-positive issue where an overloaded GCS incorrectly marks a healthy node as dead.

The health check service in each Ray component is implemented using gRPC's built-in health checking service. This PR focuses on the client-side health check.

The following features are supported:

  • Initial delay when a new node is added, so the new node has time to ramp up.
  • Per-RPC timeout: to handle network issues, a request that fails to return within the timeout is counted as a failure.
  • If the health check fails X times consecutively, the node is considered dead.
  • A configurable interval between two consecutive health checks.

The client doesn't send two health checks in parallel: the next check always waits until the previous one has finished.

This work borrows from Kubernetes' liveness probe features.

A feature flag is introduced to turn it on or off; it is turned on in #29536.
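
For illustration, below is a minimal sketch of the protocol described above, written against gRPC's standard health checking service (grpc.health.v1). It deliberately uses the simple synchronous stub rather than the callback API this PR uses, and the constants, node address, and proto include path are illustrative assumptions rather than values taken from this PR.

#include <chrono>
#include <cstdint>
#include <memory>
#include <thread>

#include <grpcpp/grpcpp.h>
// Generated from grpc.health.v1/health.proto; the exact include path depends
// on how the proto is compiled in your build (this path is an assumption).
#include "src/proto/grpc/health/v1/health.grpc.pb.h"

using grpc::health::v1::Health;
using grpc::health::v1::HealthCheckRequest;
using grpc::health::v1::HealthCheckResponse;

// One health check with a deadline; a timed-out or errored RPC counts as a failure.
bool CheckOnce(Health::Stub &stub, int64_t timeout_ms) {
  grpc::ClientContext context;
  context.set_deadline(std::chrono::system_clock::now() +
                       std::chrono::milliseconds(timeout_ms));
  HealthCheckRequest request;
  HealthCheckResponse response;
  grpc::Status status = stub.Check(&context, request, &response);
  return status.ok() && response.status() == HealthCheckResponse::SERVING;
}

int main() {
  // Illustrative values mirroring the knobs described above (not the PR's defaults).
  const int64_t kInitialDelayMs = 1000;   // Ramp-up delay for a new node.
  const int64_t kTimeoutMs = 1000;        // Per-RPC deadline.
  const int64_t kPeriodMs = 1000;         // Interval between two checks.
  const int64_t kFailureThreshold = 3;    // Consecutive failures before "dead".

  // Hypothetical node address; in the real manager each node gets its own channel.
  auto channel = grpc::CreateChannel("127.0.0.1:10001",
                                     grpc::InsecureChannelCredentials());
  auto stub = Health::NewStub(channel);

  std::this_thread::sleep_for(std::chrono::milliseconds(kInitialDelayMs));
  int64_t remaining = kFailureThreshold;
  while (remaining > 0) {
    if (CheckOnce(*stub, kTimeoutMs)) {
      remaining = kFailureThreshold;  // Success resets the consecutive-failure count.
    } else {
      --remaining;
    }
    // Checks are serial: the next one starts only after the previous one returned.
    std::this_thread::sleep_for(std::chrono::milliseconds(kPeriodMs));
  }
  // kFailureThreshold consecutive failures: the node would be marked dead here.
  return 0;
}

In the PR itself the checks are driven asynchronously on the GCS io_service through per-node HealthCheckContext objects and the gRPC callback API, rather than by a blocking loop.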

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone fishbone changed the base branch from move-check-alive-to-node-mgr to master October 19, 2022 00:10
@fishbone fishbone changed the title Heartbeat pull [core] Introduce pull based health check to GCS. Oct 21, 2022
@fishbone fishbone marked this pull request as ready for review October 21, 2022 05:57

/// The context of the health check for each node.
absl::flat_hash_map<NodeID, std::unique_ptr<HealthCheckContext>>
inflight_health_checks_;
Contributor:

would health_checked_nodes_ or simply nodes_ be a better name?

Contributor Author:

maybe health_check_contexts_?

/// \param on_node_death_callback The callback function invoked when a node is marked
/// as failed.
/// \param initial_delay_ms The delay before the first health check.
/// \param period_ms The interval between two health checks for the same node.
/// \param failure_threshold The number of consecutive failures before a node is
/// marked as dead due to health check failures.
Contributor:

failure_threshold -> num_consecutive_failures_threshold?

Contributor Author (@fishbone, Oct 24, 2022):

I borrowed the terminology from k8s (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes)

Since the whole protocol also borrows a lot from there, I think maybe we should keep it?

(no strong opinion).

@scv119 (Contributor) commented Oct 24, 2022

Looks very clean! Just minor comments.

@scv119 added the @author-action-required label (the PR author is responsible for the next step) on Oct 24, 2022
@fishbone removed the @author-action-required label on Oct 24, 2022
@fishbone (Contributor Author):

The Mac unit test //:gcs_health_check_manager_test failed; checking.

@rkooo567 (Contributor):

@iycheng can you give me one more day? I will try to review this by EoD tomorrow in my time zone.

@@ -78,6 +77,9 @@
if not ray._raylet.Config.use_ray_syncer():
    _METRICS.append("ray_outbound_heartbeat_size_kb_sum")

if not ray._raylet.Config.pull_based_healthcheck():
Contributor:

Consider adding metrics for pull based heartbeat too?

Contributor:

Could be in another PR.

Contributor Author:

I could add it in this PR

Contributor Author (@fishbone, Oct 31, 2022):

@rkooo567 I think we should do a better job of monitoring RPC requests.
The cpp -> python -> opencensus path is not good practice, I believe.

I can add gRPC metrics here, but in the long term we should go cpp -> opencensus agent directly.
We can use the tools built into gRPC to do this (https://github.com/grpc/grpc/blob/5f6c357e741207f4af7e3b642a486bdda12c93df/include/grpcpp/opencensus.h#L35).
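
For reference, a minimal sketch (an assumption, not part of this PR) of the gRPC built-in OpenCensus integration mentioned above; it assumes gRPC was built with OpenCensus support:

#include <grpcpp/opencensus.h>

int main() {
  // Must be called before creating any gRPC channels or servers.
  grpc::RegisterOpenCensusPlugin();
  // Registers the standard per-RPC views (latency, bytes, counts) for export.
  grpc::RegisterOpenCensusViewsForExport();
  // ... create channels/servers and an OpenCensus exporter as usual ...
  return 0;
}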

/// A feature flag to enable pull based health check.
/// TODO: Turn it on by default
RAY_CONFIG(bool, pull_based_healthcheck, false)
RAY_CONFIG(int64_t, health_check_initial_delay_ms, 1000)
Contributor:

Feels like it is unnecessary? It'd take 30 seconds until the node is marked as dead anyway?

Contributor Author:

No, 30s is the old protocol. This is the new protocol, so if a node is dead, it only takes about 3s to mark the node dead.


if (RayConfig::instance().pull_based_healthcheck()) {
gcs_healthcheck_manager_ =
std::make_unique<GcsHealthCheckManager>(main_service_, node_death_callback);
Contributor:

Why don't we run this in heartbeat_manager_io_service_?

Contributor Author:

I'm trying to keep it simple. As we know, putting it on a separate thread doesn't help much in this case (we are already doing that and still seeing issues) since the bottleneck is not there.

I would prefer to keep it this way and, if it turns out to be the bottleneck, tune it later.

Contributor:

Hmm, I still prefer not to run this sort of critical operation on the main thread, as I've seen the main thread being blocked for > 10 seconds pretty often in the current GCS.

Contributor Author:

But that's not because of the health check, right? Think about it: if it's overloaded, it'll check at a lower frequency. My theory is that this won't make any difference, and introducing a thread is not going to fix that issue.

if (RayConfig::instance().pull_based_healthcheck()) {
gcs_healthcheck_manager_ =
std::make_unique<GcsHealthCheckManager>(main_service_, node_death_callback);
for (const auto &item : gcs_init_data.Nodes()) {
Contributor:

Maybe we can also have an Initialize API? That seems to be a consistent API across modules.

Contributor Author:

This couples it with the GCS and makes the API more complicated. The convention doesn't make sense here.
To give it the same API as the other modules, this module would need to depend on the GCS init data and the Raylet pool, and would also need to know how to construct the address and get the channel. That's maintenance overhead.
A good practice is to consider which approach simplifies writing the unit tests (the fewer dependencies, the better).

@@ -178,7 +178,9 @@ void GcsServer::DoStart(const GcsInitData &gcs_init_data) {
// be run. Otherwise the node failure detector will mistake
// some living nodes as dead as the timer inside node failure
// detector is already run.
gcs_heartbeat_manager_->Start();
if (gcs_heartbeat_manager_) {
Contributor:

Can you add a check that at least one of gcs_heartbeat_manager_ or gcs_healthcheck_manager_ must be nullptr?
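
For illustration, such a check might look like the following one-liner (a sketch assuming Ray's RAY_CHECK macro; not code from this PR):

// The two managers are mutually exclusive; at most one should be constructed.
RAY_CHECK(gcs_heartbeat_manager_ == nullptr || gcs_healthcheck_manager_ == nullptr);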


void GcsHealthCheckManager::FailNode(const NodeID &node_id) {
RAY_LOG(WARNING) << "Node " << node_id << " is dead because the health check failed.";
on_node_death_callback_(node_id);
Contributor:

Is there any way to check if this is running inside io_service?

Contributor Author:

I don't think it's easy to do.
Also, I don't think we should. It's totally normal for the callback to run on another thread.
How to make sure it's thread safe is another story.
I think one good practice is to always dispatch in the callback (this component does that; check the public API).

auto deadline =
std::chrono::system_clock::now() + std::chrono::milliseconds(manager_->timeout_ms_);
context_->set_deadline(deadline);
stub_->async()->Check(
Contributor:

Why don't we include this in the existing RPC paths, like raylet_client.cc? It seems more complicated to have a separate code path like this.

Contributor Author:

I think the existing path is way more complicated than this one. We shouldn't do it that way just because that's what we've been doing.
Think about it: how would you write a unit test with that kind of deep coupling?
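
For context, this is roughly what the standalone callback-API health check call looks like; a minimal sketch under the assumption that the health proto stubs are generated locally (the include path and helper names are illustrative, not the PR's exact code):

#include <chrono>
#include <cstdint>
#include <functional>
#include <memory>

#include <grpcpp/grpcpp.h>
// Generated from grpc.health.v1/health.proto; the include path is an assumption.
#include "src/proto/grpc/health/v1/health.grpc.pb.h"

using grpc::health::v1::Health;
using grpc::health::v1::HealthCheckRequest;
using grpc::health::v1::HealthCheckResponse;

// State that must stay alive until the completion callback fires.
struct CheckCall {
  grpc::ClientContext context;
  HealthCheckRequest request;
  HealthCheckResponse response;
};

void StartCheck(Health::Stub *stub,
                int64_t timeout_ms,
                std::function<void(bool healthy)> on_done) {
  auto call = std::make_shared<CheckCall>();
  call->context.set_deadline(std::chrono::system_clock::now() +
                             std::chrono::milliseconds(timeout_ms));
  // The callback API takes a completion lambda; no completion-queue polling
  // thread is needed, and the check fails once the deadline expires.
  stub->async()->Check(&call->context,
                       &call->request,
                       &call->response,
                       [call, on_done](grpc::Status status) {
                         on_done(status.ok() &&
                                 call->response.status() ==
                                     HealthCheckResponse::SERVING);
                       });
}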

health_check_remaining_ = manager_->failure_threshold_;
} else {
--health_check_remaining_;
RAY_LOG(WARNING) << "Health check failed for node " << node_id_
Contributor:

Maybe DEBUG? Feel like this may be too spammy?

Contributor Author:

I'm not sure. This should only happen when an error occurred (scale down won't trigger this). If it gets printed a lot, maybe that means we should increase the interval?

}

if (health_check_remaining_ == 0) {
manager_->io_service_.post([this]() { manager_->FailNode(node_id_); },
Contributor:

Maybe we don't need to post again, since we are already in the same io service?

Contributor Author:

Removing the node makes 'this' invalid, so the removal has to run somewhere else, after this callback returns. I can put some comments there.
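
To illustrate the lifetime issue being discussed, here is a minimal sketch (an assumption, not the PR's exact code, which uses Ray's instrumented io_context) of why the failure handling is posted instead of called inline:

#include <boost/asio.hpp>
#include <iostream>
#include <memory>
#include <unordered_map>

struct HealthCheckContext;
using ContextMap = std::unordered_map<int, std::unique_ptr<HealthCheckContext>>;

struct HealthCheckContext {
  HealthCheckContext(boost::asio::io_context &io, ContextMap &contexts, int id)
      : io_(io), contexts_(contexts), node_id_(id) {}

  // Simulates the completion callback of a failed health check.
  void OnCheckFailed() {
    // Do NOT call contexts_.erase(node_id_) directly here: that would destroy
    // `this` while this member function is still executing. Post it instead,
    // so the erase runs only after the current handler has returned.
    boost::asio::post(io_, [&contexts = contexts_, id = node_id_]() {
      contexts.erase(id);  // Destroys the context safely.
      std::cout << "node " << id << " marked dead\n";
    });
  }

  boost::asio::io_context &io_;
  ContextMap &contexts_;
  int node_id_;
};

int main() {
  boost::asio::io_context io;
  ContextMap contexts;
  contexts.emplace(1, std::make_unique<HealthCheckContext>(io, contexts, 1));

  // Simulate the gRPC completion being delivered on the io_context.
  boost::asio::post(io, [&contexts]() { contexts.at(1)->OnCheckFailed(); });
  io.run();
  return 0;
}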

~HealthCheckContext() {
timer_.cancel();
if (context_ != nullptr) {
context_->TryCancel();
Contributor:

What's the overhead of this call usually? Is it blocking?

Contributor Author:

no overhead.

@fishbone (Contributor Author):

@rkooo567 thanks for the review. This PR doesn't follow our existing convention because that convention adds complexity here. I'm following the direction of making the component do only its own job, with no Raylet/GCS involved. This is better for maintenance and makes it easier to write tests.
I'm also trying out the gRPC callback API. If it works well, in the future we should probably use it instead of the old way. Adding a gRPC method is very heavyweight nowadays, which doesn't make sense.
Let's talk in person if you have concerns, and we can figure out what to do.
Really appreciate your review here!

@fishbone (Contributor Author) commented Nov 1, 2022

Synced offline with Sang; here are the comments that need to be addressed by me:

  • The gRPC callback API is nice, but I am worried it exposes too much low-level detail (maybe we need a wrapper) -> we will migrate to this in the future if it's good.

We won't change it in this PR. The callback API is new in gRPC, and we should evaluate it; later, if we decide to go with it, we'll add a wrapper.

  • Network timeout 25 -> 60 seconds? or even longer

I'll change it.

  • The 5-second heartbeat timeout means the raylet should guarantee a 5-second SLA for the heartbeat endpoint while it is alive. Check + should we increase the timeout to something like 30 seconds?

I'll change it.

  • Does the stub use num_cpus/2 threads for client & server? We should figure that out.

I'll check the gRPC implementation. Good question!

  • The doc says gRPC UNAVAILABLE is a transient condition and we should back off. Is a 1-second backoff sufficient?

I'll test this and search around.

@fishbone (Contributor Author) commented Nov 1, 2022

Re: "stub uses num_cpus/2 threads for client & server? we should figure it out"

The tuning hasn't been implemented yet (grpc/grpc#28642).

The alternative CQ is a global variable, so each process will only have one (shared by client and server):
https://github.com/grpc/grpc/blob/master/src/cpp/common/completion_queue_cc.cc#L124

@fishbone (Contributor Author) commented Nov 1, 2022

@rkooo567 all comments answered. Let me know if you have any other concerns. I'll merge once the tests pass tomorrow.

@rkooo567 (Contributor) commented Nov 1, 2022

Looking forward to the improvement after the protocol change!!

@fishbone fishbone merged commit fdc7077 into master Nov 2, 2022
@fishbone fishbone deleted the heartbeat-pull branch November 2, 2022 00:32
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022