
[core] Introduce pull based health check to GCS. #29442

Merged: 20 commits into master on Nov 2, 2022

Conversation


@fishbone fishbone commented Oct 18, 2022

Why are these changes needed?

This PR introduces a pull-based health check to GCS. It fixes the false-positive issue where an overloaded GCS incorrectly marks a healthy node as dead.

The health check service in each Ray component is implemented using gRPC's built-in health checking service. This PR focuses on the client-side health check.

The following features are supported:

  • Initial delay when a new node is added, so the new node has time to ramp up.
  • Per-RPC timeout: to handle network issues, a request that fails to return within the timeout is counted as a failure.
  • If the health check fails X times consecutively, the node is considered dead.
  • A configurable interval between two consecutive health checks.

The client doesn't send two health checks in parallel: the next check always waits until the previous one has finished.

This work borrows from Kubernetes' liveness probe features.

A feature flag is introduced to turn it on or off; it is turned on in #29536.
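
For illustration, below is a minimal sketch of the protocol described above, written against gRPC's standard health checking service (grpc.health.v1). It deliberately uses the simple synchronous stub rather than the callback API this PR uses, and the constants, node address, and proto include path are illustrative assumptions rather than values taken from this PR.

#include <chrono>
#include <cstdint>
#include <memory>
#include <thread>

#include <grpcpp/grpcpp.h>
// Generated from grpc.health.v1/health.proto; the exact include path depends
// on how the proto is compiled in your build (this path is an assumption).
#include "src/proto/grpc/health/v1/health.grpc.pb.h"

using grpc::health::v1::Health;
using grpc::health::v1::HealthCheckRequest;
using grpc::health::v1::HealthCheckResponse;

// One health check with a deadline; a timed-out or errored RPC counts as a failure.
bool CheckOnce(Health::Stub &stub, int64_t timeout_ms) {
  grpc::ClientContext context;
  context.set_deadline(std::chrono::system_clock::now() +
                       std::chrono::milliseconds(timeout_ms));
  HealthCheckRequest request;
  HealthCheckResponse response;
  grpc::Status status = stub.Check(&context, request, &response);
  return status.ok() && response.status() == HealthCheckResponse::SERVING;
}

int main() {
  // Illustrative values mirroring the knobs described above (not the PR's defaults).
  const int64_t kInitialDelayMs = 1000;   // Ramp-up delay for a new node.
  const int64_t kTimeoutMs = 1000;        // Per-RPC deadline.
  const int64_t kPeriodMs = 1000;         // Interval between two checks.
  const int64_t kFailureThreshold = 3;    // Consecutive failures before "dead".

  // Hypothetical node address; in the real manager each node gets its own channel.
  auto channel = grpc::CreateChannel("127.0.0.1:10001",
                                     grpc::InsecureChannelCredentials());
  auto stub = Health::NewStub(channel);

  std::this_thread::sleep_for(std::chrono::milliseconds(kInitialDelayMs));
  int64_t remaining = kFailureThreshold;
  while (remaining > 0) {
    if (CheckOnce(*stub, kTimeoutMs)) {
      remaining = kFailureThreshold;  // Success resets the consecutive-failure count.
    } else {
      --remaining;
    }
    // Checks are serial: the next one starts only after the previous one returned.
    std::this_thread::sleep_for(std::chrono::milliseconds(kPeriodMs));
  }
  // kFailureThreshold consecutive failures: the node would be marked dead here.
  return 0;
}

In the PR itself the checks are driven asynchronously on the GCS io_service through per-node HealthCheckContext objects and the gRPC callback API, rather than by a blocking loop.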

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone fishbone changed the base branch from move-check-alive-to-node-mgr to master October 19, 2022 00:10
@fishbone fishbone changed the title Heartbeat pull [core] Introduce pull based health check to GCS. Oct 21, 2022
@fishbone fishbone marked this pull request as ready for review October 21, 2022 05:57

/// The context of the health check for each node.
absl::flat_hash_map<NodeID, std::unique_ptr<HealthCheckContext>>
inflight_health_checks_;
Contributor:

would health_checked_nodes_ or simply nodes_ be a better name?

Contributor Author:

maybe health_check_contexts_?

/// \param on_node_death_callback The callback function invoked when a node is marked
/// as failed.
/// \param initial_delay_ms The delay before the first health check.
/// \param period_ms The interval between two health checks for the same node.
/// \param failure_threshold The number of consecutive failures before a node is
/// marked as dead due to health check failures.
Contributor:

failure_threshold -> num_consecutive_failures_threshold?

Contributor Author (@fishbone, Oct 24, 2022):

I borrowed the terminology from k8s (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes)

Since the whole protocol also borrows a lot from there, I think maybe we should keep it?

(no strong opinion).

@scv119 (Contributor) commented Oct 24, 2022

Looks very clean! Just minor comments.

@scv119 added the @author-action-required label (the PR author is responsible for the next step) on Oct 24, 2022
@fishbone removed the @author-action-required label on Oct 24, 2022
@fishbone (Contributor Author):

The Mac unit test //:gcs_health_check_manager_test failed; checking.

@rkooo567 (Contributor):

@iycheng can you give me one more day? I will try to review this by EoD tomorrow in my time zone.

@@ -78,6 +77,9 @@
if not ray._raylet.Config.use_ray_syncer():
    _METRICS.append("ray_outbound_heartbeat_size_kb_sum")

if not ray._raylet.Config.pull_based_healthcheck():
Contributor:

Consider adding metrics for pull based heartbeat too?

Contributor:

Could be in another PR.

Contributor Author:

I could add it in this PR

Contributor Author (@fishbone, Oct 31, 2022):

@rkooo567 I think we should do a better job of monitoring RPC requests.
The cpp -> python -> opencensus path is not good practice, I believe.

I can add gRPC metrics here, but in the long term we should go cpp -> opencensus agent directly.
We can use the tools built into gRPC to do this (https://github.com/grpc/grpc/blob/5f6c357e741207f4af7e3b642a486bdda12c93df/include/grpcpp/opencensus.h#L35).
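
For reference, a minimal sketch (an assumption, not part of this PR) of the gRPC built-in OpenCensus integration mentioned above; it assumes gRPC was built with OpenCensus support:

#include <grpcpp/opencensus.h>

int main() {
  // Must be called before creating any gRPC channels or servers.
  grpc::RegisterOpenCensusPlugin();
  // Registers the standard per-RPC views (latency, bytes, counts) for export.
  grpc::RegisterOpenCensusViewsForExport();
  // ... create channels/servers and an OpenCensus exporter as usual ...
  return 0;
}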

/// A feature flag to enable pull based health check.
/// TODO: Turn it on by default
RAY_CONFIG(bool, pull_based_healthcheck, false)
RAY_CONFIG(int64_t, health_check_initial_delay_ms, 1000)
Contributor:

Feels like it is unnecessary? It'd take 30 seconds until the node is marked as dead anyway?

Contributor Author:

No, 30s is the old protocol. This is the new protocol, so if a node is dead, it only takes about 3s to mark the node dead.


if (RayConfig::instance().pull_based_healthcheck()) {
gcs_healthcheck_manager_ =
std::make_unique<GcsHealthCheckManager>(main_service_, node_death_callback);
Contributor:

Why don't we run this in heartbeat_manager_io_service_?

Contributor Author:

I'm trying to keep it simple. As we know, putting it on a separate thread doesn't help much in this case (we are already doing that and still seeing issues) since the bottleneck is not there.

I would prefer to keep it this way and, if it turns out to be the bottleneck, tune it later.

Contributor:

Hmm, I still prefer not to run this sort of critical operation on the main thread, as I've seen the main thread being blocked for > 10 seconds pretty often in the current GCS.

Contributor Author:

But that's not because of the health check, right? Think about it: if it's overloaded, it'll check at a lower frequency. My theory is that this won't make any difference, and introducing a thread is not going to fix that issue.

if (RayConfig::instance().pull_based_healthcheck()) {
gcs_healthcheck_manager_ =
std::make_unique<GcsHealthCheckManager>(main_service_, node_death_callback);
for (const auto &item : gcs_init_data.Nodes()) {
Contributor:

Maybe we can also have an Initialize API? That seems to be a consistent API across modules.

Contributor Author:

This couples it with the GCS and makes the API more complicated. The convention doesn't make sense here.
To give it the same API as the other modules, this module would need to depend on the GCS init data and the Raylet pool, and would also need to know how to construct the address and get the channel. That's maintenance overhead.
A good practice is to consider which approach simplifies writing the unit tests (the fewer dependencies, the better).

@@ -178,7 +178,9 @@ void GcsServer::DoStart(const GcsInitData &gcs_init_data) {
// be run. Otherwise the node failure detector will mistake
// some living nodes as dead as the timer inside node failure
// detector is already run.
gcs_heartbeat_manager_->Start();
if (gcs_heartbeat_manager_) {
Contributor:

Can you add a check that at least one of gcs_heartbeat_manager_ or gcs_healthcheck_manager_ must be nullptr?
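
For illustration, such a check might look like the following one-liner (a sketch assuming Ray's RAY_CHECK macro; not code from this PR):

// The two managers are mutually exclusive; at most one should be constructed.
RAY_CHECK(gcs_heartbeat_manager_ == nullptr || gcs_healthcheck_manager_ == nullptr);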


void GcsHealthCheckManager::FailNode(const NodeID &node_id) {
RAY_LOG(WARNING) << "Node " << node_id << " is dead because the health check failed.";
on_node_death_callback_(node_id);
Contributor:

Is there any way to check if this is running inside io_service?

Contributor Author:

I don't think it's easy to do.
Also, I don't think we should. It's totally normal for the callback to run on another thread.
How to make sure it's thread safe is another story.
I think one good practice is to always dispatch in the callback (this component does that; check the public API).

auto deadline =
std::chrono::system_clock::now() + std::chrono::milliseconds(manager_->timeout_ms_);
context_->set_deadline(deadline);
stub_->async()->Check(
Contributor:

Why don't we include this in the existing RPC paths, like raylet_client.cc? It seems more complicated to have a separate code path like this.

Contributor Author:

I think the existing path is way more complicated than this one. We shouldn't do it that way just because that's what we've been doing.
Think about it: how would you write a unit test with that kind of deep coupling?
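
For context, this is roughly what the standalone callback-API health check call looks like; a minimal sketch under the assumption that the health proto stubs are generated locally (the include path and helper names are illustrative, not the PR's exact code):

#include <chrono>
#include <cstdint>
#include <functional>
#include <memory>

#include <grpcpp/grpcpp.h>
// Generated from grpc.health.v1/health.proto; the include path is an assumption.
#include "src/proto/grpc/health/v1/health.grpc.pb.h"

using grpc::health::v1::Health;
using grpc::health::v1::HealthCheckRequest;
using grpc::health::v1::HealthCheckResponse;

// State that must stay alive until the completion callback fires.
struct CheckCall {
  grpc::ClientContext context;
  HealthCheckRequest request;
  HealthCheckResponse response;
};

void StartCheck(Health::Stub *stub,
                int64_t timeout_ms,
                std::function<void(bool healthy)> on_done) {
  auto call = std::make_shared<CheckCall>();
  call->context.set_deadline(std::chrono::system_clock::now() +
                             std::chrono::milliseconds(timeout_ms));
  // The callback API takes a completion lambda; no completion-queue polling
  // thread is needed, and the check fails once the deadline expires.
  stub->async()->Check(&call->context,
                       &call->request,
                       &call->response,
                       [call, on_done](grpc::Status status) {
                         on_done(status.ok() &&
                                 call->response.status() ==
                                     HealthCheckResponse::SERVING);
                       });
}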

health_check_remaining_ = manager_->failure_threshold_;
} else {
--health_check_remaining_;
RAY_LOG(WARNING) << "Health check failed for node " << node_id_
Contributor:

Maybe DEBUG? Feel like this may be too spammy?

Contributor Author:

I'm not sure. This should only happen when an error occurred (scale down won't trigger this). If it gets printed a lot, maybe that means we should increase the interval?

}

if (health_check_remaining_ == 0) {
manager_->io_service_.post([this]() { manager_->FailNode(node_id_); },
Contributor:

Maybe we don't need to post again, since we are already in the same io service?

Contributor Author:

Removing the node makes 'this' invalid, so the removal has to run somewhere else, after this callback returns. I can put some comments there.
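
To illustrate the lifetime issue being discussed, here is a minimal sketch (an assumption, not the PR's exact code, which uses Ray's instrumented io_context) of why the failure handling is posted instead of called inline:

#include <boost/asio.hpp>
#include <iostream>
#include <memory>
#include <unordered_map>

struct HealthCheckContext;
using ContextMap = std::unordered_map<int, std::unique_ptr<HealthCheckContext>>;

struct HealthCheckContext {
  HealthCheckContext(boost::asio::io_context &io, ContextMap &contexts, int id)
      : io_(io), contexts_(contexts), node_id_(id) {}

  // Simulates the completion callback of a failed health check.
  void OnCheckFailed() {
    // Do NOT call contexts_.erase(node_id_) directly here: that would destroy
    // `this` while this member function is still executing. Post it instead,
    // so the erase runs only after the current handler has returned.
    boost::asio::post(io_, [&contexts = contexts_, id = node_id_]() {
      contexts.erase(id);  // Destroys the context safely.
      std::cout << "node " << id << " marked dead\n";
    });
  }

  boost::asio::io_context &io_;
  ContextMap &contexts_;
  int node_id_;
};

int main() {
  boost::asio::io_context io;
  ContextMap contexts;
  contexts.emplace(1, std::make_unique<HealthCheckContext>(io, contexts, 1));

  // Simulate the gRPC completion being delivered on the io_context.
  boost::asio::post(io, [&contexts]() { contexts.at(1)->OnCheckFailed(); });
  io.run();
  return 0;
}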

~HealthCheckContext() {
timer_.cancel();
if (context_ != nullptr) {
context_->TryCancel();
Contributor:

What's the overhead of this call usually? Is it blocking?

Contributor Author:

no overhead.

@fishbone (Contributor Author):

@rkooo567 thanks for the review. This PR doesn't follow our existing convention because that convention adds complexity here. I'm following the direction of making the component do only its own job, with no Raylet/GCS involved. This is better for maintenance and makes it easier to write tests.
I'm also trying out the gRPC callback API. If it works well, in the future we should probably use it instead of the old way. Adding a gRPC method is very heavyweight nowadays, which doesn't make sense.
Let's talk in person if you have concerns, and we can figure out what to do.
Really appreciate your review here!

@fishbone (Contributor Author) commented Nov 1, 2022

Synced offline with Sang; here are the comments that need to be addressed by me:

  • The gRPC callback API is nice, but I am worried it exposes too much low-level detail (maybe we need a wrapper) -> we will migrate to this in the future if it's good.

We won't change it in this PR. The callback API is new in gRPC, and we should evaluate it; later, if we decide to go with it, we'll add a wrapper.

  • Network timeout 25 -> 60 seconds? or even longer

I'll change it.

  • The 5-second heartbeat timeout means the raylet should guarantee a 5-second SLA for the heartbeat endpoint while it is alive. Check + should we increase the timeout to something like 30 seconds?

I'll change it.

  • Does the stub use num_cpus/2 threads for client & server? We should figure that out.

I'll check the gRPC implementation. Good question!

  • The doc says gRPC UNAVAILABLE is a transient condition and we should back off. Is a 1-second backoff sufficient?

I'll test this and search around.

@fishbone (Contributor Author) commented Nov 1, 2022

Re: "stub uses num_cpus/2 threads for client & server? we should figure it out"

The tuning hasn't been implemented yet (grpc/grpc#28642).

The alternative CQ is a global variable, so each process will only have one (shared by client and server):
https://github.com/grpc/grpc/blob/master/src/cpp/common/completion_queue_cc.cc#L124

@fishbone (Contributor Author) commented Nov 1, 2022

@rkooo567 all comments answered. Let me know if you have any other concerns. I'll merge once the tests pass tomorrow.

@rkooo567 (Contributor) commented Nov 1, 2022

Looking forward to the improvement after the protocol change!!

@fishbone fishbone merged commit fdc7077 into master Nov 2, 2022
@fishbone fishbone deleted the heartbeat-pull branch November 2, 2022 00:32
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022