[core] Introduce pull based health check to GCS. #29442
Conversation
/// The context of the health check for each node.
absl::flat_hash_map<NodeID, std::unique_ptr<HealthCheckContext>>
    inflight_health_checks_;
Would health_checked_nodes_ or simply nodes_ be a better name?
Maybe health_check_contexts_?
/// \param on_node_death_callback The callback function when some node is marked as
/// failure. \param initial_delay_ms The delay for the first health check. \param
/// period_ms The interval between two health checks for the same node. \param
/// failure_threshold The threshold before a node will be marked as dead due to health
failure_threshold -> num_consecutive_failures_threshold?
I borrowed the terminology from k8s (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes).
Since the whole protocol also borrows a lot from it, I think maybe we should keep it (no strong opinion).
Looks very clean! Just minor comments.
The Mac unit test failed.
@iycheng can you give me one more day? I will try to review this by EoD tomorrow in my time zone.
@@ -78,6 +77,9 @@
if not ray._raylet.Config.use_ray_syncer():
    _METRICS.append("ray_outbound_heartbeat_size_kb_sum")

if not ray._raylet.Config.pull_based_healthcheck():
Consider adding metrics for pull based heartbeat too?
It could be done in another PR.
I could add it in this PR
@rkooo567 I think we should do a better job of monitoring the RPC requests. The cpp -> python -> opencensus path is not good practice, I believe.
I can add gRPC metrics here, but in the long term we should go from cpp to the opencensus agent directly.
We can use the tools built into gRPC to do this (https://github.com/grpc/grpc/blob/5f6c357e741207f4af7e3b642a486bdda12c93df/include/grpcpp/opencensus.h#L35).
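For reference, a minimal sketch of what using gRPC's built-in OpenCensus support could look like on the C++ side; the exact wiring and exporter setup in Ray would differ, so treat this as an assumption-laden illustration rather than the plan for this PR.

```cpp
// Sketch only: register gRPC's OpenCensus plugin so per-RPC stats (latency,
// bytes, completed RPCs) are collected in C++ without a cpp -> python hop.
// Registration must happen before any channel or server is created.
#include <grpcpp/grpcpp.h>
#include <grpcpp/opencensus.h>

int main() {
  grpc::RegisterOpenCensusPlugin();          // start collecting per-RPC stats
  grpc::RegisterOpenCensusViewsForExport();  // register the default gRPC views
  // ... create servers/channels and hook up an OpenCensus stats exporter here ...
  return 0;
}
```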
src/ray/common/ray_config_def.h
/// A feature flag to enable pull based health check.
/// TODO: Turn it on by default
RAY_CONFIG(bool, pull_based_healthcheck, false)
RAY_CONFIG(int64_t, health_check_initial_delay_ms, 1000)
This feels unnecessary; it'd take 30 seconds until the node is marked as dead anyway?
No, 30s is the old protocol. This is the new protocol, so here, if a node is dead, it only takes about 3s to mark the node as dead.
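To make the arithmetic concrete, here is a hedged back-of-the-envelope sketch; the numbers below are assumed illustrative values, not necessarily the defaults this PR sets.

```cpp
// Illustrative arithmetic only; the real defaults live in src/ray/common/ray_config_def.h.
#include <cstdint>
#include <iostream>

int main() {
  const int64_t period_ms = 1000;       // assumed interval between two checks to one node
  const int64_t failure_threshold = 3;  // assumed consecutive failures before marking dead

  // With the pull-based protocol, a dead node is detected after roughly
  // failure_threshold consecutive failed probes (plus per-RPC timeouts),
  // instead of waiting out the old ~30s push-based timeout.
  std::cout << "approximate detection time: " << failure_threshold * period_ms << " ms\n";
  return 0;
}
```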
if (RayConfig::instance().pull_based_healthcheck()) {
  gcs_healthcheck_manager_ =
      std::make_unique<GcsHealthCheckManager>(main_service_, node_death_callback);
Why don't we run this in heartbeat_manager_io_service_?
I'm trying to keep it simple. As we know, putting it on a separate thread doesn't help much in this case (we are doing that today and still seeing issues) since the bottleneck is not there.
I would prefer to keep it this way and tune it later if it becomes the bottleneck.
Hmm, I still prefer not to run this sort of critical operation on the main thread, as I've seen the main thread being blocked for more than 10 seconds pretty often in the current GCS.
But that's not because of the health check, right? Think about it: if the thread is overloaded, it'll just check at a lower frequency. My theory is that it's not going to make any difference, and introducing a thread is not going to fix this issue.
if (RayConfig::instance().pull_based_healthcheck()) {
  gcs_healthcheck_manager_ =
      std::make_unique<GcsHealthCheckManager>(main_service_, node_death_callback);
  for (const auto &item : gcs_init_data.Nodes()) {
Maybe we can also have an Initialize API? It seems to be a consistent API across modules.
This couples it with the GCS and makes the API more complicated. The convention doesn't make sense here.
To make it the same API as the others, this module would need to depend on the GCS init data as well as the raylet pool, and it would also need to know how to construct the address and get the channel. That's a maintenance overhead.
A good practice is to think about which way simplifies writing the unit test (the less dependence, the better).
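To illustrate the testability point, here is a hypothetical sketch of what a unit test could look like with the decoupled constructor; the header paths and the exact constructor signature are assumptions based on the snippets quoted in this review, not the actual test added in this PR.

```cpp
// Hypothetical test sketch; include paths and constructor signature are assumptions.
// The point: only an io_context and a death callback are needed, with no GcsServer,
// GcsInitData, or raylet client pool involved.
#include <memory>
#include <vector>

#include "ray/common/asio/instrumented_io_context.h"      // assumed path
#include "ray/common/id.h"                                 // NodeID
#include "ray/gcs/gcs_server/gcs_health_check_manager.h"   // assumed path

void SketchOfAUnitTest() {
  instrumented_io_context io;
  std::vector<ray::NodeID> dead_nodes;

  auto manager = std::make_unique<ray::gcs::GcsHealthCheckManager>(
      io, /*on_node_death_callback=*/[&](const ray::NodeID &id) {
        dead_nodes.push_back(id);
      });

  // A test can now register a node backed by a fake gRPC health endpoint and
  // drive `io` until the death callback fires.
}
```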
@@ -178,7 +178,9 @@ void GcsServer::DoStart(const GcsInitData &gcs_init_data) {
// be run. Otherwise the node failure detector will mistake
// some living nodes as dead as the timer inside node failure
// detector is already run.
gcs_heartbeat_manager_->Start();
if (gcs_heartbeat_manager_) {
Can you at least add a check that one of gcs_heartbeat_manager_ or gcs_healthcheck_manager_ must be nullptr?
void GcsHealthCheckManager::FailNode(const NodeID &node_id) {
  RAY_LOG(WARNING) << "Node " << node_id << " is dead because the health check failed.";
  on_node_death_callback_(node_id);
Is there any way to check if this is running inside io_service?
I don't think that's easy to do. Also, I don't think we should do it; it's totally normal for the callback to run on another thread.
How to make sure it's thread safe is another story. One practice is to always dispatch in the callback (this component is doing that; check the public API).
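As a hedged illustration of that practice (not Ray's actual code), here is a minimal asio-based class whose public entry points post onto the owning io_context, so internal state is only ever touched from that one thread.

```cpp
// Minimal stand-in for the "dispatch onto the owning io_context" pattern.
#include <string>
#include <unordered_set>

#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>

class HealthChecker {
 public:
  explicit HealthChecker(boost::asio::io_context &io) : io_(io) {}

  // Safe to call from any thread: the mutation is deferred to the io thread.
  void AddNode(const std::string &node_id) {
    boost::asio::post(io_, [this, node_id]() { nodes_.insert(node_id); });
  }

  void RemoveNode(const std::string &node_id) {
    boost::asio::post(io_, [this, node_id]() { nodes_.erase(node_id); });
  }

 private:
  boost::asio::io_context &io_;
  std::unordered_set<std::string> nodes_;  // only accessed on the io thread
};
```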
auto deadline =
    std::chrono::system_clock::now() + std::chrono::milliseconds(manager_->timeout_ms_);
context_->set_deadline(deadline);
stub_->async()->Check(
Why don't we include this in the existing RPC paths, like raylet_client.cc? It seems more complicated to have a separate code path like this.
I think the existing path is way more complicated than this one. We shouldn't do it that way just because that's what we have been doing.
Think about it: how would you write a unit test with that kind of deep coupling?
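For context, here is a hedged, self-contained sketch of what the quoted snippet is doing: issuing the standard grpc.health.v1 Check RPC through gRPC's callback API with a per-call deadline. The generated header path and the surrounding structure are assumptions, not Ray's actual code.

```cpp
// Hedged sketch of a single client-side health probe via the gRPC callback API.
#include <chrono>
#include <iostream>
#include <memory>

#include <grpcpp/grpcpp.h>
#include "src/proto/grpc/health/v1/health.grpc.pb.h"  // assumed path; depends on the build

using grpc::health::v1::Health;
using grpc::health::v1::HealthCheckRequest;
using grpc::health::v1::HealthCheckResponse;

void CheckOnce(const std::shared_ptr<grpc::Channel> &channel, int64_t timeout_ms) {
  std::shared_ptr<Health::Stub> stub = Health::NewStub(channel);

  // Heap-allocated because the callback completes asynchronously.
  auto context = std::make_shared<grpc::ClientContext>();
  auto request = std::make_shared<HealthCheckRequest>();
  auto response = std::make_shared<HealthCheckResponse>();
  context->set_deadline(std::chrono::system_clock::now() +
                        std::chrono::milliseconds(timeout_ms));

  stub->async()->Check(
      context.get(), request.get(), response.get(),
      [stub, context, request, response](grpc::Status status) {
        const bool healthy =
            status.ok() && response->status() == HealthCheckResponse::SERVING;
        std::cout << (healthy ? "healthy" : "unhealthy") << std::endl;
      });
}
```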
  health_check_remaining_ = manager_->failure_threshold_;
} else {
  --health_check_remaining_;
  RAY_LOG(WARNING) << "Health check failed for node " << node_id_
Maybe DEBUG? I feel like this may be too spammy.
I'm not sure. This should only happen when an error has occurred (scale down won't trigger this). If it gets printed a lot, maybe that means we should increase the interval?
}

if (health_check_remaining_ == 0) {
  manager_->io_service_.post([this]() { manager_->FailNode(node_id_); },
Maybe we don't need to post again, since we are already in the same io service?
Removing the node will make 'this' invalid, so we run it in another place after this callback. I can put some comments there.
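A minimal stand-in (assumed names, not Ray's code) for this "post, then erase" pattern: the per-node context defers the failure handling through the io_context because handling it erases that very context, which would otherwise destroy `this` mid-callback.

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <string>

#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>

struct Manager;

struct Context {
  Manager *manager;
  std::string node_id;
  void OnCheckFailedTooManyTimes();  // runs inside this context's own callback
};

struct Manager {
  boost::asio::io_context io;
  std::map<std::string, std::unique_ptr<Context>> contexts;

  void FailNode(const std::string &node_id) {
    std::cout << "node " << node_id << " marked dead\n";
    contexts.erase(node_id);  // destroys the Context object
  }
};

void Context::OnCheckFailedTooManyTimes() {
  // Deferring via post() lets the current callback finish before `this`
  // is destroyed by Manager::FailNode.
  auto *m = manager;
  auto id = node_id;
  boost::asio::post(m->io, [m, id]() { m->FailNode(id); });
}

int main() {
  Manager m;
  m.contexts["node-1"] = std::make_unique<Context>(Context{&m, "node-1"});
  m.contexts["node-1"]->OnCheckFailedTooManyTimes();
  m.io.run();
  return 0;
}
```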
~HealthCheckContext() {
  timer_.cancel();
  if (context_ != nullptr) {
    context_->TryCancel();
What's the overhead of this call usually? Is it blocking?
No overhead.
@rkooo567 thanks for the review. This PR does not follow the convention we have because that convention adds complexity here. I'm following the direction of making the component do only its own job, with no Raylet/GCS involved. This is good for maintenance and makes it easier to write tests.
Synced offline with Sang; here are the comments that need to be addressed by me:
- We'll not change it in this PR. The callback API is new in gRPC; we should evaluate it, and if we later decide to go with it, we'll add a wrapper for it.
- I'll change it.
- I'll change it.
- I'll check the gRPC implementation. Good question!
- I'll test this and search around.
The tuning hasn't been implemented yet (grpc/grpc#28642). The alternative CQ is a global variable, so each process will only have one (for both client and server).
@rkooo567 all comments answered. Let me know if you have any other concerns. I'll merge once the tests pass tomorrow.
Looking forward to the improvement after the protocol change!
Why are these changes needed?
This PR introduces a pull-based health check to GCS. This fixes the false positives that occur when GCS is overloaded and incorrectly marks a healthy node as dead.
The health check service in each Ray component is implemented using gRPC's built-in health checking service. This PR focuses on the client-side health check.
The following features are supported:
- Initial delay when a new node is added, so the new node has time to ramp up.
- Timeout for an RPC: in case of network issues, a request that fails to return within the timeout is considered a failure.
- If the health check fails X times consecutively, the node is considered dead.
- A configurable interval between two health checks sent to the same node.
This client doesn't send two health checks in parallel, so the next one always waits until the previous one has finished.
This work borrows from k8s's liveness probe features.
A feature flag is introduced to turn it on or off, and it is turned on in #29536
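Putting the knobs above together, here is a hedged, intentionally simplified sketch of the per-node client loop; the real implementation is asynchronous on an io_context, and the parameter names here just mirror the description rather than the actual config keys.

```cpp
// Simplified, blocking sketch of the per-node health-check loop described above.
#include <chrono>
#include <functional>
#include <thread>

struct HealthCheckParams {
  std::chrono::milliseconds initial_delay{1000};  // let a new node ramp up (assumed value)
  std::chrono::milliseconds period{1000};         // gap between two probes to one node (assumed)
  std::chrono::milliseconds timeout{1000};        // per-RPC deadline (assumed)
  int failure_threshold = 3;                      // consecutive failures before "dead" (assumed)
};

// Returns true if the probe succeeded within `timeout`; stands in for the gRPC Check RPC.
using ProbeFn = std::function<bool(std::chrono::milliseconds timeout)>;

void CheckNodeUntilDead(const HealthCheckParams &p,
                        const ProbeFn &probe_once,
                        const std::function<void()> &on_node_death) {
  std::this_thread::sleep_for(p.initial_delay);
  int remaining = p.failure_threshold;
  while (remaining > 0) {
    // Probes are strictly sequential: the next one starts only after the
    // previous one has finished (successfully or via timeout).
    if (probe_once(p.timeout)) {
      remaining = p.failure_threshold;  // any success resets the counter
    } else {
      --remaining;
    }
    std::this_thread::sleep_for(p.period);
  }
  on_node_death();
}
```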
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.