gRPC health server sets serving status to NOT_SERVING on defrag #16278
Conversation
Force-pushed 86b1ea4 to 988972c
Please see #16276 (comment)
Force-pushed 14b7905 to 3d4aa56
Hi @ahrtr @serathius @jmhbnz @fuweid, could you please take a look at this approach and share your thoughts about it?
I was inspired by what kubernetes/kubernetes#93280 (comment) proposed in the k8s issue. It will be helpful for eliminating the impact of online defrag.
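For context, the mechanism relies on the standard grpc-go health service. Below is a minimal sketch of the idea, not the exact etcd implementation; `defragment()` is a stand-in, and listener setup is omitted:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// defragment stands in for the real maintenance work; it is not an etcd API.
func defragment() {
	// ... compact and rebuild the backend database ...
}

func main() {
	srv := grpc.NewServer()

	// Register the standard gRPC health service alongside the application services.
	hs := health.NewServer()
	healthpb.RegisterHealthServer(srv, hs)

	// Around defrag, mark the server-wide service (empty name) NOT_SERVING so
	// health-checking clients route around this member, then restore SERVING.
	hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
	defragment()
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)

	// Listener setup and srv.Serve are omitted in this sketch.
}
```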
Force-pushed 3d4aa56 to a8b2f92
@ahrtr are you referring to Livez/Readyz, the 3.7 defrag plans, or the experimental dependency?
IMO this PR is a bugfix for an existing problem with defrag. Fundamentally, if we don't want to use https://pkg.go.dev/google.golang.org/grpc/health, we shouldn't merge this; otherwise I think it's a very useful fix.
@chaochn47 The change looks good to me. It's helpful to the kube-apiserver, which has multiple endpoints instead of one service, so it can forward requests to the available servers. What do you think about introducing a flag to enable this feature? It might bring heavy traffic to the other servers.
Thanks @fuweid for the feedback!
There is no built-in capacity-aware load balancer in gRPC-go. I will explore and explain it in the design doc after livez/readyz. As for disabling this feature via a flag in the etcd client, as I explained in #16278 (comment), we could add a flag in the app that consumes the etcd client instead.
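For reference, the opt-in on the consuming application's side would be the standard grpc-go client health-check plumbing. A hedged sketch follows; the target address and the round_robin choice are assumptions, and etcd's clientv3 wires its own dial options rather than calling grpc.Dial directly:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	// Blank import registers the client-side health check function with gRPC.
	_ "google.golang.org/grpc/health"
)

func main() {
	// healthCheckConfig enables per-connection health checking for the empty
	// (server-wide) service name; a balancer other than pick_first is needed
	// for NOT_SERVING addresses to actually be taken out of rotation.
	serviceConfig := `{
		"healthCheckConfig": {"serviceName": ""},
		"loadBalancingConfig": [{"round_robin": {}}]
	}`

	conn, err := grpc.Dial("dns:///etcd.example.com:2379", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... create application stubs on top of conn ...
}
```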
Please also rebase the PR.
Force-pushed d1544fd to 3397052
Done. /cc @ahrtr IMHO, the PR itself is good to merge and safe to cherry-pick to the release branch, since it's recommended to defrag one member at a time and client-side failover is disabled by default. Some users may configure gRPC probes in their livez setup; we can simply call out that they should consider moving to the HTTP livez probe when upgrading. One of the remaining risks to evaluate before expanding this to other readyz checks is that the client should fail open (i.e., keep at least one connection in the picker if all the etcd health servers report NOT_SERVING).
I see that the e2e test case is very similar to the integration test case. Is it possible to write a common test case?
Another big question: should we add an experimental flag, something like "
Enabling failpoints in both the e2e and integration tests has not yet been built into the common test framework. This can be a separate follow-up PR.
Okay.
Force-pushed 24a5666 to 5775609
Kindly ping @serathius @wenjiaswe @ptabor @mitake @spzala @jmhbnz ~
Force-pushed 5775609 to a853196
What's the suggested process to make progress on this PR? I think all the review comments have been addressed. Edit: found a reference:
https://etcd.io/docs/v3.5/triage/prs/#poke-reviewer-if-needed
This is a nice-to-have feature; it can prevent clients from being blocked on a member that is in the middle of defragmentation. We also have a flag. I'll leave it to other maintainers to take a second look. Thanks @chaochn47
Signed-off-by: Chao Chen <[email protected]>
Force-pushed a853196 to 9a59230
I think this is good enough to merge, but please still consider implementing the follow-ups.
gRPC health server sets serving status to NOT_SERVING on defrag
Backport from 3.6 in etcd-io#16278
Co-authored-by: Chao Chen <[email protected]>
Signed-off-by: Thomas Jungblut <[email protected]>
Trying to fix kubernetes/kubernetes#93280 on the etcd side, since etcd is part of the Kubernetes SIGs.
Before: the failure rate is around 33%.
After: the failure rate is 0%.
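Not the actual benchmark from this PR, but a rough sketch of how such a failure rate could be measured with clientv3; the endpoint, key, and request count below are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoint := "127.0.0.1:2379" // placeholder endpoint
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Kick off an online defrag against the same member in the background.
	go func() {
		if _, derr := cli.Defragment(context.Background(), endpoint); derr != nil {
			fmt.Println("defrag error:", derr)
		}
	}()

	// Issue short-deadline reads while the defrag runs and count failures.
	var failures, total int
	for i := 0; i < 1000; i++ {
		total++
		ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
		if _, gerr := cli.Get(ctx, "probe-key"); gerr != nil {
			failures++
		}
		cancel()
	}
	fmt.Printf("failure rate: %.1f%%\n", float64(failures)/float64(total)*100)
}
```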
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.