Fix quorum check when recovering broken etcd cluster (with etcd 3.5.x) #8126

Merged: 1 commit, Oct 26, 2021

floryut
Member

@floryut floryut commented Oct 25, 2021

What type of PR is this?
/kind bug

What this PR does / why we need it:
Since the update to etcd 3.5.x, the role that recovers a broken etcd cluster has been failing.
A bug introduced in 3.4.x has been fixed in 3.5.x: when the cluster was unhealthy, the etcdctl endpoint health command sent all of its output to stderr, including the "is healthy" lines.

The second change concerns etcdctl member list, which takes a few seconds to show newly created members, so a retry is now needed.
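The retry described above can be sketched generically in Python. This is only an illustration of the retry-until-ready pattern, not the role's actual Ansible task. The helper name `retry_until`, the retry count, the delay, and the member names are all hypothetical, since the PR description does not state them.

```python
import time


def retry_until(fetch, is_ready, retries=3, delay=5.0):
    """Call fetch() until is_ready(result) is true or the attempts run out."""
    result = None
    for attempt in range(retries):
        result = fetch()
        if is_ready(result):
            return result
        # Do not sleep after the final attempt.
        if attempt < retries - 1:
            time.sleep(delay)
    raise TimeoutError(
        f"condition not met after {retries} attempts; last result: {result!r}"
    )


# Stand-in for the member listing: the new member (hypothetical name "etcd3")
# only shows up on the third call.
responses = iter([
    ["etcd1", "etcd2"],
    ["etcd1", "etcd2"],
    ["etcd1", "etcd2", "etcd3"],
])
members = retry_until(
    lambda: next(responses),
    lambda listed: "etcd3" in listed,
    retries=5,
    delay=0,
)
print(members)
```

In an Ansible role, the same idea would normally be expressed with a task-level retry loop rather than custom code.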

Which issue(s) this PR fixes:
None

Special notes for your reviewer:
etcdctl endpoint health for a broken etcd cluster with 3.4.x version:

"stderr_lines": ["https://172.30.72.98:2379 is healthy: successfully committed proposal: took = 46.482728ms", "https://172.30.72.101:2379 is healthy: successfully committed proposal: took = 46.596092ms", "https://172.30.72.100:2379 is unhealthy: failed to commit proposal: context deadline exceeded", "Error: unhealthy cluster"]
"stdout_lines": []

etcdctl endpoint health for a broken etcd cluster with 3.5.x version:

"stderr_lines": ["https://172.30.72.95:2379 is unhealthy: failed to commit proposal: context deadline exceeded", "Error: unhealthy cluster"], 
"stdout_lines": ["https://172.30.72.98:2379 is healthy: successfully committed proposal: took = 16.923959ms", "https://172.30.72.92:2379 is healthy: successfully committed proposal: took = 16.786043ms"]

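To make the difference between the two outputs concrete, here is a minimal Python sketch that counts healthy endpoints and applies a majority rule. It is not the role's actual check, which this description does not show. The names `count_healthy` and `has_quorum` are hypothetical, and the rule used (strictly more than half of the members) is the usual Raft majority. The sample lines are copied from the two runs above.

```python
def count_healthy(stdout_lines, stderr_lines):
    """Count endpoints reported as healthy, whichever stream they were printed on."""
    lines = list(stdout_lines) + list(stderr_lines)
    return sum(1 for line in lines if " is healthy:" in line)


def has_quorum(healthy, members):
    """A majority means strictly more than half of the members."""
    return healthy >= members // 2 + 1


# etcd 3.4.x: everything, including the healthy endpoints, is on stderr.
v34 = {
    "stdout_lines": [],
    "stderr_lines": [
        "https://172.30.72.98:2379 is healthy: successfully committed proposal: took = 46.482728ms",
        "https://172.30.72.101:2379 is healthy: successfully committed proposal: took = 46.596092ms",
        "https://172.30.72.100:2379 is unhealthy: failed to commit proposal: context deadline exceeded",
        "Error: unhealthy cluster",
    ],
}

# etcd 3.5.x: healthy endpoints are on stdout, only failures are on stderr.
v35 = {
    "stdout_lines": [
        "https://172.30.72.98:2379 is healthy: successfully committed proposal: took = 16.923959ms",
        "https://172.30.72.92:2379 is healthy: successfully committed proposal: took = 16.786043ms",
    ],
    "stderr_lines": [
        "https://172.30.72.95:2379 is unhealthy: failed to commit proposal: context deadline exceeded",
        "Error: unhealthy cluster",
    ],
}

for name, out in (("3.4.x", v34), ("3.5.x", v35)):
    healthy = count_healthy(out["stdout_lines"], out["stderr_lines"])
    print(name, healthy, has_quorum(healthy, members=3))

# A check that looked only at stderr would find no healthy endpoint on 3.5.x.
stderr_only_v35 = count_healthy([], v35["stderr_lines"])
```

Reading both streams keeps the count identical for either etcd version, whereas a stderr-only count drops to zero on 3.5.x.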
Does this PR introduce a user-facing change?:

Fix quorum check when recovering broken etcd cluster (with etcd 3.5.x)

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 25, 2021
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 25, 2021
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 25, 2021
@floryut floryut force-pushed the etcd_debug branch 5 times, most recently from b853a86 to 1a009ef Compare October 26, 2021 13:29
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 26, 2021
@floryut floryut changed the title from "debug" to "Fix quorum check when recovering broken etcd cluster when using etcd 3.5.x" Oct 26, 2021
@floryut floryut changed the title from "Fix quorum check when recovering broken etcd cluster when using etcd 3.5.x" to "Fix quorum check when recovering broken etcd cluster (with etcd 3.5.x)" Oct 26, 2021
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 26, 2021
@floryut
Member Author

floryut commented Oct 26, 2021

@floryut floryut marked this pull request as ready for review October 26, 2021 14:54
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2021
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 26, 2021
@floryut
Member Author

floryut commented Oct 26, 2021

/cc @EppO @oomichi @cristicalin

@@ -20,10 +20,9 @@
when:
- groups['broken_etcd']

# When there is an error, everything is printed in stderr_lines, even "is healthy" messages.
Member Author

Not anymore 🎉 @EppO do you remember our etcd 3.4.x crusade ? brings back some memories 😆

@cristicalin
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cristicalin, floryut

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@oomichi
Contributor

oomichi commented Oct 26, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2021
@k8s-ci-robot k8s-ci-robot merged commit 9eacde2 into kubernetes-sigs:master Oct 26, 2021
@floryut floryut added the kind/bug Categorizes issue or PR as related to a bug. label Oct 27, 2021
@floryut floryut mentioned this pull request Dec 21, 2021
sakuraiyuta pushed a commit to sakuraiyuta/kubespray that referenced this pull request Apr 16, 2022
LuckySB pushed a commit to southbridgeio/kubespray that referenced this pull request Jun 27, 2023