Offline control plane recover #10660
ignore_errors only ignores errors that occur within the "file" module. When the target node is offline, however, the playbook still fails at this task because the node is in the "unreachable" state. Setting "ignore_unreachable: true" allows the playbook to bypass offline nodes and proceed with the recovery tasks on the remaining online nodes.
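For illustration, here is a minimal sketch of how the two keywords combine on a single task. It is not the actual kubespray task: the play name, the `broken_etcd` host group, and the `/var/lib/etcd` path are assumptions made for the example. Only `ignore_errors`, `ignore_unreachable`, and the `ansible.builtin.file` module are standard Ansible.

```yaml
# Hypothetical play; the host group and data directory are assumptions.
- name: Clean up etcd data on broken members
  hosts: broken_etcd              # assumed inventory group of broken etcd nodes
  gather_facts: false             # fact gathering would itself fail on offline hosts
  tasks:
    - name: Remove etcd data dir
      ansible.builtin.file:
        path: /var/lib/etcd       # assumed etcd data directory
        state: absent
      ignore_errors: true         # tolerate failures reported by the module itself
      ignore_unreachable: true    # tolerate hosts that cannot be contacted at all
```

With only `ignore_errors: true`, an offline host would still be marked unreachable, which is the failure described above. Adding `ignore_unreachable: true` on this one task should let the run continue past it.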
Welcome @yuha0! |
/ok-to-test |
Why do you add |
Hi @VannTen thanks for looking into this.
Not exactly. First of all, my understanding is that, in ansible,
When I ran the playbook a few weeks ago, I only needed to ignore unreachable on that one task, because that particular task runs only on broken nodes and not on healthy ones:
kubespray/roles/recover_control_plane/etcd/tasks/main.yml Lines 39 to 40 in 213d893
However, other tasks run on healthy etcd nodes as well. For example, the task immediately after the one above is:
kubespray/roles/recover_control_plane/etcd/tasks/main.yml Lines 47 to 52 in 213d893
which removes broken etcd nodes' certs from healthy nodes (fwiw, I think this task would be nicer if implemented with
Here's a small POC:
inventory.ini:
task.yaml:
This play only works when |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: floryut, yuha0 The full list of commands accepted by this bot can be found here. The pull request process is described here
* ignore_unreachable for etcd dir cleanup

  ignore_errors only ignores errors that occur within the "file" module. When the target node is offline, however, the playbook still fails at this task because the node is in the "unreachable" state. Setting "ignore_unreachable: true" allows the playbook to bypass offline nodes and proceed with the recovery tasks on the remaining online nodes.

* Re-arrange control plane recovery runbook steps

* Remove suggestion to manually update IP addresses

  The suggestion was added in 48a1828 4 years ago, but a task added 2 years ago, in ee0f1e9, automatically updates the API server arguments with the updated etcd node IP addresses. This suggestion is no longer needed.
/kind bug
What this PR does / why we need it:
bug fix:
Add "ignore_unreachable: true" to the "Remove etcd data dir" task, so that the playbook does not fail with node "unreachable" state when the broken etcd node is unreachable.
documentation:
The runbook steps are in two different paragraphs in two different sections. Combine them to make the runbook steps clear.
At the bottom there's a suggestion:
I find the wording a bit confusing, which makes it not very useful as a suggestion to users:
In fact, the suggestion was added in 48a1828 4 years ago, but a task added 2 years ago, in ee0f1e9, automatically updates the API server arguments with the updated etcd node IP addresses. Please let me know if I am wrong, but from what I found in the repo, and based on my experience using the playbook last week, this suggestion seems to be unnecessary at this point.
Which issue(s) this PR fixes:
Fixes #10649
Special notes for your reviewer:
Does this PR introduce a user-facing change?: