Offline control plane recover #10660

yuha0 · 2023-11-28T07:32:04Z

/kind bug

What this PR does / why we need it:

bug fix:

Add ignore_unreachable: true to Remove etcd data dir task, so that the playbook does not fail with node "unreachable" state when the broken etcd node is unreachable.
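For illustration, the patched task might look roughly like the sketch below. Only `ignore_errors`, `ignore_unreachable`, `delegate_to`, and `with_items` come from this PR's discussion; the module arguments and the `etcd_data_dir` variable name are assumptions, not copied from the role.

```yaml
# Sketch only, not the exact task from the role.
- name: Remove etcd data dir
  file:
    path: "{{ etcd_data_dir }}"   # assumed variable name
    state: absent
  delegate_to: "{{ item }}"
  with_items: "{{ groups['broken_etcd'] }}"
  ignore_errors: true        # tolerate failures raised inside the file module
  ignore_unreachable: true   # tolerate broken nodes that are offline
```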
documentation:
- The runbook steps are in two different paragraphs in two different sections. Combine them to make runbook steps clear.
- At the bottom there's a suggestion:
  
  kubespray/docs/recover-control-plane.md
  
  Line 38 in d583d33
  
  * If your new control plane nodes have new ip addresses you may have to change settings in various places.
  
  I find the wording a bit confusing, which makes it less useful as a suggestion to users:
  - What does "may" mean? Do I need to update the IPs or not?
  - What exactly are these "various places"? How do I know that I didn't miss any?

  In fact, the suggestion was added in 48a1828 4 years ago, but a task added 2 years ago in ee0f1e9 automatically updates the API server args with the updated etcd node IP addresses. Please let me know if I am wrong, but from what I found in the repo, and based on my experience using the playbook last week, this suggestion seems to be unnecessary at this point.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixes running `recover-control-plane.yml` with offline broken etcd nodes.

ignore_errors only ignores errors raised within the "file" module. When the target node is offline, however, the playbook still fails at this task because the node is in the "unreachable" state. Setting "ignore_unreachable: true" lets the playbook skip offline nodes and proceed with the recovery tasks on the remaining online nodes.

The suggestion was added in 48a1828 4 years ago, but a task added 2 years ago in ee0f1e9 automatically updates the API server args with the updated etcd node IP addresses. The suggestion is no longer needed.

linux-foundation-easycla · 2023-11-28T07:32:10Z

The committers listed above are authorized under a signed CLA.

✅ login: yuha0 / name: Yuhao Zhang (6400d24, a9fa0ca, 213d893)

k8s-ci-robot · 2023-11-28T07:32:13Z

Welcome @yuha0!

It looks like this is your first PR to kubernetes-sigs/kubespray 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kubespray has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2023-11-28T07:32:15Z

Hi @yuha0. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yankay · 2023-12-07T02:24:45Z

/ok-to-test

VannTen · 2023-12-11T21:03:18Z

Why do you add ignore_unreachable only to that task? I mean, in cases where the node is unreachable, the other task with ignore_errors would have the same problem.

yuha0 · 2023-12-14T16:58:01Z

Hi @VannTen thanks for looking into this.

> I mean, in cases where the node is unreachable, the other task with ignore_errors would have the same problem.

Not exactly.

First of all, my understanding is that, in Ansible, with_items executes the task once for each item in the list but returns a single overall status (it is up to the user to register the result and iterate over each item's return code). So in this case, whether ignore_errors/ignore_unreachable is needed depends on whether you want to tolerate all nodes in the group being unreachable or having task execution errors.

When I ran the playbook a few weeks ago, I only needed to ignore unreachable hosts on that one task, because that particular task runs only on the broken nodes and not on the healthy ones:
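A minimal sketch of registering a looped task and then inspecting each item's outcome. The `/tmp/test` path mirrors the POC further down; the `cleanup` variable name is hypothetical, and the `failed`/`unreachable` keys are read with `default(false)` because which keys each per-item result carries is assumed here rather than verified.

```yaml
- name: Remove a file on each broken node
  file:
    path: /tmp/test
    state: absent
  delegate_to: "{{ item }}"
  with_items: "{{ groups['broken_etcd'] }}"
  register: cleanup            # hypothetical variable name
  ignore_errors: true
  ignore_unreachable: true

- name: Report the outcome for each broken node
  debug:
    # item.item is the original loop item (the node name) of the previous task
    msg: >-
      {{ item.item }}:
      failed={{ item.failed | default(false) }},
      unreachable={{ item.unreachable | default(false) }}
  with_items: "{{ cleanup.results }}"
```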

kubespray/roles/recover_control_plane/etcd/tasks/main.yml

Lines 39 to 40 in 213d893

    
    delegate_to: "{{ item }}"
    with_items: "{{ groups['broken_etcd'] }}"

However, other tasks will run on healthy etcd nodes as well. For example, the immediate task after the above is:

kubespray/roles/recover_control_plane/etcd/tasks/main.yml

Lines 47 to 52 in 213d893

    
    - name: Delete old certificates
      shell: "rm {{ etcd_cert_dir }}/*{{ item }}*"
      with_items: "{{ groups['broken_etcd'] }}"
      register: delete_old_cerificates
      ignore_errors: true
      when: groups['broken_etcd']

This task removes the broken etcd nodes' certs from the healthy nodes. (FWIW, I think it would be nicer, for idempotence, to implement it with the file module and state: absent rather than an rm shell command without -f, but that's not relevant to this issue/PR.) I don't think leaving those old, unused cert files in that directory is harmful, but I do expect the task to run successfully on the healthy etcd nodes. If a healthy node is somehow unreachable, then the user has a bigger problem and this playbook won't help at all.
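A rough sketch of what a file-module variant of that task could look like. Because the file module does not expand globs, this sketch first collects matching paths with the find module; the `old_certificates` variable name is hypothetical, and none of this is part of the PR.

```yaml
- name: Find old certificates of broken etcd nodes
  find:
    paths: "{{ etcd_cert_dir }}"
    patterns: "*{{ item }}*"     # same glob as the rm command
  with_items: "{{ groups['broken_etcd'] }}"
  register: old_certificates     # hypothetical variable name

- name: Delete old certificates
  file:
    path: "{{ item.path }}"
    state: absent                # no error if the file is already gone
  with_items: "{{ old_certificates.results | map(attribute='files') | flatten }}"
```

Running this twice should report no changes on the second run, which is the idempotence the rm command without -f lacks.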

Here's a small POC:

inventory.ini:

[healthy_etcd]
a_reachable_node
[broken_etcd]
an_unreachable_node

task.yaml:

- hosts: all
  tasks:
  - name: remove file with file module
    ignore_errors: true
    ignore_unreachable: true
    delegate_to: "{{ item }}"
    with_items: "{{ groups['broken_etcd'] }}"
    file:
      path: /tmp/test
      state: absent
    when:
    - groups['broken_etcd']
  - name: remove file with shell module
    ignore_errors: true
    shell: "rm -f /tmp/test"
    with_items: "{{ groups['broken_etcd'] }}"
    when:
    - groups['broken_etcd']
  - name: post removal
    debug:
      msg: "previous tasks have been passed/skipped without problem"

This play only works when ignore_unreachable is true in the first task.

VannTen · 2024-01-11T13:41:59Z

I see, that makes sense, at least enough sense for me
/lgtm
/assign @floryut @yankay
(for approval)

floryut

> I see, that makes sense, at least enough sense for me /lgtm /assign @floryut @yankay (for approval)

lgtm for me 👍

k8s-ci-robot · 2024-01-22T16:15:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: floryut, yuha0

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [floryut]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* ignore_unreachable for etcd dir cleanup

  ignore_errors only ignores errors raised within the "file" module. When the target node is offline, however, the playbook still fails at this task because the node is in the "unreachable" state. Setting "ignore_unreachable: true" lets the playbook skip offline nodes and proceed with the recovery tasks on the remaining online nodes.

* Re-arrange control plane recovery runbook steps

* Remove suggestion to manually update IP addresses

  The suggestion was added in 48a1828 4 years ago, but a task added 2 years ago in ee0f1e9 automatically updates the API server args with the updated etcd node IP addresses. The suggestion is no longer needed.

yuha0 added 3 commits November 27, 2023 22:56

Re-arrange control plane recovery runbook steps

a9fa0ca

Remove suggestion to manually update IP addresses

213d893

The suggestion was added in 48a1828 4 years ago, but a task added 2 years ago in ee0f1e9 automatically updates the API server args with the updated etcd node IP addresses. The suggestion is no longer needed.

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 28, 2023

k8s-ci-robot requested review from holmsten and qvicksilver November 28, 2023 07:32

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Nov 28, 2023

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 28, 2023

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 28, 2023

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 7, 2023

k8s-ci-robot assigned floryut, yankay and VannTen Jan 11, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2024

floryut approved these changes Jan 22, 2024


k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 22, 2024

k8s-ci-robot merged commit 0e971a3 into kubernetes-sigs:master Jan 22, 2024
63 checks passed

mzaian mentioned this pull request Apr 26, 2024

Release Proposal v2.25 #11126

Closed
