Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix containerd install for fcos #8107

Merged
merged 2 commits into from
Nov 5, 2021

Conversation

mafn
Copy link
Contributor

@mafn mafn commented Oct 21, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:
Allow deploying of containerd by download for Fedora CoreOS
Which issue(s) this PR fixes:

Fixes #8106

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

[containerd] Moved containerd and runc from `/usr/bin` to `bin_dir` (defaults to `/usr/local/bin`) - Fixing install for FCOS

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 21, 2021
@k8s-ci-robot
Copy link
Contributor

Welcome @mafn!

It looks like this is your first PR to kubernetes-sigs/kubespray 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kubespray has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 21, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @mafn. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 21, 2021
Copy link
Contributor

@oomichi oomichi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @oomichi

@@ -1,5 +1,5 @@
---

runc_bin_dir: /usr/bin/
runc_bin_dir: "{{ bin_dir }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to change this runc_bin_dir here?

The default value of bin_dir is /usr/local/bin, not /usr/bin.
This can affect the existing environments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A proper solution for this would be to add the per-OS vars format to this role (see https://github.com/kubernetes-sigs/kubespray/blob/master/roles/container-engine/docker/tasks/main.yml#L14-L31). A solution I find we duplicate way too often in this code base, in the end it might be a good idea to move all of this per-os-version-minor-major-family code to the bootstrap-os role or to a similar role to set all os-version-specific variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proper per os-solution is actually what happens to bin_dir and almost everything that is downloaded is put there. I'm not sure if downloading stuff to /usr/bin would be considered good practise.
runc_bin_dir is just used in that role so there are no other dependencies on the variable. AFAICS there are no actual dependencies on the location of runc as long as it is the PATH (and for fcos the position in PATH as we are not actually removing runc from the base image), but to be sure I'm testing the upgrade from 10c30ea to this PR on a test cluster to see if there is any breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upgrade went smoothly for both upgrade-cluster.yml as well as cluster.yml.
The only problem through the upgrade would be orphaned files in /usr/bin if run after ea8e2fc or if using kata-containers. That could be fixed by adding a cleanup step for the files in /usr/bin

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the purpose of the change was to keep location compatibility with the deployment coming from os packages but changing the containerd_bin_dir and runc_bin_dir to bin_dir does make overall sense from a clean deployment point of view.

Since you already tested and identified the corner case above, may I suggest you add the necessary fixes to this PR ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at it again, Kata container already used /usr/local/bin so there is nothing to do, for the other case, I added a task to remove the binaries in /usr/bin (with ignore_errors: true for the case of a read-only /usr/bin as in fcos)

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 25, 2021
@cristicalin
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 25, 2021
@cristicalin
Copy link
Contributor

Thanks for fixing this @mafn !

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 5, 2021
Copy link
Member

@floryut floryut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mafn Thank you for the PR 🙇

@floryut floryut added kind/container-managers Containers section in the release note and removed kind/bug Categorizes issue or PR as related to a bug. labels Nov 5, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cristicalin, floryut, mafn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 5, 2021
@k8s-ci-robot k8s-ci-robot merged commit c942915 into kubernetes-sigs:master Nov 5, 2021
@mafn mafn deleted the fix-containerd-coreos branch November 5, 2021 14:55
@floryut floryut mentioned this pull request Dec 21, 2021
@fungusakafungus
Copy link
Contributor

It seems that this has broken remove-node playbook:
After running an upgrade with kubespray v2.18.0, all pods but daemonsets are removed. daemonset pods, like e.g. kube-proxy stay there. They also stay running through a containerd restart. When remove-node tries to force-remove all containers, the following happens:

TASK [reset | force remove all cri containers] *********************************************
fatal: [etcd-2.kubernetes-....]: FAILED! => {"attempts": 5, "changed": true, "cmd": ["/usr/local/bin/crictl", "rm", "-a", "-f"], "delta": "0:00:00.034779", "end": "2022-02-08 13:49:14.837916", "msg": "non-zero return code", "rc": 1, "start": "2022-02-08 13:49:14.803137", "stderr": "time=\"2022-02-08T13:49:14+01:00\" level=error msg=\"stopping the container \\\"e57dbf0502659e005a0ccb79f8e592117410e26bd11eb61466bedba917718f94\\\" failed: rpc error: code = Unknown desc = failed to kill container \\\"e57dbf0502659e005a0ccb79f8e592117410e26bd11eb61466bedba917718f94\\\": unknown error after kill: fork/exec /usr/bin/runc: no such file or directory: : unknown\"\ntime=\"2022-02-08T13:49:14+01:00\" level=fatal msg=\"unable to remove container(s)\"", "stderr_lines": ["time=\"2022-02-08T13:49:14+01:00\" level=error msg=\"stopping the container \\\"e57dbf0502659e005a0ccb79f8e592117410e26bd11eb61466bedba917718f94\\\" failed: rpc error: code = Unknown desc = failed to kill container \\\"e57dbf0502659e005a0ccb79f8e592117410e26bd11eb61466bedba917718f94\\\": unk
nown error after kill: fork/exec /usr/bin/runc: no such file or directory: : unknown\"", "time=\"2022-02-08T13:49:14+01:00\" level=fatal msg=\"unable to remove container(s)\""], "stdout": "", "stdout_lines": []}

the error extracted for better visibility:

stopping the container "..." failed: rpc error: code = Unknown desc = failed to kill container "...": unknown error after kill: fork/exec /usr/bin/runc: no such file or directory: : unknown
unable to remove container(s)

this is caused by containerd-shim still running since before the upgrade to 2.18 and through the change of path to runc and calling runc like this:
execve("/usr/bin/runc", ["/usr/bin/runc", "--root", "/run/containerd/runc/k8s.io", "--log", "/run/containerd/io.containerd.runtime.v1.linux/k8s.io/e57dbf0502659e005a0ccb79f8e592117410e26bd11eb61466bedba917718f94/log.json", "--log-format", "json", "kill", "e57dbf0502659e005a0ccb79f8e592117410e26bd11eb61466bedba917718f94", "9"], 0xc00013e0c0 /* 6 vars */) = -1 ENOENT (No such file or directory)
and obviously failing, as there's no runc there anymore.

I probably should put this into a new ticket...

sakuraiyuta pushed a commit to sakuraiyuta/kubespray that referenced this pull request Apr 16, 2022
* Fix containerd install for fcos

* rm orphaned runc and containerd binaries
LuckySB pushed a commit to southbridgeio/kubespray that referenced this pull request Jun 28, 2023
* Fix containerd install for fcos

* rm orphaned runc and containerd binaries
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/container-managers Containers section in the release note lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

containerd download broken with Fedora CoreOS
6 participants