Storage option changes in CRI-O configuration requires a reboot to be taken into account #8322

Open
visheshtanksale opened this issue Jun 27, 2024 · 18 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@visheshtanksale

What happened?

Set up Kata using kata-deploy on CRI-O.
When creating a test pod, I get the error below:

Jun 21 11:51:02 ipp1-1848 crio[613259]: time="2024-06-21T11:51:02.647812257Z" level=error msg="createContainer failed" error="rpc error: code = Internal desc = the file /bin/bash was not found" name=containerd-shim-v2 pid=614221 

If I try to bring up any other container using the kata-qemu runtime, I get a similar error saying that the command that is the container's entrypoint is not found.

Attached the CRI-O log here
Attached the Kata log here

QEMU and Kata versions are below:

[Hypervisor]
  MachineType = "q35"
  Version = "QEMU emulator version 7.2.0 (kata-static)\nCopyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers"
  Path = "/opt/kata/bin/qemu-system-x86_64"
  BlockDeviceDriver = "virtio-scsi"
  EntropySource = "/dev/urandom"
  SharedFS = "virtio-fs"
  VirtioFSDaemon = "/opt/kata/libexec/virtiofsd"
  SocketPath = ""
  Msize9p = 8192
  MemorySlots = 10
  HotPlugVFIO = "no-port"
  ColdPlugVFIO = "no-port"
  PCIeRootPort = 0
  PCIeSwitchPort = 0
  Debug = true
  [Hypervisor.SecurityInfo]
    Rootless = false
    DisableSeccomp = false
    GuestHookPath = ""
    EnableAnnotations = ["enable_iommu", "virtio_fs_extra_args", "kernel_params"]
    ConfidentialGuest = false

[Runtime]
  Path = "/opt/kata/bin/kata-runtime"
  GuestSeLinuxLabel = ""
  Debug = true
  Trace = false
  DisableGuestSeccomp = true
  DisableNewNetNs = false
  SandboxCgroupOnly = false
  [Runtime.Config]
    Path = "/opt/kata/share/defaults/kata-containers/configuration-qemu.toml"
  [Runtime.Version]
    OCI = "1.1.0+dev"
    [Runtime.Version.Version]
      Semver = "3.5.0"
      Commit = "cce735a09e7374ee52a3b4f5d4a4923e9af07f73"
      Major = 3
      Minor = 5
      Patch = 0
      

I opened an issue on kata-containers.
@littlejawa suggested adding the storage overlay config:

[crio]
  storage_option = [
	"overlay.skip_mount_home=true",
  ]

But this doesn't help.
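
For reference, one common way to apply such a storage option change is a CRI-O drop-in file followed by a service restart (a sketch; the drop-in file name is arbitrary, and as the rest of this issue shows, a plain restart turned out not to be enough here):

# Hypothetical drop-in; CRI-O reads extra configuration from /etc/crio/crio.conf.d/.
cat <<'EOF' > /etc/crio/crio.conf.d/99-storage.conf
[crio]
storage_option = [
  "overlay.skip_mount_home=true",
]
EOF

systemctl restart crio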

What did you expect to happen?

The pod should come up without errors.

How can we reproduce it (as minimally and precisely as possible)?

  • Install kata-deploy on a host with CRI-O, following the details mentioned here
  • Create a pod with the kata-qemu runtime class (see the sketch below)
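
For example, a minimal test pod could look like this (a sketch; the pod name and image are arbitrary, only runtimeClassName: kata-qemu matters here):

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: kata-test
spec:
  runtimeClassName: kata-qemu
  containers:
  - name: test
    image: ubuntu:22.04
    command: ["/bin/bash", "-c", "sleep infinity"]
EOF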

Anything else we need to know?

No response

CRI-O and Kubernetes version

$ crio --version
crio version 1.31.0
Version:        1.31.0
GitCommit:      004b5dc40823f9bce9b34c6da2a769778725c0f5
GitCommitDate:  2024-06-18T16:24:04Z
GitTreeState:   clean
BuildDate:      1970-01-01T00:00:00Z
GoVersion:      go1.22.3
Compiler:       gc
Platform:       linux/amd64
Linkmode:       static
BuildTags:
  static
  netgo
  osusergo
  exclude_graphdriver_btrfs
  exclude_graphdriver_devicemapper
  seccomp
  apparmor
  selinux
  exclude_graphdriver_devicemapper
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  true
$ kubectl version --output=json
{
  "clientVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.11",
    "gitCommit": "f25b321b9ae42cb1bfaa00b3eec9a12566a15d91",
    "gitTreeState": "clean",
    "buildDate": "2024-06-11T20:20:18Z",
    "goVersion": "go1.21.11",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "kustomizeVersion": "v5.0.4-0.20230601165947-6ce0bf390ce3",
  "serverVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.11",
    "gitCommit": "f25b321b9ae42cb1bfaa00b3eec9a12566a15d91",
    "gitTreeState": "clean",
    "buildDate": "2024-06-11T20:11:29Z",
    "goVersion": "go1.21.11",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux ipp1-1848 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

@visheshtanksale visheshtanksale added the kind/bug Categorizes issue or PR as related to a bug. label Jun 27, 2024
@visheshtanksale
Author

cc: @zvonkok

@haircommander
Member

@littlejawa is this something you're helping with or are you looking for reinforcements?

@zvonkok

zvonkok commented Jun 27, 2024

@haircommander Yes, he is helping with that, and we're currently out of options and need reinforcements.

@haircommander
Member

What happens when you create the container with a different OCI runtime?

@visheshtanksale
Author

What happens when you create the container with a different OCI runtime?

Non-Kata containers are created successfully.

@littlejawa
Contributor

The symptom is similar to what we saw with kata 3.3.0, where the content of the container's rootfs was not accessible to the runtime.
We fixed it in our own CI by adding the flag "storage.overlay.skip_mount_home=true" in CRI-O's config.
I'm also fixing it in the same way in the CRI-O CI for Kata, in #7958.

In this cluster the flag was not there, so we added it, but it didn't solve the problem.
Could CRI-O be ignoring the flag for some reason? What else could cause the same symptom?
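
One quick check (a sketch; this assumes `crio config` reflects the effective configuration, including any drop-ins) would be to confirm the option was actually loaded:

$ crio config 2>/dev/null | grep -n -A 3 "storage_option"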

@fidencio
Contributor

fidencio commented Jul 3, 2024

After some experiments on my side, here is what I learned.

[crio]
storage_option = [
  "overlay.skip_mount_home=true",
]

If ^^^ is set before Kubernetes is deployed, we're good.
If ^^^ is set after Kubernetes is deployed, restarting CRI-O / the kubelet does not solve the issue, although a full reboot does.

I've also added the same comment to the Kata Containers issue.
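
One way to see what a restart does not undo (a sketch; it assumes the default storage root, and my reading that without skip_mount_home the overlay directory is kept as a private bind mount that a CRI-O restart leaves in place):

$ findmnt -o TARGET,SOURCE,PROPAGATION /var/lib/containers/storage/overlay

If that still shows up as a private mount after changing the option and restarting CRI-O, the old behaviour is effectively still in force until the mount goes away, e.g. on reboot.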

@littlejawa
Contributor

Hey @haircommander,

I think we need your brain here :)

CRI-O was taking our change into account (according to its logs), but Kata still couldn't access the files from the container rootfs, meaning that the mount was still wrong.
We managed to make the cluster work by rebooting the node. Reloading/restarting CRI-O multiple times didn't help.

Is it because the layers were already mounted with the wrong flag, and not updated as part of the reload/restart?
If so, is there anything else we could have done to make them remounted properly?

Is rebooting the node the right way to get this setting applied?

@haircommander
Member

Is it because the layers were already mounted with the wrong flag, and not updated as part of the reload/restart?
If so, is there anything else we could have done to make them remounted properly?

Yeah, that makes sense to me. I think the only way to fix it would be to remove the containers and images. Rebooting is probably the least intrusive option.
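
If that is the case, a non-reboot alternative might look roughly like the following (an untested sketch; the storage path and the extra umount step are assumptions, and the node should be drained first):

# Stop the services that hold the container mounts.
systemctl stop kubelet crio
# Force-wipe containers and images known to CRI-O.
crio wipe -f
# If the overlay home is still bind-mounted from the old setting, unmount it (assumption).
umount /var/lib/containers/storage/overlay 2>/dev/null || true
systemctl start crio kubelet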

@littlejawa
Contributor

I see two things here:

  1. This issue is not about kata. I can't edit the title, but I think it should be something like: "Storage option changes in crio config requires a reboot to be taken into account"

  2. Do we want to fix it?
    Removing all images/containers is not something that I expect CRI-O to do by itself on every reload/restart.
    Even if we limit it to this specific kind of configuration change (assuming we can tell that it's a new setting) it can be very impactful.
    On the other hand, as one of the people who scratched their heads trying to understand what was going on: can we add a warning (maybe as comments in the conf file) to make sure people are aware they may need to reboot if they change it?

@kwilczynski
Member

/retitle Storage option changes in CRI-O configuration requires a reboot to be taken into account

@openshift-ci openshift-ci bot changed the title Pod creation fails with CRI-O on kata-qemu runtime Storage option changes in CRI-O configuration requires a reboot to be taken into account Jul 11, 2024
@kwilczynski
Member

[...]

  1. This issue is not about kata. I can't edit the title, but I think it should be something like : "Storage option changes in crio config requires a reboot to be taken into account"

@littlejawa, this is a restart of the guest virtual machine, correct? I hope that the host on which CRI-O runs does not require that.

@littlejawa
Contributor

No, we're talking about the host, unfortunately.

The problem is as follows:

  • CRI-O runs with some storage options
  • you change those options (here: asking to skip the private bind mount) and restart/reload CRI-O
    => the change is not taken into account (at least not for existing images/containers, if I understand correctly).

The way to make it take effect is to reboot the node.
That's bad, but the alternative seems to be: remove all images/containers... so maybe rebooting is the lesser of two evils :-(


A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 11, 2024
@kwilczynski kwilczynski removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 25, 2024

A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 25, 2024
@zvonkok
Copy link

zvonkok commented Sep 26, 2024

/remove-lifecycle-stale

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2024

A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 28, 2024
@zvonkok
Copy link

zvonkok commented Oct 28, 2024

/remove-lifecycle-stale

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2024