Longhorn RWX volumes are not attached #1035

janosmiko · 2023-10-11T12:09:18Z

janosmiko
Oct 11, 2023

Description

Hi,

I'm using multiple clusters using this solution. Today, suddenly all the Longhorn RWX mounts stopped working in all of my clusters.

Previously I used Longhorn 1.5.1, now I rolled back to 1.4.3 but the same.

This is all I found in the logs:

Oct 11 07:42:23 dev-worker-1 k3s[1294]: Mounting command: /usr/local/sbin/nsmounter
Oct 11 07:42:23 dev-worker-1 k3s[1294]: Mounting arguments: mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.207.185:/pvc-13538170-4278-4467-b2b0-1f1ba6f54a4c /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/185c34f566c2eca6e8c7c6a2ede2094c076d7d25ddae286dc633eeef80551af0/globalmount
Oct 11 07:42:23 dev-worker-1-autoscaled-small-19baf778f50efd8c k3s[1294]: Output: mount.nfs: Protocol not supported for 10.43.207.185:/pvc-13538170-4278-4467-b2b0-1f1ba6f54a4c on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/185c34f566c2eca6e8c7c6a2ede2094c076d7d25ddae286dc633eeef80551af0/globalmount

I'm using a self installed longhorn, so it's disabled in the kube.tf, but this is the values.yaml I'm using. This also worked yesterday, so I'd say it's not related to the issue.

ingress:
  enabled: true
  annotations:
    kubernetes.io/tls-acme: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  ingressClassName: "nginx"
  host: "longhorn.blabla"
  tls: true
  tlsSecret: longhorn.blabla-tls

longhornManager:
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: arm64
      effect: NoSchedule
longhornDriver:
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: arm64
      effect: NoSchedule
longhornUI:
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: arm64
      effect: NoSchedule
longhornConversionWebhook:
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: arm64
      effect: NoSchedule
longhornAdmissionWebhook:
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: arm64
      effect: NoSchedule
longhornRecoveryBackend:
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: arm64
      effect: NoSchedule
persistence:
  defaultClassReplicaCount: 3
  defaultClass: false
  defaultReplicaAutoBalance: least-effort
defaultSettings:
  taintToleration: "kubernetes.io/arch=arm64:NoSchedule"
  storageOverProvisioningPercentage: 1000
  kubernetesClusterAutoscalerEnabled: "true"
  defaultReplicaCount: 3

Do you have any ideas or advices on how to further debug it?

Kube.tf file

module "k8s" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token
  version = "v2.7.0"

  source = "kube-hetzner/kube-hetzner/hcloud"

  cluster_name = "dev-worker-1"

  ssh_public_key  = file("./keys/id_ed25519.pub")
  ssh_private_key = file("./keys/id_ed25519")

  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cax21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 3
    },
  ]


  agent_nodepools = [
    {
      name        = "agent-cpx31",
      server_type = "cpx31",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 0
    },
    {
      name        = "egress",
      server_type = "cax11",
      location    = "fsn1",
      labels      = [
        "node.kubernetes.io/role=egress"
      ],
      taints = [
        "node.kubernetes.io/role=egress:NoSchedule"
      ],
      floating_ip = true
      count       = 1
    },
    {
      name        = "agent-arm-medium"
      server_type = "cax31"
      location    = "fsn1"
      count = 1
      labels      = [
        "kubernetes.io/arch=arm64"
      ],
      taints = [
        "kubernetes.io/arch=arm64:NoSchedule"
      ],
    },
  ]

  autoscaler_nodepools = [
    {
      name        = "autoscaled-small"
      server_type = "cpx31"
      location    = "fsn1"
      min_nodes   = 0
      max_nodes   = 0
    },
    {
      name        = "autoscaled-medium"
      server_type = "cpx41"
      location    = "nbg1"
      min_nodes   = 2
      max_nodes   = 3
    },
  ]

  network_region = "eu-central"

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  # Use dedicated load balancer for control plane
  use_control_plane_lb = true

  restrict_outbound_traffic = false

  # Use cilium as CNI as it supports egress nodes
  cni_plugin        = "cilium"
  cluster_ipv4_cidr = local.cluster_ipv4_cidr
  cilium_values     = <<EOT
ipam:
  operator:
    clusterPoolIPv4PodCIDRList:
      - ${local.cluster_ipv4_cidr}
kubeProxyReplacement: strict
l7Proxy: "false"
bpf:
  masquerade: "true"
egressGateway:
  enabled: "true"
extraConfig:
  mtu: "1450"
EOT

  # Install it manually
  enable_longhorn       = false
  # Install prometheus + prometheus-adapter instead
  enable_metrics_server = false
  # Install it manually
  ingress_controller    = "none"

  cluster_autoscaler_extra_args = [
    "--enforce-node-group-min-size=true"
  ]

  create_kubeconfig    = false
  create_kustomization = false

  initial_k3s_channel = "v1.27"
  automatically_upgrade_k3s = false
  automatically_upgrade_os = false

  k3s_exec_server_args = "--kube-apiserver-arg enable-admission-plugins=PodTolerationRestriction,PodNodeSelector"
}

Screenshots

No response

Platform

Linux

Answered by janosmiko

Oct 18, 2023

It's not a Longhorn bug, but actually a bug in the Linux kernel.

For anyone who faces the same issue and wants a real (and tested) solution...

SSH to all your worker nodes and run these commands:

transactional-update shell
zypper install -y --oldpackage https://download.opensuse.org/history/20231008/tumbleweed/repo/oss/x86_64/kernel-default-6.5.4-1.1.x86_64.rpm
zypper addlock kernel-default
exit

touch /var/run/reboot-required

If you'd like to make sure the autoscaled nodes also have this pinned kernel, delete the previous snapshots from hcloud, then modify the packer config with these:

  install_packages = <<-EOT
    set -ex
    echo "First reboot successful, installing needed packages.…

View full answer

mysticaltech · 2023-10-11T13:22:29Z

mysticaltech
Oct 11, 2023
Maintainer

@janosmiko You might want to SSH into a node and see the logs. Please refer to the Debug section in the readme.

@aleksasiriski Maybe you would know something about that specific issue?

0 replies

janosmiko · 2023-10-11T13:25:49Z

janosmiko
Oct 11, 2023
Author

Hi @mysticaltech ,

I found those logs in the node's journalctl (where the pod that needs the RWX volume). And actually when all the rest is working well (eg: a POD with RWO volume works as expected).

0 replies

mysticaltech · 2023-10-11T15:55:48Z

mysticaltech
Oct 11, 2023
Maintainer

@janosmiko Please have a look if the nfs packages are installed. If not, make sure you are using the latest version of the nodes.

See the packer file how nfs is installed and do the same manually, if that solves it, we would have identified the problem.

0 replies

janosmiko · 2023-10-11T16:00:25Z

janosmiko
Oct 11, 2023
Author

Sure, it's installed:

dev-worker-1-autoscaled-medium-47604670eec959dc:/ # zypper search --installed-only nfs-client
Loading repository data...
Reading installed packages...

S  | Name       | Summary                   | Type
---+------------+---------------------------+--------
i+ | nfs-client | Support Utilities for NFS | package

0 replies

janosmiko · 2023-10-11T16:27:38Z

janosmiko
Oct 11, 2023
Author

Maybe this one is related?
longhorn/longhorn#6857

0 replies

janosmiko · 2023-10-11T20:23:50Z

janosmiko
Oct 11, 2023
Author

I just did a rollback to the previous version of MicroOS (it did an auto-upgrade midnight) and now the issue is solved on that node.
I found this issue and I think they messed up the nfs client somehow...
https://bugzilla.opensuse.org/show_bug.cgi?id=1214540

They definitely updated nfs-client in the last couple of days from 2.6.3-39.4 to 2.6.3-39.5, possibly the issue lies there.

For anyone who faces the same issue:

# check your current snapshots
ls -lah /.snapshots/

...
drwxr-xr-x. 1 root root   66 Oct  8 00:13 33
...
drwxr-xr-x. 1 root root   66 Oct 11 19:08 45

# check the snapshot you want to rollback to (in my case I rollback to 33, as the date of that snapshot was 3 days ago)
transactional-update rollback 33

# and you have to reboot when the task is done
reboot

If you want to disable system upgrade manually:

systemctl disable --now transactional-update.timer

@mysticaltech I think there's also a bug in the terraform module.
I added these two lines after creating the cluster:

  automatically_upgrade_k3s = false
  automatically_upgrade_os = false

And none of them seems to work.

0 replies

carstenblt · 2023-10-12T07:29:57Z

carstenblt
Oct 12, 2023

Same here. Disaster :-/

0 replies

mysticaltech · 2023-10-12T22:48:39Z

mysticaltech
Oct 12, 2023
Maintainer

@janosmiko The upgrade flags are not retroactive, they take effect on the first deployment only. But see the upgrade section in the readme, you can disable it manually.

About the nfs-client, just freeze the version with zypper (via transactional-update shell).

After the version are frozen, you can let the upgrade be.

0 replies

janosmiko · 2023-10-13T06:59:18Z

janosmiko
Oct 13, 2023
Author

The upgrade flags are not retroactive, they take effect on the first deployment only. But see the upgrade section in the readme, you can disable it manually.

Can these be applied on the autoscaled nodes too?
Eg: I created the cluster with automated upgrades = true, but I want to disable automated upgrades. If I change this in terraform and apply it on the cluster, that will make sure the autoscaled nodes will not be created using automated upgrades and I only have to disable it manually on the already existing nodes?

Freezing the nfs-client can be a good solution for the already existing nodes, but autoscaled (newly created) nodes will be created using the new package version. :/

0 replies

janosmiko · 2023-10-13T07:35:57Z

janosmiko
Oct 13, 2023
Author

I reported it here:
https://bugzilla.opensuse.org/show_bug.cgi?id=1216201

0 replies

mysticaltech · 2023-10-13T10:30:57Z

mysticaltech
Oct 13, 2023
Maintainer

@janosmiko Yes you can ssh into autoscaled nodes too. And what you could do is freeze the version at the packer level and publish the new snapshot (just apply packer again, see readme).

0 replies

mysticaltech · 2023-10-13T10:35:07Z

mysticaltech
Oct 13, 2023
Maintainer

@janosmiko @Robert-turbo If you folks can give me the working version of the nfs-client I will freeze it at the packer level. These kind of packages do not need to get updated often. (then you can just recreate the packer image again, I will tell you how, just one command, so that all new nodes get a working version).

0 replies

primeXchange · 2023-10-13T11:24:51Z

primeXchange
Oct 13, 2023

@mysticaltech

Working version:

S  | Name                            | Type    | Version                        | Arch   | Repository
---+---------------------------------+---------+--------------------------------+--------+------------------------
i+ | nfs-client                      | package | 2.6.3-39.4                     | x86_64 | (System Packages)

Problematic version:

S  | Name                            | Type    | Version                        | Arch   | Repository
---+---------------------------------+---------+--------------------------------+--------+------------------------
i+ | nfs-client                      | package | 2.6.3-39.5                     | x86_64 | openSUSE-Tumbleweed-Oss

0 replies

mysticaltech · 2023-10-13T18:48:37Z

mysticaltech
Oct 13, 2023
Maintainer

Folks see solution here #1018, and also PR merging right away to avoid this problem in the future.

0 replies

mysticaltech · 2023-10-13T20:12:59Z

mysticaltech
Oct 13, 2023
Maintainer

Should be fixed in v2.8.0 but the image update is needed, please follow the steps laid out in #794.

0 replies

mysticaltech · 2023-10-13T20:13:43Z

mysticaltech
Oct 13, 2023
Maintainer

(The solution was to install and freeze an older version of nfs-client), see the solution linked above for manual fixes).

0 replies

janosmiko · 2023-10-16T06:34:16Z

janosmiko
Oct 16, 2023
Author

Hi @mysticaltech ,
Could you reopen this issue and wait for some feedback from real users please? Just asking because it's a critical issue for those who use this in production...

And it still doesn't work. I tested it even by manually pinning the nfs-client version only on all my nodes and the RWX longhorn volumes are still not mounted.
Also, Neil Brown mentioned in the related bugreport that this package update (nfs-client 2.6.3-39.4 -> 2.6.3-39.5) doesn't contain any changes, so the issue must be in another package.

https://bugzilla.suse.com/show_bug.cgi?id=1216201#c2

Also, installing x86-64 package on the arm snapshots will not work.
#1026 (review)

0 replies

janosmiko · 2023-10-16T08:44:08Z

janosmiko
Oct 16, 2023
Author

@mysticaltech
It looks like downgrading the kernel-default package to 6.5.4-1.1 and a reboot solves the issue.
You don't have to downgrade and pin the nfs-client package.

See the progress in the related bugreport: longhorn/longhorn#6857

0 replies

mysticaltech · 2023-10-16T13:34:22Z

mysticaltech
Oct 16, 2023
Maintainer

Thanks for the details @janosmiko, you are right. Will revert the change to pin the version and wait for more feedback on this issue.

0 replies

mysticaltech · 2023-10-17T02:11:52Z

mysticaltech
Oct 17, 2023
Maintainer

The changes to the base images pinning nfs-client were reverted in v2.8.1.

0 replies

mysticaltech · 2023-10-17T21:49:16Z

mysticaltech
Oct 17, 2023
Maintainer

@janosmiko As this is a longhorn bug, there is nothing else we can do here, closing for now. Thanks again for all the research and the info.

0 replies

janosmiko · 2023-10-18T05:28:04Z

janosmiko
Oct 18, 2023
Author

It's not a Longhorn bug, but actually a bug in the Linux kernel.

For anyone who faces the same issue and wants a real (and tested) solution...

SSH to all your worker nodes and run these commands:

transactional-update shell
zypper install -y --oldpackage https://download.opensuse.org/history/20231008/tumbleweed/repo/oss/x86_64/kernel-default-6.5.4-1.1.x86_64.rpm
zypper addlock kernel-default
exit

touch /var/run/reboot-required

If you'd like to make sure the autoscaled nodes also have this pinned kernel, delete the previous snapshots from hcloud, then modify the packer config with these:

  install_packages = <<-EOT
    set -ex
    echo "First reboot successful, installing needed packages..."
    transactional-update --continue pkg install -y ${local.needed_packages}
    transactional-update --continue shell <<- EOF
    setenforce 0
    rpm --import https://rpm.rancher.io/public.key
    zypper install -y https://github.com/k3s-io/k3s-selinux/releases/download/v1.4.stable.1/k3s-selinux-1.4-1.sle.noarch.rpm
    zypper addlock k3s-selinux
    zypper install -y "https://download.opensuse.org/history/20231008/tumbleweed/repo/oss/x86_64/kernel-default-6.5.4-1.1.x86_64.rpm"
    zypper addlock kernel-default
    restorecon -Rv /etc/selinux/targeted/policy
    restorecon -Rv /var/lib
    setenforce 1
    EOF
    sleep 1 && udevadm settle && reboot
  EOT

and rerun packer init hcloud-microos-snapshots.pkr.hcl && packer build hcloud-microos-snapshots.pkr.hcl

Wait for the images to be built and finally run terraform apply to update the cluster autoscaler config.

0 replies

janosmiko · 2023-10-18T05:33:19Z

janosmiko
Oct 18, 2023
Author

Note: @mysticaltech I saw it somewhere else too but it's kinda said it's more important to you to close bug reports asap instead of helping your users who are in trouble and trying to use your project.

0 replies

mysticaltech · 2023-10-18T06:50:35Z

mysticaltech
Oct 18, 2023
Maintainer

@janosmiko Absolutely not, I help the best I can, I even spent a few hours Sunday night implementing a fix you suggested that turned out not to be it. Now the bug is from Longhorn, there is nothing to do on that side but warning about it, hence pinning the issue.

0 replies

janosmiko · 2023-10-18T07:10:15Z

janosmiko
Oct 18, 2023
Author

you suggested that turned out not to be it

I haven't suggested that fix, you talked with another guy in the linked discussion about that. But I assume none of you tested if it really works and solves the issue.

there is nothing to do on that side but warning about it

I mentioned a temporary solution in the comment above: #1016 (comment)

3 replies

mysticaltech Oct 18, 2023
Maintainer

This is great, @janosmiko, I do appreciate it. People now have the solution thanks to you.

mysticaltech Oct 18, 2023
Maintainer

As for the linux kernel bug it will get fixed soon enough, probably during the next week or so. It's a rolling release. This matter will auto-resolve.

But good lesson learned, it's probably safer to turn off automatic node upgrade by default.

mysticaltech Oct 18, 2023
Maintainer

As for your remarks @janosmiko, you have to understand that this is an open source project without financial backing, it's not a paid product with customers. We cannot do everything ourselves and rely on community contributors like you to help us with bug fixes. Which you did, and I am glad for that.

Longhorn RWX volumes are not attached #1035

janosmiko Oct 11, 2023

Description

Kube.tf file

Screenshots

Platform

Replies: 25 comments · 3 replies

mysticaltech Oct 11, 2023 Maintainer

janosmiko Oct 11, 2023 Author

mysticaltech Oct 11, 2023 Maintainer

janosmiko Oct 11, 2023 Author

janosmiko Oct 11, 2023 Author

janosmiko Oct 11, 2023 Author

carstenblt Oct 12, 2023

mysticaltech Oct 12, 2023 Maintainer

janosmiko Oct 13, 2023 Author

janosmiko Oct 13, 2023 Author

mysticaltech Oct 13, 2023 Maintainer

mysticaltech Oct 13, 2023 Maintainer

primeXchange Oct 13, 2023

mysticaltech Oct 13, 2023 Maintainer

mysticaltech Oct 13, 2023 Maintainer

mysticaltech Oct 13, 2023 Maintainer

janosmiko Oct 16, 2023 Author

janosmiko Oct 16, 2023 Author

mysticaltech Oct 16, 2023 Maintainer

mysticaltech Oct 17, 2023 Maintainer

mysticaltech Oct 17, 2023 Maintainer

janosmiko Oct 18, 2023 Author

janosmiko Oct 18, 2023 Author

mysticaltech Oct 18, 2023 Maintainer

janosmiko Oct 18, 2023 Author

mysticaltech Oct 18, 2023 Maintainer

mysticaltech Oct 18, 2023 Maintainer

mysticaltech Oct 18, 2023 Maintainer

janosmiko
Oct 11, 2023

Replies: 25 comments 3 replies

mysticaltech
Oct 11, 2023
Maintainer

janosmiko
Oct 11, 2023
Author

mysticaltech
Oct 11, 2023
Maintainer

janosmiko
Oct 11, 2023
Author

janosmiko
Oct 11, 2023
Author

janosmiko
Oct 11, 2023
Author

carstenblt
Oct 12, 2023

mysticaltech
Oct 12, 2023
Maintainer

janosmiko
Oct 13, 2023
Author

janosmiko
Oct 13, 2023
Author

mysticaltech
Oct 13, 2023
Maintainer

mysticaltech
Oct 13, 2023
Maintainer

primeXchange
Oct 13, 2023

mysticaltech
Oct 13, 2023
Maintainer

mysticaltech
Oct 13, 2023
Maintainer

mysticaltech
Oct 13, 2023
Maintainer

janosmiko
Oct 16, 2023
Author

janosmiko
Oct 16, 2023
Author

mysticaltech
Oct 16, 2023
Maintainer

mysticaltech
Oct 17, 2023
Maintainer

mysticaltech
Oct 17, 2023
Maintainer

janosmiko
Oct 18, 2023
Author

janosmiko
Oct 18, 2023
Author

mysticaltech
Oct 18, 2023
Maintainer

janosmiko
Oct 18, 2023
Author

mysticaltech Oct 18, 2023
Maintainer

mysticaltech Oct 18, 2023
Maintainer

mysticaltech Oct 18, 2023
Maintainer