-
Description: Hi, I'm running multiple clusters with this solution. Today, suddenly all the Longhorn RWX mounts stopped working in all of my clusters. Previously I used Longhorn 1.5.1; I have now rolled back to 1.4.3, but the problem is the same. This is all I found in the logs:
I'm using a self-installed Longhorn, so it's disabled in the kube.tf, but this is the values.yaml I'm using. This also worked yesterday, so I'd say it's not related to the issue.
Do you have any ideas or advice on how to debug this further?

Kube.tf file:

module "k8s" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token
  version = "v2.7.0"
  source = "kube-hetzner/kube-hetzner/hcloud"
  cluster_name = "dev-worker-1"
  ssh_public_key = file("./keys/id_ed25519.pub")
  ssh_private_key = file("./keys/id_ed25519")
  control_plane_nodepools = [
    {
      name = "control-plane-fsn1",
      server_type = "cax21",
      location = "fsn1",
      labels = [],
      taints = [],
      count = 3
    },
  ]
  agent_nodepools = [
    {
      name = "agent-cpx31",
      server_type = "cpx31",
      location = "fsn1",
      labels = [],
      taints = [],
      count = 0
    },
    {
      name = "egress",
      server_type = "cax11",
      location = "fsn1",
      labels = [
        "node.kubernetes.io/role=egress"
      ],
      taints = [
        "node.kubernetes.io/role=egress:NoSchedule"
      ],
      floating_ip = true
      count = 1
    },
    {
      name = "agent-arm-medium"
      server_type = "cax31"
      location = "fsn1"
      count = 1
      labels = [
        "kubernetes.io/arch=arm64"
      ],
      taints = [
        "kubernetes.io/arch=arm64:NoSchedule"
      ],
    },
  ]
  autoscaler_nodepools = [
    {
      name = "autoscaled-small"
      server_type = "cpx31"
      location = "fsn1"
      min_nodes = 0
      max_nodes = 0
    },
    {
      name = "autoscaled-medium"
      server_type = "cpx41"
      location = "nbg1"
      min_nodes = 2
      max_nodes = 3
    },
  ]
  network_region = "eu-central"
  load_balancer_type = "lb11"
  load_balancer_location = "fsn1"
  # Use dedicated load balancer for control plane
  use_control_plane_lb = true
  restrict_outbound_traffic = false
  # Use cilium as CNI as it supports egress nodes
  cni_plugin = "cilium"
  cluster_ipv4_cidr = local.cluster_ipv4_cidr
  cilium_values = <<EOT
ipam:
  operator:
    clusterPoolIPv4PodCIDRList:
      - ${local.cluster_ipv4_cidr}
kubeProxyReplacement: strict
l7Proxy: "false"
bpf:
  masquerade: "true"
egressGateway:
  enabled: "true"
extraConfig:
  mtu: "1450"
EOT
  # Install it manually
  enable_longhorn = false
  # Install prometheus + prometheus-adapter instead
  enable_metrics_server = false
  # Install it manually
  ingress_controller = "none"
  cluster_autoscaler_extra_args = [
    "--enforce-node-group-min-size=true"
  ]
  create_kubeconfig = false
  create_kustomization = false
  initial_k3s_channel = "v1.27"
  automatically_upgrade_k3s = false
  automatically_upgrade_os = false
  k3s_exec_server_args = "--kube-apiserver-arg enable-admission-plugins=PodTolerationRestriction,PodNodeSelector"
}

Screenshots: No response
Platform: Linux
-
@janosmiko You might want to SSH into a node and see the logs. Please refer to the Debug section in the readme. @aleksasiriski Maybe you would know something about that specific issue?
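For reference, something along these lines (a minimal sketch; the key path comes from the kube.tf above, the node IP is a placeholder):

```sh
# SSH into the affected node (kube-hetzner nodes accept root over SSH)
ssh -i ./keys/id_ed25519 root@<node-ip>

# Then look for NFS / mount / Longhorn errors around the time of the failure
journalctl --since "1 hour ago" | grep -iE "nfs|mount|longhorn"
```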
-
Hi @mysticaltech, I found those logs in the node's journalctl (on the node where the pod that needs the RWX volume runs). And actually everything else is working well (e.g. a pod with an RWO volume works as expected).
-
@janosmiko Please have a look at whether the nfs packages are installed. If not, make sure you are using the latest version of the nodes. See in the packer file how nfs is installed and do the same manually; if that solves it, we will have identified the problem.
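A sketch of that manual check and install on a MicroOS node (on MicroOS, package installs go through transactional-update and need a reboot to take effect):

```sh
# Is the NFS client package present?
rpm -q nfs-client

# If not, install it into a new snapshot and reboot into that snapshot
transactional-update pkg install nfs-client
reboot
```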
-
Sure, it's installed:
-
Maybe this one is related?
-
I just did a rollback to the previous version of MicroOS (it did an auto-upgrade at midnight) and now the issue is solved on that node. They definitely updated something that broke it. For anyone who faces the same issue:
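Roughly, the rollback looks like this (a sketch, not the exact commands from the original comment):

```sh
# List the root filesystem snapshots and pick the last known-good one
snapper list

# Make that snapshot the default again and reboot into it
transactional-update rollback <snapshot-number>
reboot
```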
If you want to disable system upgrade manually:
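On an already running node, the nightly MicroOS upgrade is driven by a systemd timer, so disabling it by hand could look like this (a sketch; the readme's upgrade section describes the supported way):

```sh
# Stop and disable the timer that triggers the automatic transactional-update run
systemctl --now disable transactional-update.timer
```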
@mysticaltech I think there's also a bug in the terraform module.
And none of them seems to work.
-
Same here. Disaster :-/
-
@janosmiko The upgrade flags are not retroactive; they take effect on the first deployment only. But see the upgrade section in the readme, you can disable it manually. About the nfs-client, just freeze the version with zypper (via transactional-update shell). After the version is frozen, you can let the upgrades run.
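A sketch of that freeze (it assumes the currently installed nfs-client is still the working version):

```sh
# Open a transactional shell so the change lands in a new snapshot
transactional-update shell

# Inside the shell: lock the package so future 'zypper dup' runs leave it alone
zypper addlock nfs-client
exit

# Reboot into the snapshot containing the lock
reboot
```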
-
Can these be applied to the autoscaled nodes too? Freezing the nfs-client can be a good solution for the already existing nodes, but autoscaled (newly created) nodes will be created using the new package version. :/
-
I reported it here:
-
@janosmiko Yes you can ssh into autoscaled nodes too. And what you could do is freeze the version at the packer level and publish the new snapshot (just apply packer again, see readme).
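Rebuilding and publishing the snapshots is roughly this (a sketch; the template filename is the one shipped in the kube-hetzner repo and may differ in your checkout):

```sh
# From the directory containing the kube-hetzner packer template
export HCLOUD_TOKEN=<your-hcloud-token>
packer init hcloud-microos-snapshots.pkr.hcl
packer build hcloud-microos-snapshots.pkr.hcl
```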
-
@janosmiko @Robert-turbo If you folks can give me the working version of the nfs-client, I will freeze it at the packer level. These kinds of packages do not need to get updated often. (Then you can just recreate the packer image again, I will tell you how, just one command, so that all new nodes get a working version.)
-
Working version:
Problematic version:
-
Folks, see the solution here: #1018. A PR is also merging right away to avoid this problem in the future.
-
Should be fixed in v2.8.0, but an image update is needed; please follow the steps laid out in #794.
-
(The solution was to install and freeze an older version of nfs-client; see the solution linked above for manual fixes.)
-
Hi @mysticaltech, it still doesn't work. I even tested it by manually pinning the nfs-client version on all my nodes, and the RWX Longhorn volumes are still not mounted. See https://bugzilla.suse.com/show_bug.cgi?id=1216201#c2. Also, installing the x86-64 package on the ARM snapshots will not work.
-
@mysticaltech See the progress in the related bug report: longhorn/longhorn#6857
-
Thanks for the details @janosmiko, you are right. Will revert the change to pin the version and wait for more feedback on this issue.
-
The changes to the base images pinning nfs-client were reverted in v2.8.1.
-
@janosmiko As this is a Longhorn bug, there is nothing else we can do here, closing for now. Thanks again for all the research and the info.
-
It's not a Longhorn bug, but actually a bug in the Linux kernel. For anyone who faces the same issue and wants a real (and tested) solution... SSH to all your worker nodes and run these commands:
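The exact commands from the original comment are not reproduced here, but pinning an older kernel on MicroOS roughly looks like this (the version is a placeholder, pick one from before the regression):

```sh
# Open a transactional shell so the changes land in a new snapshot
transactional-update shell

# List the kernel versions the repos still offer, then install a known-good one
zypper search -s kernel-default
zypper install --oldpackage kernel-default=<known-good-version>

# Lock the package so the nightly 'zypper dup' cannot upgrade it again
zypper addlock kernel-default
exit

# Reboot into the snapshot with the pinned kernel
reboot
```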
If you'd like to make sure the autoscaled nodes also have this pinned kernel, delete the previous snapshots from hcloud, then modify the packer config with these:
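The original packer changes are not reproduced here either; conceptually, the same zypper steps get baked into the snapshot build so every new (autoscaled) node comes up with the pinned kernel, e.g. as extra lines in the template's provisioning script (placement and version are assumptions):

```sh
zypper --non-interactive install --oldpackage kernel-default=<known-good-version>
zypper --non-interactive addlock kernel-default
```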
and rerun the packer build. Wait for the images to be built, and finally run terraform apply again.
-
Note: @mysticaltech I saw it somewhere else too, but it's kind of sad that it seems more important to you to close bug reports ASAP than to help your users who are in trouble and trying to use your project.
-
@janosmiko Absolutely not, I help as best I can. I even spent a few hours Sunday night implementing a fix you suggested that turned out not to be it. Now, the bug is from Longhorn; there is nothing to do on our side but warn about it, hence pinning the issue.
-
I haven't suggested that fix; you talked with another guy about that in the linked discussion. But I assume neither of you tested whether it really works and solves the issue.
I mentioned a temporary solution in the comment above: #1016 (comment)