fix ceph errors on the 6.0 kernel #1393
Comments
I confirm I have pretty much the same issue, with a slightly different setup: As soon as I upgrade an OSD node to
Thanks to @dustymabe for the idea, I tried overriding the kernel. In case you want to try that yourself, the command to override the kernel is:
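Roughly like this (sketch only: the kernel NVR below is a placeholder, substitute whichever build you want to test):

```
# Sketch: override the kernel on Fedora CoreOS with a specific build from Koji.
# The 6.0.15-300.fc37 NVR is illustrative only -- replace it with the build you want to test.
sudo rpm-ostree override replace \
  https://kojipkgs.fedoraproject.org/packages/kernel/6.0.15/300.fc37/x86_64/kernel-6.0.15-300.fc37.x86_64.rpm \
  https://kojipkgs.fedoraproject.org/packages/kernel/6.0.15/300.fc37/x86_64/kernel-core-6.0.15-300.fc37.x86_64.rpm \
  https://kojipkgs.fedoraproject.org/packages/kernel/6.0.15/300.fc37/x86_64/kernel-modules-6.0.15-300.fc37.x86_64.rpm
sudo systemctl reboot

# To go back to the kernel shipped with the OS image later:
#   sudo rpm-ostree override reset kernel kernel-core kernel-modules
```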
hey @darknao - thanks for the update and for testing that out. This tells us that there was a problem introduced in this transition:
To narrow down the search for the problem commit, could you possibly try with the
Sometimes with issues like these we also find that the problem has already been found and fixed upstream. If you could try with the
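As an aside, one way to check whether any ceph-related changes landed between two upstream kernel versions is to look at the ceph paths in the stable tree (sketch; the tags are placeholders for whichever two versions are being compared):

```
# Sketch: list kernel ceph client changes between two stable tags (placeholders v6.0.X / v6.0.Y).
# Run from a checkout of the linux-stable tree.
git log --oneline v6.0.X..v6.0.Y -- fs/ceph net/ceph include/linux/ceph drivers/block/rbd.c
```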
Here are the results:
I'll keep one node on
Oh interesting. So it's possible a new commit introduced this somewhere in between. Thank you for testing this and reporting the information!
What's crazy is that there aren't any commits upstream to ceph between
kernel-6.0.18-300.fc37
Not crazy at all. We ran into a strikingly similar problem when the FCOS update hit, except we did not watch the problem from the client (kernel RBD) side. We had 100% missing PGs after the upgrade, with OSDs toggling between UP and DOWN constantly. I don't have the logs anymore (stupid me), but what we observed was this:
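(Generic sketch of where this shows up, not our exact output -- just the standard ceph status commands:)

```
# Where OSD flapping and missing PGs are visible:
ceph -s                 # overall health and PG summary
ceph osd tree           # which OSDs are currently up/down
ceph health detail      # per-OSD / per-PG warnings
ceph pg stat            # PG state counts
```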
My best guess is that it's not a problem in the
In the end we rolled back the OSD nodes to the previous release, while all other nodes followed through with the update. No problems since then, so we definitely would not blame the
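For reference, rolling an OSD node back to the previous FCOS deployment is straightforward as long as it is still on disk (sketch):

```
# Reboot into the previous ostree deployment.
sudo rpm-ostree rollback --reboot

# After the reboot, optionally pin the booted deployment so it is not garbage collected.
sudo ostree admin pin 0

# Note: zincati may try to update the node again unless updates are paused.
```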
@punycode sounds like we have the same problem. We are still running FCOS 37.20230110.3.1, so this could be investigated if anyone is interested.
kernel-6.0.18-300.fc37
The fix for this went into |
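Once a node has taken the update, a quick way to confirm what it is actually running (generic sketch):

```
# Show the booted FCOS deployment:
rpm-ostree status --booted

# Confirm the running kernel version:
uname -r
```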
Describe the bug
I have 2 master nodes and 1 worker node, all running Fedora CoreOS. All 3 are VMs. The two master nodes have two SR-IOV VFs from the host's ConnectX-3 NIC: a LAN interface and a BGP peering interface to an HA router pair via kube-vip. The worker node only has a LAN VF, also via SR-IOV but from an Intel I350 NIC. k8s was set up with kubespray and has been upgraded to kubespray v2.20.0 / k8s v1.24.6. It is managed with flux2, and rook v1.10.10 is deployed to provide storage via ceph rbd using an NVMe drive on each node.
When one of my nodes upgraded via zincati to 37.20230110.3.1, I noticed that anything relying on storage was no longer functioning. There were many different errors from different points in the system.
The kernel said:
Jan 27 03:08:43 k8s-w-node3.lan kernel: libceph: osd2 (1)10.233.66.114:6800 socket closed (con state V1_BANNER)
I checked cilium thoroughly at this point, running through all of the troubleshooting steps to see if any traffic was being dropped. I ended up doing a pcap via toolbox on port 6800. There was plenty of traffic but occasionally a client would connect with a "v027" banner and get disconnected quickly. I don't know enough about ceph to comment further on this.
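A capture like that can be done from a toolbox container on the node, for example (sketch; adjust the interface and filter for your setup):

```
# Enter a privileged toolbox container on the FCOS host (it shares the host network).
toolbox

# Inside the toolbox:
dnf install -y tcpdump
tcpdump -nn -i any 'tcp port 6800' -w /tmp/ceph-osd.pcap
```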
Although ceph said it was HEALTH_OK, one OSD was reported as having "slow ops". It seemed to match up with the OSD on the upgraded node. Hopping into the rbdplugin container, I noticed a bunch of hung fsck commands. I don't have this on my screen any longer.
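Those hung fsck processes are also visible from the host side, since the host can see processes in every container (sketch):

```
# On the affected node: look for fsck processes (spawned by the rbd CSI plugin) that never finish.
ps -eo pid,etime,comm,args | grep -E '[f]sck'
```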
Here are some bits of the kernel log from when things went bad:
I'd be happy to send a full log directly to someone or do any leg work to help debug this.
Reproduction steps
Expected behavior
Unicorns and rainbows
Actual behavior
Storage issues preventing proper operation/creation of pods
System details
Ignition config
No response
Additional information
No response