2 nodes go into NotReady state when 1 node goes NotReady in an HA cluster #4596
Comments
Hello @mathnitin, Thank you for reporting your issue. From the inspection report, we can see that both your first node and your second node rebooted (at Aug 01 12:53:40 and Aug 01 13:04:06, respectively), which caused the microk8s snap to restart. When a node goes down, a re-election of the database leader node occurs based on the principles of the Raft algorithm. This re-election process happens over the network. Could you please describe the network glitch that led to this issue?
@louiseschmidtgen For node3 we disconnected the network adapter of the VM. We did not perform any operations on the first node or the second node.
Thank you for the additional information @mathnitin. Would you be willing to reproduce this issue with additional flags enabled? Your help in resolving this issue is much appreciated!
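The specific flags being requested are not preserved in this transcript. As a rough sketch, dqlite tracing on a MicroK8s node is commonly enabled through environment variables for the k8s-dqlite daemon; the env-file path and variable names below are assumptions based on typical MicroK8s layouts, not the exact instructions from this thread.

```bash
# Hypothetical sketch: enable verbose dqlite/raft tracing on a MicroK8s node.
# The env-file path and variable names are assumptions, not the exact
# instructions given in this thread.
echo "LIBDQLITE_TRACE=1" | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite-env
echo "LIBRAFT_TRACE=1"   | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite-env

# Restart the dqlite daemon so the new environment takes effect.
sudo snap restart microk8s.daemon-k8s-dqlite
```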
@louiseschmidtgen We tried a few options. We reduced the load on our systems; we are now running just microk8s and 2 test ubuntu pods (reference: https://gist.github.com/lazypower/356747365cb80876b0b336e2b61b9c26). We are still able to reproduce this. For collecting the dqlite logs, we tried on Node 2 and Node 3 (inspect reports attached). For this run, we disconnected the node3 network and all 3 nodes went into NotReady state after a few minutes. Recovery time is about 15 minutes or so, as before.
Pod snapshot on the cluster
Also for this run, I described the node and collected the output
### Additional information
This run is on microk8s v1.28.12, with the Node 1 network disconnected:
Attaching logs for node2 and node3 when both are reporting NotReady state: inspection-report-20240802_114743_node3_NotReady.tar.gz. Also attached are the logs after node2 and node3 recovered, and the logs for node1.
Hi @mathnitin and @veenadong, thanks for helping us get to the bottom of this. @mathnitin, about the attached reports. EDIT: it looks like the cutoff in the journalctl logs is due to the limit set in microk8s' inspect script here. If these machines are still available, you could gather more complete logs by running journalctl manually. @veenadong, if you can trigger the issue repeatably, could you follow @louiseschmidtgen's instructions here to turn on dqlite tracing and run journalctl manually as well?
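For reference, gathering fuller journals than the inspect script's truncated excerpt might look roughly like the following; the unit names assume a standard MicroK8s snap install with the kubelite and k8s-dqlite daemons.

```bash
# Pull the full journal for the main MicroK8s services on one node,
# instead of the truncated excerpt bundled into the inspect tarball.
sudo journalctl -u snap.microk8s.daemon-kubelite \
                -u snap.microk8s.daemon-k8s-dqlite \
                --since "2024-08-01" --no-pager > "microk8s-journal-$(hostname).log"
```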
@cole-miller Just recreated the issue on the setup. This time, I set the log level of dqlite as suggested. Timelines for the attached inspect reports (approximate times, in PST):
- Inspect reports of node1 and node3 when all 3 nodes are in NotReady state
- Inspect reports of node1 and node3 after node1 and node3 recovered
- Inspect reports of node2 and node3 when all 3 nodes are in Ready state
Please let us know if you need any other info.
Hi @mathnitin, Thank you for providing further inspection reports. We have been able to reproduce the issue on our end and are in the process of narrowing down the cause. We appreciate all your help!
@louiseschmidtgen any update or recommendations for us to try?
Not yet @mathnitin, we are still working on it.
Hi @mathnitin, We've identified the issue and would appreciate it if you could try installing MicroK8s from the temporary channel we provided. Your assistance in helping us address this issue is greatly appreciated!
@louiseschmidtgen Thanks for providing the patch. Yes, we will install and test it. We should have an update for you by tomorrow.
@louiseschmidtgen We did our testing with the dev version and below are our observations.
One question: is there a command we can use to find the dqlite leader at a given point in time?
@mathnitin, thank you for your feedback on our dev version! We appreciate it and will take it into consideration for improving our solution. To find out who the dqlite leader is, you can run the following command:
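The exact command from this reply is not preserved in the transcript. A commonly used approach, assuming the dqlite client bundled inside the MicroK8s snap and the default backend paths, looks roughly like this:

```bash
# Ask the dqlite cluster who the current leader is, using the client shipped
# with the MicroK8s snap. Binary and certificate paths are assumptions based
# on the default snap layout.
sudo /snap/microk8s/current/bin/dqlite \
  -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  k8s ".leader"
```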
@louiseschmidtgen We saw a new issue on the dev version. For one of our runs, when we disconnected the network, microk8s lost HA. Below is the microk8s status of all 3 nodes after we connected the network back. I don't know if we will be able to give you steps to recreate it; if we do, we will let you know.
Node Inspect reports
Hi @mathnitin, thank you for reporting your new issue with the dev-fix and including the inspection reports. Thank you for your patience and your help in improving our solution!
Hi @mathnitin, could you please let us know which node you disconnected from the network?
For the inspect report, we disconnected node 1. Since it was the dqlite leader, the cluster became unhealthy.
Hello @mathnitin, Thank you for providing the additional details. Unfortunately, we weren't able to reproduce the issue on the dev version. I will be unavailable next week, but @berkayoz will be taking over and can help you in my absence.
@louiseschmidtgen @berkayoz Also, do we have any insight into why the data plane is lost for approx 30 sec? When we take the dqlite leader out, this stretches to over 1 min 40 sec for us.
@mathnitin How soon was this snapshot created? Could it be right after the node 3 join operation? Could you provide more information (the deployment manifest etc.) and possible reproduction steps for the data plane connection/timeout issues you've mentioned? Thank you.
@berkayoz Please see the comments inline.
The snapshot was created after making sure the cluster was in a healthy state. However, we are not able to recreate this issue.
Below is the nginx yaml file we are deploying. We have exposed the same nginx deployment with metallb as well.
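The actual manifest is not included in this transcript; a minimal stand-in that matches the description (a replicated nginx Deployment exposed through a NodePort Service, with hypothetical names and port numbers) could look like:

```bash
# Hypothetical stand-in for the nginx deployment described above; the real
# manifest from the thread is not preserved. Applies a 3-replica nginx
# Deployment and exposes it on a fixed NodePort.
microk8s kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-test-nodeport
spec:
  type: NodePort
  selector:
    app: nginx-test
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30080
EOF
```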
The sample script we are using to check whether the data plane is operational or not.
Below is the NodePort script. We are making sure the node IP is not that of the node we have brought down.
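The original scripts are not attached here; a rough equivalent of the NodePort data-plane probe, as described (node IP and port are placeholders, not values from the thread), might be:

```bash
#!/usr/bin/env bash
# Hypothetical data-plane probe approximating the script described above:
# repeatedly curl the nginx NodePort on a node that was NOT disconnected
# and log any failed requests with a timestamp.
NODE_IP="10.0.0.11"   # placeholder: IP of a surviving node
NODE_PORT="30080"     # placeholder: NodePort of the nginx service

while true; do
  if ! curl -s -o /dev/null --max-time 2 "http://${NODE_IP}:${NODE_PORT}"; then
    echo "$(date '+%F %T') data plane request FAILED"
  fi
  sleep 1
done
```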
Hey @mathnitin, We are working toward a final fix and are currently looking into it further. I'll provide some comments related to the feedback you have provided on the dev version/possible fix.
I've run some tests regarding this; my findings are as follows:
I've tried to reproduce this with the same setup.
We could not reproduce this and we believe the issue is not related to the patch in the dev version. I'll keep updating here with new progress; let me know if you have any other questions or observations.
Hey @mathnitin I've looked more into your feedback and I have some extra comments.
I've stated previously that there is an extra delay for a node that is also the dqlite leader. In testing, the first created node is usually the dqlite leader. Additionally, this node will also be the leader for other control-plane components.
The dqlite leader election itself happens pretty quickly; most of the observed delay comes from the default Kubernetes timeouts for detecting a failed node.
You can override these defaults; an illustrative sketch follows below.
This will result in quicker node fail-over and status detection. These changes should also reduce the period of failing requests in the nginx data plane testing.
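The exact overrides from this comment are not preserved. On MicroK8s, the usual knobs for faster NotReady detection are the kubelet status-update frequency, the controller-manager node-monitor grace period, and the API server's default tolerations for unreachable nodes; the sketch below uses illustrative values and the standard MicroK8s args-file locations, not the numbers recommended in the thread.

```bash
# Illustrative values only: not the exact numbers suggested in the thread.
# Apply on every node, then restart MicroK8s.

# Kubelet: report node status more often than the default.
echo "--node-status-update-frequency=4s" | sudo tee -a /var/snap/microk8s/current/args/kubelet

# Controller manager: mark an unresponsive node NotReady sooner.
echo "--node-monitor-grace-period=16s" | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager

# API server: shorten the default tolerations so pods are evicted/rescheduled sooner.
echo "--default-not-ready-toleration-seconds=30"   | sudo tee -a /var/snap/microk8s/current/args/kube-apiserver
echo "--default-unreachable-toleration-seconds=30" | sudo tee -a /var/snap/microk8s/current/args/kube-apiserver

sudo snap restart microk8s
```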
@berkayoz Thanks for the recommendation. We tried the configuration changes. For the network-disconnect use case, we are noticing a change in the control plane detection time. You are correct that the data plane loss is not a complete loss; these are intermittent failures. We would have assumed the failures would occur in a round-robin fashion, however they are consistent for a few seconds in batches. Is there a way we can improve this?
Hey @mathnitin, Kube-proxy's behaviour here depends on the proxy mode it is running in. It could also be possible to declare a node NotReady faster by adjusting the node status and monitoring timeouts mentioned above.
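The proxy mode name is cut off in this transcript. Assuming IPVS mode is what was meant (an assumption; the thread does not preserve it), switching kube-proxy's mode on a MicroK8s node might look like:

```bash
# Assumption: the thread refers to kube-proxy's IPVS mode; the mode name is
# not preserved here. Apply on every node.
echo "--proxy-mode=ipvs" | sudo tee -a /var/snap/microk8s/current/args/kube-proxy

# IPVS mode needs the ipvs kernel modules loaded.
sudo modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh

sudo snap restart microk8s
```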
We also see what appears to be the same issue here on v1.29, which we have been trying to bottom out. Happy to provide further logs or test fixes if appropriate.
Also seeing this on 1.29.4 deployments with 3 nodes.
Hello @cs-dsmyth and @kcarson77,
Hello @mathnitin, the fix is now available in MicroK8s. Thank you for raising the issue and for providing data and helping with testing to reach the solution.
Hello @mathnitin, I would also like to point you to the kube-proxy configuration discussed above. We will publish documentation on how to run kube-proxy in that mode.
@louiseschmidtgen Can you please provide the PR you merged in the dqlite repo and the microk8s 1.28 branch? We are following the https://discuss.kubernetes.io/t/howto-enable-fips-mode-operation/25067 steps to build the private snap package and realized the changes are not merged to the fips branch.
Hello @mathnitin, This is the patch PR for the 1.28 (classic) microk8s: #4651. If you encounter any issues building the fips snap, please open another issue and we will be happy to help you resolve them.
Hi @mathnitin, if you are building the fips snap I would recommend pointing k8s-dqlite to the latest tag. I hope your project goes well; thank you again for contributing to the fix. I will be closing this issue.
Summary
We have a 3-node MicroK8s HA-enabled cluster running microk8s version 1.28.7. If one of the 3 nodes (say node3) experiences a power outage or network glitch and is not recoverable, another node (say node1) goes into NotReady state. Node1 stays NotReady for about 15+ minutes, and this can sometimes take up to 30 minutes.

What Should Happen Instead?
Only 1 node should be in the NotReady state. The other 2 nodes should stay healthy and working.
Reproduction Steps
1.28.7
Introspection Report
Node 1 Inspect report
inspection-report-20240801_130021.tar.gz
Node 2 Inspect report
inspection-report-20240801_131139.tar.gz
Node 3 Inspect report
inspection-report-20240801_130117.tar.gz
Additional information
Timelines for reference for the attached inspect reports. These are approximate times, in PST.
Aug 1 12:40 <- node3 network went out (manually triggered).
Aug 1 12:41 <- node1 went into NotReady state.
Aug 1 12:56 <- node1 recovered.
Aug 1 12:59 <- node3 network was re-established.
Aug 1 13:01 <- all nodes are in a healthy state.