
2 nodes go into NotReady state when 1 node goes into NotReady state in an HA cluster. #4596

Closed
mathnitin opened this issue Aug 1, 2024 · 35 comments

@mathnitin

Summary

We have a 3-node MicroK8s HA-enabled cluster running microk8s version 1.28.7. If one of the 3 nodes (say node3) experiences a power outage or network glitch and is not recoverable, another node (say node1) also goes into NotReady state. Node1 stays NotReady for 15+ minutes, and this can sometimes take up to 30 minutes.

What Should Happen Instead?

Only 1 node should be in the NotReady state. The other 2 nodes should remain healthy and working.

Reproduction Steps

  1. Bring up a 3-node HA cluster with microk8s 1.28.7.
  2. Find the node that has the most load and take down the network of that node.
  3. Observe kubectl get nodes to monitor the status of the nodes (see the example below).
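
For step 3, one way to keep an eye on node status from a node that stays connected (the same watch pattern used later in this thread; plain kubectl or microk8s kubectl both work here):

# Poll node status every few seconds from a connected node.
watch -n 5 'microk8s kubectl get nodes -o wide'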

Introspection Report

Node 1 Inspect report
inspection-report-20240801_130021.tar.gz

Node 2 Inspect report
inspection-report-20240801_131139.tar.gz

Node 3 Inspect report
inspection-report-20240801_130117.tar.gz

Additional information

Timelines for reference for the attached inspect reports. They are approximate times, in PST.
Aug 1 12:40 <- node3 network went out (manually triggered).
Aug 1 12:41 <- node1 went into NotReady state.
Aug 1 12:56 <- node1 recovered.
Aug 1 12:59 <- node3 network was re-established.
Aug 1 13:01 <- all nodes are in a healthy state.

@louiseschmidtgen
Contributor

Hello @mathnitin,

Thank you for reporting your issue.

From the inspection report, we can see that both your first node and your second node rebooted (at Aug 01 12:53:40 and Aug 01 13:04:06, respectively), which caused the microk8s snap to restart. When a node goes down, a re-election of the database leader node occurs based on the principles of the Raft algorithm. This re-election process happens over the network.

Could you please describe the network glitch that led to this issue?

@mathnitin
Author

@louiseschmidtgen For node3 we disconnected the network adapter of the VM. We did not perform any operations on the first node or the second node.

@louiseschmidtgen
Contributor

Thank you for the additional information @mathnitin.

Would you be willing to reproduce this issue with additional flags enabled?
Please uncomment the flags LIBDQLITE_TRACE=1 and LIBRAFT_TRACE=1 in k8s-dqlite-env, which is under /var/snap/microk8s/current/args/ (see the example below).
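
For example, on each node (a sketch that assumes the two variables are already present but commented out in that file, which is the usual layout; adjust the sed patterns if your file differs):

# Uncomment the dqlite and raft tracing variables.
sudo sed -i 's/^#\s*LIBDQLITE_TRACE=1/LIBDQLITE_TRACE=1/; s/^#\s*LIBRAFT_TRACE=1/LIBRAFT_TRACE=1/' \
  /var/snap/microk8s/current/args/k8s-dqlite-env
# Restart the datastore daemon so the new environment takes effect.
sudo snap restart microk8s.daemon-k8s-dqlite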

Your help in resolving this issue is much appreciated!

@mathnitin
Author

mathnitin commented Aug 2, 2024

@louiseschmidtgen We tried a few options. We reduced the load on our systems; we are now running just microk8s and 2 ubuntu test pods (reference: https://gist.github.com/lazypower/356747365cb80876b0b336e2b61b9c26). We are able to reproduce this on both 1.28.7 and 1.28.12.

For collecting the dqlite logs, we tried on 1.28.7. Attached are the logs for the same.
Node 1 Inspect report:
node-1-inspection-report-20240802_114324.tar.gz

Node 2 Inspect report:
node-2-inspection-report-20240802_114324.tar.gz

Node 3 Inspect report:
node-3-inspection-report-20240802_114340.tar.gz

For this run, we disconnected the node3 network and all 3 nodes went into NotReady state after a few minutes. Recovery time is about 15 minutes, as before.

core@glop-nm-120-mem1:~$ kubectl get nodes
NAME                                     STATUS   ROLES    AGE   VERSION
glop-nm-120-mem1.glcpdev.cloud.hpe.com   Ready    <none>   18h   v1.28.7
glop-nm-120-mem2.glcpdev.cloud.hpe.com   Ready    <none>   17h   v1.28.7
glop-nm-120-mem3.glcpdev.cloud.hpe.com   Ready    <none>   17h   v1.28.7
core@glop-nm-120-mem1:~$ watch 'kubectl get nodes'
core@glop-nm-120-mem1:~$ kubectl get nodes
NAME                                     STATUS     ROLES    AGE   VERSION
glop-nm-120-mem1.glcpdev.cloud.hpe.com   NotReady   <none>   18h   v1.28.7
glop-nm-120-mem2.glcpdev.cloud.hpe.com   NotReady   <none>   17h   v1.28.7
glop-nm-120-mem3.glcpdev.cloud.hpe.com   NotReady   <none>   17h   v1.28.7
core@glop-nm-120-mem1:~$ << approx after 15 minutes >>
core@glop-nm-120-mem1:~$ kubectl get nodes
NAME                                     STATUS     ROLES    AGE   VERSION
glop-nm-120-mem1.glcpdev.cloud.hpe.com   Ready      <none>   18h   v1.28.7
glop-nm-120-mem2.glcpdev.cloud.hpe.com   Ready      <none>   18h   v1.28.7
glop-nm-120-mem3.glcpdev.cloud.hpe.com   NotReady   <none>   18h   v1.28.7

Pod snapshot on the cluster

$ kubectl get pods -A
NAMESPACE     NAME                                     READY   STATUS    RESTARTS      AGE
default       ubuntu1                                  1/1     Running   2 (17m ago)   17h
default       ubuntu2                                  1/1     Running   2 (17m ago)   17h
kube-system   calico-kube-controllers-77bd7c5b-mp5zw   1/1     Running   8 (17m ago)   18h
kube-system   calico-node-52jpp                        1/1     Running   8 (17m ago)   18h
kube-system   calico-node-cxtl4                        1/1     Running   8 (17m ago)   17h
kube-system   calico-node-tjjqw                        1/1     Running   3 (17m ago)   18h
kube-system   coredns-7998696dbd-2svgv                 1/1     Running   2 (17m ago)   17h
kube-system   coredns-7998696dbd-5p899                 1/1     Running   3 (17m ago)   17h
kube-system   coredns-7998696dbd-7xxpt                 1/1     Running   3 (17m ago)   17h
kube-system   metrics-server-848968bdcd-jkx6l          1/1     Running   8 (17m ago)   18h

Also for this run, I described the node and collected the output:

$ kubectl describe nodes glop-nm-120-mem1.glcpdev.cloud.hpe.com
Name:               glop-nm-120-mem1.glcpdev.cloud.hpe.com
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=glop-nm-120-mem1.glcpdev.cloud.hpe.com
                    kubernetes.io/os=linux
                    microk8s.io/cluster=true
                    node.kubernetes.io/microk8s-controlplane=microk8s-controlplane
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.245.244.122/24
                    projectcalico.org/IPv4VXLANTunnelAddr: 172.23.107.128
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 01 Aug 2024 17:12:54 -0700
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  glop-nm-120-mem1.glcpdev.cloud.hpe.com
  AcquireTime:     <unset>
  RenewTime:       Fri, 02 Aug 2024 11:25:37 -0700
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------    -----------------                 ------------------                ------              -------
  NetworkUnavailable   False     Fri, 02 Aug 2024 11:10:00 -0700   Fri, 02 Aug 2024 11:10:00 -0700   CalicoIsUp          Calico is running on this node
  MemoryPressure       Unknown   Fri, 02 Aug 2024 11:25:25 -0700   Fri, 02 Aug 2024 11:24:05 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Fri, 02 Aug 2024 11:25:25 -0700   Fri, 02 Aug 2024 11:24:05 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Fri, 02 Aug 2024 11:25:25 -0700   Fri, 02 Aug 2024 11:24:05 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Fri, 02 Aug 2024 11:25:25 -0700   Fri, 02 Aug 2024 11:24:05 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:  10.245.244.122
  Hostname:    glop-nm-120-mem1.glcpdev.cloud.hpe.com
Capacity:
  cpu:                64
  ephemeral-storage:  551044160Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      4Gi
  memory:             264105564Ki
  pods:               555
Allocatable:
  cpu:                64
  ephemeral-storage:  549995584Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      4Gi
  memory:             259808860Ki
  pods:               555
System Info:
  Machine ID:                 76cff06500b64c5e9b9ff6d48dfb5413
  System UUID:                4216f49d-c05e-d63f-0763-b001fa41d910
  Boot ID:                    88483455-62d0-42f7-a00d-e6acded32ec9
  Kernel Version:             5.15.0-111-fips
  OS Image:                   Ubuntu 22.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.15
  Kubelet Version:            v1.28.7
  Kube-Proxy Version:         v1.28.7
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                      ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-kube-controllers-77bd7c5b-mp5zw    0 (0%)        0 (0%)      0 (0%)           0 (0%)         18h
  kube-system                 calico-node-tjjqw                         250m (0%)     0 (0%)      0 (0%)           0 (0%)         17h
  kube-system                 coredns-7998696dbd-7xxpt                  100m (0%)     100m (0%)   128Mi (0%)       128Mi (0%)     17h
  kube-system                 metrics-server-848968bdcd-jkx6l           100m (0%)     0 (0%)      200Mi (0%)       0 (0%)         18h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                450m (0%)   100m (0%)
  memory             328Mi (0%)  128Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type    Reason          Age    From             Message
  ----    ------          ----   ----             -------
  Normal  RegisteredNode  2m14s  node-controller  Node glop-nm-120-mem1.glcpdev.cloud.hpe.com event: Registered Node glop-nm-120-mem1.glcpdev.cloud.hpe.com in Controller
  Normal  NodeNotReady    94s    node-controller  Node glop-nm-120-mem1.glcpdev.cloud.hpe.com status is now: NodeNotReady

Additional information
Timelines for reference for the attached inspect reports. They are approximate times, in PST.
Aug 2 11:12+ <- node3 network went out (manually triggered).
Aug 2 11:20 <- all 3 nodes went into NotReady state.
Aug 2 11:38 <- node1 and node2 recovered.
Aug 2 11:40 <- node3 network was re-established.
Aug 2 11:41+ <- all nodes are in a healthy state.

@veenadong

veenadong commented Aug 2, 2024

This run is on microk8s v1.28.12, with the node1 network disconnected:

core@glop-nm-115-mem2:~$ kubectl get nodes -o wide
NAME                                     STATUS     ROLES    AGE   VERSION    INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
glop-nm-115-mem1.glcpdev.cloud.hpe.com   NotReady   <none>   70m   v1.28.12   10.245.244.117   <none>        Ubuntu 22.04.4 LTS   5.15.0-111-fips   containerd://1.6.28
glop-nm-115-mem2.glcpdev.cloud.hpe.com   NotReady   <none>   59m   v1.28.12   10.245.244.118   <none>        Ubuntu 22.04.4 LTS   5.15.0-111-fips   containerd://1.6.28
glop-nm-115-mem3.glcpdev.cloud.hpe.com   NotReady   <none>   48m   v1.28.12   10.245.244.119   <none>        Ubuntu 22.04.4 LTS   5.15.0-111-fips   containerd://1.6.28

Attaching logs for node2, node3, when both are reporting NotReady state:

inspection-report-20240802_114743_node3_NotReady.tar.gz
inspection-report-20240802_114654_node2_NotReady.tar.gz

These are the logs after node2, node3 recovered:
inspection-report-20240802_115338_node2_Ready.tar.gz
inspection-report-20240802_115356_node3_Ready.tar.gz

These are the logs for node1:
inspection-report-20240802_120725_node1.tar.gz

@cole-miller
Contributor

cole-miller commented Aug 5, 2024

Hi @mathnitin and @veenadong, thanks for helping us get to the bottom of this.

@mathnitin, based on the journal.log logs in your most recent comment, it seems like node3 was the dqlite cluster leader before being taken offline, and after that node2 won the election to become the next leader. The node1 logs indicate that by 11:23 the new leader is successfully replicating at least some transactions. Unfortunately, the node2 logs, which are the most important for determining why the cluster is NotReady after node3 goes down, are cut off before 11:40, at which point the NotReady period is already over. Perhaps the size or age limits for your journald are keeping those older logs from being retained? If you could collect logs that show the whole period of time between taking node3 offline and recovery on all three nodes it'd be invaluable!

EDIT: It looks like the cutoff in the journalctl logs is due to the limit set in microk8s' inspect script here. If these machines are still available, you could gather more complete logs by doing journalctl -u snap.microk8s.daemon-k8s-dqlite -S '2024-08-02 11:11:00' -U '2024-08-02 11:43:00' on each affected node.

@veenadong, if you can trigger the issue repeatably, could you follow @louiseschmidtgen's instructions here to turn on dqlite tracing and run journalctl manually (journalctl -u snap.microk8s.daemon-k8s-dqlite -S $start_of_incident -U $end_of_incident) to get complete logs? Thanks in advance!

@mathnitin
Author

@cole-miller Just recreated the issue on the setup. This time, I set the log level of dqlite to 2. We are reverting the machines to different states, so we can't execute the journalctl command.

Timelines for the attached inspect reports. They are approximate times, in PST.
Aug 5 15:20 <- node2 network went out (manually triggered).
Aug 5 15:21 <- all 3 nodes went into NotReady state.
Aug 5 15:36 <- node1 and node3 recovered.
Aug 5 15:40 <- node2 network was re-established; all nodes went into Ready state.

Inspect report of node1 and node3 when all 3 nodes are in NotReady state
Node1:
all-3-nodes-down-glop-nm-120-mem1.tar.gz

Node3:
all-3-nodes-down-glop-nm-120-mem3.tar.gz

Inspect report of node1 and node3 when node1 and node3 recovered
Node1:
1-node-showing-down-glop-nm-120-mem1.tar.gz

Node3:
1-node-showing-down-glop-nm-120-mem3.tar.gz

Inspect report when all 3 nodes are in Ready State
Node1:
all-3-nodes-up-glop-nm-120-mem1.tar.gz

Node2:
all-3-nodes-up-glop-nm-120-mem2.tar.gz

Node3:
all-3-nodes-up-glop-nm-120-mem3.tar.gz

Please let us know if you need any other info

@louiseschmidtgen
Contributor

Hi @mathnitin,

Thank you for providing further inspection reports. We have been able to reproduce the issue on our end and are in the process of narrowing down the cause of the issue.

We appreciate all your help!

@mathnitin
Author

@louiseschmidtgen any update or recommendations for us to try?

@ktsakalozos
Member

Not yet @mathnitin, we are still working on it.

@louiseschmidtgen
Contributor

Hi @mathnitin,

We’ve identified the issue and would appreciate it if you could try installing MicroK8s from the temporary channel 1.28/edge/fix-ready. Please let us know if this resolves the problem on your end.

Your assistance in helping us address this issue is greatly appreciated!

@mathnitin
Author

@louiseschmidtgen Thanks for providing the patch. Yes, we will install and test it. We should have an update for you by tomorrow.

@mathnitin
Author

@louiseschmidtgen We did our testing with the dev version and below are our observations.

  • When a node is disconnected, we are not observing the other 2 nodes going into NotReady state.
  • With this fix, we are seeing a delay in the detection of nodes for "NotReady" state. The detection can take from 1 minute to 5 minutes. We are running watch on kubectl get nodes.
  • For data plane testing, we started 3 nginx pods with node anti-affinity and a NodePort service. We ran curl with a timeout of 1 second. For this test, we see connection timeouts of 10 seconds to 30+ seconds.

One question: is there a command we can use to find the dqlite leader at a given point in time?

@louiseschmidtgen
Copy link
Contributor

@mathnitin, thank you for your feedback on our dev version! We appreciate it and will take it into consideration for improving our solution.

To find out who the dqlite leader is, you can run the following command:

sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
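
Note: as shown in the later comments in this thread, you may need to prepend LD_LIBRARY_PATH=/snap/microk8s/current/usr/lib to the command above so the dqlite binary can find its bundled libraries.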

@mathnitin
Copy link
Author

@louiseschmidtgen We saw a new issue on the dev version. For one of our runs, when we disconnected the network, microk8s lost HA. This is the microk8s status of all 3 nodes after we connected the network back. I don't know if we will be able to give you steps to recreate it; if we do, we will let you know.

  • Logs for Node 1.
root@glop-nm-110-mem1:/var/snap/microk8s/common# cat .microk8s.yaml <-- Launch config to bringup microk8s
version: 0.2.0
persistentClusterToken: d3b0c44298fc1c149afbf4c8996fb925
addons:
  - name: rbac
  - name: metrics-server
  - name: dns
extraCNIEnv:
  IPv4_CLUSTER_CIDR: 172.23.0.0/16
  IPv4_SERVICE_CIDR: 172.29.0.0/23
extraKubeAPIServerArgs:
  --service-node-port-range: 80-32767
extraKubeletArgs:
  --max-pods: 555
  --cluster-domain: cluster.local
  --cluster-dns: 172.29.0.10
extraContainerdArgs:
  --root: /data/var/lib/containerd/
  --state: /data/run/containerd/
extraSANs:
  - 172.29.0.1
  - 172.23.0.1
root@glop-nm-110-mem1:/var/snap/microk8s/common# microk8s status
microk8s is running
high-availability: no
  datastore master nodes: glop-nm-110-mem1.glcpdev.cloud.hpe.com:19001
  datastore standby nodes: none
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    rbac                 # (core) Role-Based Access Control for authorisation
  disabled:
    cert-manager         # (core) Cloud native certificate management
    cis-hardening        # (core) Apply CIS K8s hardening
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    host-access          # (core) Allow Pods connecting to Host services smoothly
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    ingress              # (core) Ingress controller for external access
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    minio                # (core) MinIO object storage
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    prometheus           # (core) Prometheus operator for monitoring and logging
    registry             # (core) Private image registry exposed on localhost:32000
    rook-ceph            # (core) Distributed Ceph storage using Rook
    storage              # (core) Alias to hostpath-storage add-on, deprecated
root@glop-nm-110-mem1:/var/snap/microk8s/common#
root@glop-nm-110-mem1:/var/snap/microk8s/common# sudo -E LD_LIBRARY_PATH=/snap/microk8s/current/usr/lib /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
glop-nm-110-mem1.glcpdev.cloud.hpe.com:19001
  • Logs for Node 2.
root@glop-nm-110-mem2:/var/snap/microk8s/common# cat .microk8s.yaml
version: 0.2.0
join:
  url: glop-nm-110-mem1.glcpdev.cloud.hpe.com:25000/d3b0c44298fc1c149afbf4c8996fb925
extraCNIEnv:
  IPv4_CLUSTER_CIDR: 172.23.0.0/16
  IPv4_SERVICE_CIDR: 172.29.0.0/23
extraKubeAPIServerArgs:
  --service-node-port-range: 80-32767
extraKubeletArgs:
  --max-pods: 555
  --cluster-domain: cluster.local
  --cluster-dns: 172.29.0.10
extraContainerdArgs:
  --root: /data/var/lib/containerd/
  --state: /data/run/containerd/
extraSANs:
  - 172.29.0.1
  - 172.23.0.1
root@glop-nm-110-mem2:/var/snap/microk8s/common# microk8s status
microk8s is running
high-availability: no
  datastore master nodes: glop-nm-110-mem1.glcpdev.cloud.hpe.com:19001
  datastore standby nodes: none
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    rbac                 # (core) Role-Based Access Control for authorisation
  disabled:
    cert-manager         # (core) Cloud native certificate management
    cis-hardening        # (core) Apply CIS K8s hardening
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    host-access          # (core) Allow Pods connecting to Host services smoothly
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    ingress              # (core) Ingress controller for external access
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    minio                # (core) MinIO object storage
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    prometheus           # (core) Prometheus operator for monitoring and logging
    registry             # (core) Private image registry exposed on localhost:32000
    rook-ceph            # (core) Distributed Ceph storage using Rook
    storage              # (core) Alias to hostpath-storage add-on, deprecated
root@glop-nm-110-mem2:/var/snap/microk8s/common# sudo -E LD_LIBRARY_PATH=/snap/microk8s/current/usr/lib /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
glop-nm-110-mem1.glcpdev.cloud.hpe.com:19001
  • Logs for Node 3.
root@glop-nm-110-mem3:/var/snap/microk8s/common# cat .microk8s.yaml
version: 0.2.0
join:
  url: glop-nm-110-mem1.glcpdev.cloud.hpe.com:25000/d3b0c44298fc1c149afbf4c8996fb925
extraCNIEnv:
  IPv4_CLUSTER_CIDR: 172.23.0.0/16
  IPv4_SERVICE_CIDR: 172.29.0.0/23
extraKubeAPIServerArgs:
  --service-node-port-range: 80-32767
extraKubeletArgs:
  --max-pods: 555
  --cluster-domain: cluster.local
  --cluster-dns: 172.29.0.10
extraContainerdArgs:
  --root: /data/var/lib/containerd/
  --state: /data/run/containerd/
extraSANs:
  - 172.29.0.1
  - 172.23.0.1
root@glop-nm-110-mem3:/home/core# microk8s status
microk8s is running

high-availability: no
  datastore master nodes: none
  datastore standby nodes: none
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    rbac                 # (core) Role-Based Access Control for authorisation
  disabled:
    cert-manager         # (core) Cloud native certificate management
    cis-hardening        # (core) Apply CIS K8s hardening
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    host-access          # (core) Allow Pods connecting to Host services smoothly
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    ingress              # (core) Ingress controller for external access
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    minio                # (core) MinIO object storage
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    prometheus           # (core) Prometheus operator for monitoring and logging
    registry             # (core) Private image registry exposed on localhost:32000
    rook-ceph            # (core) Distributed Ceph storage using Rook
    storage              # (core) Alias to hostpath-storage add-on, deprecated
root@glop-nm-110-mem3:/home/core#
root@glop-nm-110-mem3:/home/core#
root@glop-nm-110-mem3:/home/core# sudo -E LD_LIBRARY_PATH=/snap/microk8s/current/usr/lib /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
glop-nm-110-mem1.glcpdev.cloud.hpe.com:19001
root@glop-nm-110-mem3:/home/core# microk8s kubectl get nodes
NAME                                     STATUS   ROLES    AGE   VERSION
glop-nm-110-mem1.glcpdev.cloud.hpe.com   Ready    <none>   46h   v1.28.12
glop-nm-110-mem2.glcpdev.cloud.hpe.com   Ready    <none>   46h   v1.28.12
glop-nm-110-mem3.glcpdev.cloud.hpe.com   Ready    <none>   17h   v1.28.12
root@glop-nm-110-mem3:/home/core#

@louiseschmidtgen
Contributor

Hi @mathnitin,

thank you for reporting your new issue with the dev fix, including the inspection reports.
We are taking a careful look at your logs and are trying to create a reproducer ourselves.

Thank you for your patience and your help in improving our solution!

@louiseschmidtgen
Contributor

Hi @mathnitin, could you please let us know which node you disconnected from the network?

@mathnitin
Author

For the inspect reports, we disconnected node1. Since it was the dqlite leader, the cluster became unhealthy.
Our observation is that the cluster should be running in HA mode, but somehow this cluster lost HA and only node1 is recognized as a datastore node. kubectl get nodes does show that all 3 nodes are part of the k8s cluster.

@louiseschmidtgen
Contributor

Hello @mathnitin,

Thank you for providing the additional details. Unfortunately, we weren't able to reproduce the issue on the dev version.
Could you please share the exact steps we need to follow to reproduce it?
On our side, removing and re-joining the node from the cluster doesn’t seem to trigger the failure.

I will be unavailable next week, but @berkayoz will be taking over and can help you in my absence.

@mathnitin
Author

@louiseschmidtgen @berkayoz
We tried to reproduce the HA disconnect on our side. We are also not able to reproduce it. We think our VMware snapshot was somehow corrupted, as every time we revert to that snapshot we see this issue. Were you able to find anything from the inspect reports?

Also, do you have any insight into why the data plane is lost for approximately 30 seconds? When we take the dqlite leader out, this loss spans over 1 minute 40 seconds for us.

For data plane testing, we started 3 nginx pods with node anti-affinity and a NodePort service. We ran curl with a timeout of 1 second. For this test, we see connection timeouts of 10 seconds to 30+ seconds.

@berkayoz
Member

@mathnitin
From the inspection reports we can see that cluster.yaml, which contains the dqlite members, is missing node3; this is consistent across all 3 nodes. This could be related to a disturbance/issue happening while the 3rd node was joining. There is a small window between a joining node being accepted and the cluster.yaml files being updated on the cluster members. Since we can observe the 3rd node in the Kubernetes cluster, the join operation was successful, but cluster.yaml possibly could not get updated in time.
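
For reference, you can compare the dqlite membership recorded on each node against the Kubernetes view (the cluster.yaml path below is the standard MicroK8s dqlite backend location used by the leader command earlier in this thread):

# Dqlite members as recorded on this node.
sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
# Kubernetes' view of the cluster, for comparison.
microk8s kubectl get nodes -o wide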

How fast was this snapshot created? Could it have been right after the node3 join operation?

Could you provide more information (the deployment manifest etc.) and possible reproduction steps for the data plane connection/timeout issues you've mentioned?

Thank you.

@mathnitin
Author

mathnitin commented Aug 20, 2024

@berkayoz Please see the comments inline.

How fast was this snapshot created? Could it have been right after the node3 join operation?

The snapshot was created after making sure the cluster was in a healthy state. However, we are not able to recreate this issue.

Could you provide more information (the deployment manifest etc.) and possible reproduction steps for the data plane connection/timeout issues you've mentioned?

Below are the nginx yaml files we are deploying. We have exposed the same nginx deployment with MetalLB as well.

$ cat nginx-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-nginx-service
spec:
  type: NodePort
  selector:
    app: my-nginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80

$ cat nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-nginx
  template:
    metadata:
      labels:
        app: my-nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - my-nginx
              topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80

Below are the sample scripts we are using to check whether the data plane is operational.
First, the MetalLB script.

#!/bin/bash

# URL to check
url="http://<VIP>"

# Counter for non-200 responses
non_200_count=0

while true
do
  # Perform the request
  status_code=$(curl --connect-timeout 1 -m 1 -s -o /dev/null -w "%{http_code}" "$url")
  echo $status_code
  echo $url
  date

  # Check if the status code is not 200
  if [ "$status_code" != "200" ]; then
    echo "Response code: $status_code"
    echo $url
    date
    non_200_count=$((non_200_count + 1))
  fi

  sleep 1
  if [ "$non_200_count" == "1000" ]; then
    # Print the count of non-200 responses
    echo "Count of non-200 responses: $non_200_count"
    break
  fi
done

# Print the count of non-200 responses
echo "Count of non-200 responses: $non_200_count"

Below is the NodePort script. We are making sure the node IP is not that of the node we have brought down.

#!/bin/bash

# URL to check
url="http://<NODE_IP>:<NODE_PORT>"

# Counter for non-200 responses
non_200_count=0

while true
do
  # Perform the request
  status_code=$(curl --connect-timeout 1 -m 1 -s -o /dev/null -w "%{http_code}" "$url")
  echo $url
  echo $status_code
  date

  # Check if the status code is not 200
  if [ "$status_code" != "200" ]; then
    echo "Response code: $status_code"
    echo $url
    date
    non_200_count=$((non_200_count + 1))
  fi

  sleep 1
  if [ "$non_200_count" == "1000" ]; then
    # Print the count of non-200 responses
    echo "Count of non-200 responses: $non_200_count"
    break
  fi
done

# Print the count of non-200 responses
echo "Count of non-200 responses: $non_200_count"

@berkayoz
Member

Hey @mathnitin,

We are working toward a final fix and are currently looking into the go-dqlite side of things with the team.

I'll provide some comments related to the feedback you have provided on the dev version/possible fix.

With this fix, we are seeing a delay in the detection of nodes for "NotReady" state. The detection can take from 1 minute to 5 minutes. We are running watch on kubectl get nodes.

I've run some tests regarding this, my findings are as follows:

  • The delay in detection of the NotReady or Ready state for a node that is not the dqlite leader is ~40s, which is aligned with the Kubernetes default.
  • The delay in detection for a node that is also the dqlite leader is longer; I've been observing around 80s-120s. This might be related to the datastore not being available while leader re-election is happening, which might lead to missed/dropped detection cycles. I am looking more into this situation.

For data plane testing, we started 3 nginx pods with node anti-affinity and a NodePort service. We ran curl with a timeout of 1 second. For this test, we see connection timeouts of 10 seconds to 30+ seconds.

I've tried to reproduce this with the NodePort approach. My observations for this are as follows:

  • We can observe some requests failing/timing out when a node is taken down; however, a total loss of the service is not observed. This seems to be related to how services are handled in Kubernetes: since a Kubernetes service load-balances (through kube-proxy) across matching pods, some requests are still sent to the pod on the node that was taken down until the node is marked NotReady and the pod is removed from the service endpoints (see the quick check after this list).
  • Taking down a non-dqlite leader node, the period of failing requests aligns with the ~40s default detection time.
  • Taking down a dqlite leader node, the period of failing requests aligns with the more delayed detection of the NotReady which is explained above.
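
As a quick check (a sketch reusing the service and label names from the nginx manifests shared earlier in this thread), you can watch the endpoints while a node is taken down:

# The pod on the failed node stays listed here until the node is marked
# NotReady and its pod is removed from the endpoints.
kubectl get endpoints my-nginx-service -w
# In another terminal, watch where the pods are scheduled.
kubectl get pods -l app=my-nginx -o wide -w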

We saw a new issue on the dev version. For one of our runs, when we disconnected the network, microk8s lost HA.

We could not reproduce this and we believe the issue is not related to the patch in the dev version.

I'll keep updating here with new progress; let me know if you have any other questions or observations.

@berkayoz
Member

Hey @mathnitin

I've looked more into your feedback and I have some extra comments.

With this fix, we are seeing a delay in the detection of nodes for "NotReady" state. The detection can take from 1 minute to 5 minutes. We are running watch on kubectl get nodes.

I've stated previously that there was an extra delay for a node that is also the dqlite leader. In testing, the first created node is usually the dqlite leader. Additionally, this node will also be the leader for the kube-controller-manager and kube-scheduler components. Taking down this node leads to multiple leader elections.

  • The dqlite cluster will perform a leader election since the leader is taken down.
  • kube-controller-manager will perform a leader election, and will have to wait for datastore to settle first since leases are used.
  • kube-scheduler will perform a leader election, and will have to wait for datastore to settle first since leases are used.

The dqlite leader election happens pretty quickly. For kube-controller-manager and kube-scheduler, MicroK8s adjusts the leader election configuration in these components to lower resource consumption. These adjustments are:

  • --leader-elect-lease-duration=60s
  • --leader-elect-renew-deadline=30s

You can override these (for example --leader-elect-lease-duration=15s and --leader-elect-renew-deadline=10s, to match the Kubernetes defaults) in the following files; a sketch of applying this is at the end of this comment:

  • /var/snap/microk8s/current/args/kube-scheduler
  • /var/snap/microk8s/current/args/kube-controller-manager

This will result in a quicker node fail-over and status detection.

These changes should also reduce the period of failing requests in the nginx data plane testing.
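
A minimal sketch of applying these overrides on each control-plane node (it assumes the two leader-elect flags are already present in the args files, as described above; add them if your files do not contain them):

# Set the faster leader-election values for both components.
for f in kube-controller-manager kube-scheduler; do
  sudo sed -i 's/^--leader-elect-lease-duration=.*/--leader-elect-lease-duration=15s/' /var/snap/microk8s/current/args/$f
  sudo sed -i 's/^--leader-elect-renew-deadline=.*/--leader-elect-renew-deadline=10s/' /var/snap/microk8s/current/args/$f
done
# Restart the combined kubelite daemon so the new values are picked up.
sudo snap restart microk8s.daemon-kubelite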

@mathnitin
Author

@berkayoz Thanks for the recommendation. We tried the configuration changes. For the network disconnect use case, we are noticing that the control plane detection of the NotReady state is faster. The data plane loss numbers remain the same.

You are correct that the data plane loss is not a complete loss; these are intermittent failures. We would have assumed the failures would be in round-robin fashion, however they come in consistent batches lasting a few seconds. Is there a way we can improve this?

@berkayoz
Member

Hey @mathnitin,

Kube-proxy in iptables mode selects the endpoint randomly; kube-proxy in ipvs mode has more options for load-balancing and uses round robin by default. This should match your round-robin expectation and may improve the failures. We are working on testing this change and on providing how-to steps and more info, and will update here with a follow-up comment.

It is also possible to declare a node NotReady faster by changing the --node-monitor-grace-period kube-controller-manager flag. This is 40s by default, in alignment with the upstream value. Lowering this value could reduce the request-failure period, but could result in undesired side effects if lowered too much (see the sketch below).
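
For example (a sketch only; the 20s value is an assumption to tune for your environment, and as noted above, going much lower risks spurious NotReady flaps because the kubelet posts status on its own interval):

# Set (or add) the flag in the kube-controller-manager args file.
ARGS=/var/snap/microk8s/current/args/kube-controller-manager
if grep -q -- '--node-monitor-grace-period' "$ARGS"; then
  sudo sed -i 's/^--node-monitor-grace-period=.*/--node-monitor-grace-period=20s/' "$ARGS"
else
  echo '--node-monitor-grace-period=20s' | sudo tee -a "$ARGS"
fi
# Restart kubelite so kube-controller-manager picks up the change.
sudo snap restart microk8s.daemon-kubelite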

@cs-dsmyth

We also see what appears to be the same issue here on v1.29, which we have been trying to get to the bottom of.
It is easily reproducible, sometimes with all 3 nodes going NotReady for a similar length of time as above.

Happy to provide further logs or also test fixes if appropriate.

@kcarson77

Also seeing this on 1.29.4 deployments with 3 nodes.
As above, can provide config, logs or test potential fixes.

@louiseschmidtgen
Contributor

Hello @cs-dsmyth and @kcarson77,
we are back-porting the fix into all supported microk8s versions (1.28-1.31).

@louiseschmidtgen louiseschmidtgen self-assigned this Aug 30, 2024
@louiseschmidtgen
Contributor

Hello @mathnitin,

the fix is now in the MicroK8s 1.28/stable channel. @cs-dsmyth, @kcarson77: for MicroK8s channels 1.28-strict to latest, the fix will make its way from beta into the stable channel by the beginning of next week.

Thank you for raising the issue, providing data, and helping with testing to reach the solution.

@louiseschmidtgen
Contributor

louiseschmidtgen commented Sep 11, 2024

Hello @mathnitin,

I would like to point you to the ipvs kube-proxy mode to address the intermittent failures you are seeing when a node is removed from your cluster. I have tested ipvs mode with your nginx scripts on a dev snap and can confirm that the failures are distributed in round-robin fashion. Unfortunately, ipvs mode currently does not work on MicroK8s 1.28 due to a Calico issue with ipset, which is addressed in a newer Calico version that will land with MicroK8s 1.32.

We will publish documentation on how to run kube-proxy in ipvs mode in MicroK8s 1.32.

@mathnitin
Author

@louiseschmidtgen Can you please point me to the PRs you merged in the dqlite repo and in the microk8s 1.28 branch? We are following the steps at https://discuss.kubernetes.io/t/howto-enable-fips-mode-operation/25067 to build the private snap package and realized the changes are not merged into the fips branch.

@louiseschmidtgen
Contributor

Hello @mathnitin,

This is the patch PR for the 1.28 (classic) microk8s: #4651.
This is the new tag in dqlite v1.1.11 with the patch: canonical/k8s-dqlite#161.
MicroK8s fips branch points to k8s-dqlite master (which has the fix): https://github.com/canonical/microk8s/blob/fips/build-scripts/components/k8s-dqlite/version.sh.

If you encounter any issues building the fips snap please open another issue and we will be happy to help you resolve them.

@louiseschmidtgen
Contributor

louiseschmidtgen commented Sep 26, 2024

Hi @mathnitin,

if you are building the fips snap, I would recommend pointing k8s-dqlite to the latest tag, v1.2.0, instead of master, as master is under active development (see the sketch below).
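
For example, a minimal sketch of pinning the component before a fips build (this assumes version.sh simply echoes the git ref to check out, as the public build scripts do; adjust to however your build tree expresses component versions):

# Check out the fips branch and pin k8s-dqlite to the v1.2.0 tag.
git clone -b fips https://github.com/canonical/microk8s.git
cd microk8s
cat > build-scripts/components/k8s-dqlite/version.sh <<'EOF'
#!/bin/bash
echo "v1.2.0"
EOF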

I hope your project goes well. Thank you again for contributing to the fix; I will be closing this issue.
