
Troubleshooting Controller Freezes #10211

Closed
GuillaumeDorschner opened this issue Jul 18, 2023 · 9 comments
Labels
needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@GuillaumeDorschner

What happened:

The controller appears to crash sporadically, with the duration between crashes varying between a day and a week. During these crashes, the controller stops functioning entirely: it ceases to produce logs and fails to route. Restarting the controller usually remedies the issue; however, on some occasions, I need to reset the entire cluster to restore functionality.

What you expected to happen:

The controller should consistently and properly route to the service.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.8.1
  Build:         dc88dce9ea5e700f3301d16f971fa17c6cfe757d
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.21.6

-------------------------------------------------------------------------------

Kubernetes version:

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:40Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: on premise

  • OS (e.g. from /etc/os-release): almalinux 8.8

  • Kernel (e.g. uname -a): Linux master 4.18.0-477.13.1.el8_8.x86_64 #1 SMP Tue May 30 14:53:41 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:
    kubeadm
    flannel
    metallb
    longhorn
    ingress nginx

  • Basic cluster related info:

    • kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:40Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
  • kubectl get nodes -o wide
NAME     STATUS   ROLES           AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                           KERNEL-VERSION                 CONTAINER-RUNTIME
master   Ready    control-plane   31m   v1.27.2   192.168.137.60   <none>        AlmaLinux 8.8 (Sapphire Caracal)   4.18.0-477.13.1.el8_8.x86_64   containerd://1.6.2
  • How was the ingress-nginx-controller installed:

The controller was installed via one of the following commands (cloud or baremetal configuration):

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml

or

 kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/baremetal/deploy.yaml

and then changing the Service type to LoadBalancer.
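For reference, the Service change described above amounts to something like the following (a sketch; the Service name and namespace are taken from the default deploy.yaml manifests, and the comment reflects my setup with MetalLB):

```yaml
# Sketch: the relevant change to the controller Service after applying the
# baremetal manifest. Only spec.type is changed from the manifest default.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer   # changed from NodePort so MetalLB assigns an external IP
```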

  • Current State of the controller:
    • kubectl describe ingressclasses
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.8.1
Annotations:  <none>
Controller:   k8s.io/ingress-nginx
Events:       <none>

The Ingresses

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────── Ingresses(all)[4] ────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ NAMESPACE↑                      NAME                               CLASS                       HOSTS                                         ADDRESS                             PORTS                        AGE                        │
│ default                         keycloak                           nginx                       auth.labo.bi                                  192.168.137.60                      80, 443                      23m                        │
│ default                         minio                              <none>                      minio.labo.bi                                 192.168.137.60                      80, 443                      18m                        │
│ default                         minio-console                      <none>                      minio-console.labo.bi                         192.168.137.60                      80, 443                      18m                        │
│ default                         nginx-ingress                      nginx                       web.labo.bi,test.labo.bi                      192.168.137.60                      80, 443                      33m                        │
@GuillaumeDorschner GuillaumeDorschner added the kind/bug Categorizes issue or PR as related to a bug. label Jul 18, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 18, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

/remove-kind bug

You can look at other info like kubectl get events -A

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 18, 2023
@GuillaumeDorschner
Author

I'm getting this:

╭─root@master ~
╰─# kubectl get events -A
NAMESPACE        LAST SEEN   TYPE      REASON             OBJECT                               MESSAGE
kube-flannel     5m8s        Warning   DNSConfigForming   pod/kube-flannel-ds-zclhl            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      4m44s       Warning   DNSConfigForming   pod/coredns-5d78c9869d-cbdbf         Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      14s         Warning   DNSConfigForming   pod/coredns-5d78c9869d-lzmzs         Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      1s          Warning   DNSConfigForming   pod/etcd-master                      Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      5m42s       Warning   DNSConfigForming   pod/kube-apiserver-master            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      4m24s       Warning   DNSConfigForming   pod/kube-controller-manager-master   Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      5m17s       Warning   DNSConfigForming   pod/kube-proxy-d2tmd                 Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      36s         Warning   DNSConfigForming   pod/kube-scheduler-master            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
metallb-system   3m5s        Warning   DNSConfigForming   pod/speaker-v4vwl                    Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1

@strongjz
Member

Are you making changes to the DNS settings? It looks like there are changes with 1.1.1.1 in the settings. Do you have more than 3 nameservers in the config?

Ingress-nginx uses whatever settings are available in the cluster.

There is a known issue when trying to apply more than 3: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues

Linux's libc (a.k.a. glibc) has a limit for the DNS nameserver records to 3 by default and Kubernetes needs to consume 1 nameserver record.
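Given that limit, a quick sanity check is to count the `nameserver` lines in a pod's `/etc/resolv.conf` (for example via `kubectl exec <pod> -- cat /etc/resolv.conf`). A minimal sketch of the check in Python; the sample file content below is hypothetical, modeled on the nameservers listed in the event messages above:

```python
# Sketch: count `nameserver` entries in a resolv.conf and flag when they
# exceed the 3-entry limit that glibc honors (extra entries are ignored,
# which is what triggers the DNSConfigForming warnings above).
GLIBC_NAMESERVER_LIMIT = 3

def count_nameservers(resolv_conf_text: str) -> int:
    """Return the number of `nameserver` lines, ignoring comments and blanks."""
    count = 0
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if line.startswith("nameserver"):
            count += 1
    return count

# Hypothetical resolv.conf content with one nameserver too many:
sample = """\
nameserver 192.168.1.1
nameserver 1.1.1.1
nameserver 198.168.137.1
nameserver 8.8.8.8
"""

n = count_nameservers(sample)
if n > GLIBC_NAMESERVER_LIMIT:
    print(f"{n} nameservers configured; glibc will only use the first {GLIBC_NAMESERVER_LIMIT}")
```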

@GuillaumeDorschner
Author

GuillaumeDorschner commented Jul 24, 2023

Thank you for the tips @strongjz; I changed the DNS settings, and now I've got this in my /etc/resolv.conf file:

/etc/resolv.conf

nameserver 192.168.1.1
nameserver 198.168.137.1

However, my ingress doesn't seem to work, and I'm not sure why. I also noticed that I have a self-signed certificate. Could that be interfering with the ingress functionality?

My logs

│ -------------------------------------------------------------------------------                                                                                                                                                          │
│ NGINX Ingress controller                                                                                                                                                                                                                 │
│   Release:       v1.8.1                                                                                                                                                                                                                  │
│   Build:         dc88dce9ea5e700f3301d16f971fa17c6cfe757d                                                                                                                                                                                │
│   Repository:    https://github.com/kubernetes/ingress-nginx                                                                                                                    │
│   nginx version: nginx/1.21.6                                                                                                                                                                                                            │
│ -------------------------------------------------------------------------------                                                                                                                                                          │
│ W0724 14:08:15.407860       7 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.                                                                                   │
│ I0724 14:08:15.407983       7 main.go:209] "Creating API client" host="https://10.96.0.1:443"                                                                                                                                            │
│ I0724 14:08:15.417764       7 main.go:253] "Running in Kubernetes cluster" major="1" minor="27" git="v1.27.4" state="clean" commit="fa3d7990104d7c1f16943a67f11b154b71f6a132" platform="linux/amd64"                                     │
│ I0724 14:08:15.642830       7 main.go:104] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"                                                                                                │
│ I0724 14:08:15.681703       7 ssl.go:533] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"                                                                                                │
│ I0724 14:08:15.696019       7 nginx.go:261] "Starting NGINX Ingress controller"                                                                                                                                                          │
│ I0724 14:08:15.711776       7 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"ingress-nginx-controller", UID:"3d7d77de-6d33-4a66-9381-e43e3167b929", APIVersion:"v1", ResourceVersion:"698", F │
│ ieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/ingress-nginx-controller                                                                                                                                          │
│ I0724 14:08:16.799729       7 store.go:432] "Found valid IngressClass" ingress="default/example" ingressclass="nginx"                                                                                                                    │
│ I0724 14:08:16.800230       7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"example", UID:"f684aaf3-b76b-4fb3-805d-1ee941e86127", APIVersion:"networking.k8s.io/v1", ResourceVersion:"2291", FieldPa │
│ th:""}): type: 'Normal' reason: 'Sync' Scheduled for sync                                                                                                                                                                                │
│ I0724 14:08:16.897374       7 nginx.go:304] "Starting NGINX process"                                                                                                                                                                     │
│ I0724 14:08:16.897465       7 leaderelection.go:248] attempting to acquire leader lease ingress-nginx/ingress-nginx-leader...                                                                                                            │
│ I0724 14:08:16.898026       7 nginx.go:324] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"                                                                  │
│ I0724 14:08:16.898703       7 controller.go:190] "Configuration changes detected, backend reload required"                                                                                                                               │
│ I0724 14:08:16.900720       7 status.go:84] "New leader elected" identity="ingress-nginx-controller-5c778bffff-hmhkm"                                                                                                                    │
│ I0724 14:08:16.994648       7 controller.go:207] "Backend successfully reloaded"                                                                                                                                                         │
│ I0724 14:08:16.994825       7 controller.go:218] "Initial sync, sleeping for 1 second"                                                                                                                                                   │
│ I0724 14:08:16.994929       7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"ingress-nginx-controller-5c778bffff-8p5d2", UID:"a25d1e79-78e0-4f78-a538-c26facd3cf21", APIVersion:"v1", ResourceVersi │
│ on:"2331", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration                                                                                                                       │
│ I0724 14:08:54.092960       7 leaderelection.go:258] successfully acquired lease ingress-nginx/ingress-nginx-leader                                                                                                                      │
│ I0724 14:08:54.093070       7 status.go:84] "New leader elected" identity="ingress-nginx-controller-5c778bffff-8p5d2"                                                                                                                    │
│ I0724 14:08:54.107062       7 status.go:300] "updating Ingress status" namespace="default" ingress="example" currentValue=null newValue=[{"ip":"192.168.137.61"}]                                                                        │
│ I0724 14:08:54.112575       7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"example", UID:"f684aaf3-b76b-4fb3-805d-1ee941e86127", APIVersion:"networking.k8s.io/v1", ResourceVersion:"2415", FieldPa │
│ th:""}): type: 'Normal' reason: 'Sync' Scheduled for sync 

@strongjz
Member

Those logs look fine for a controller startup. Can you be more precise about what's not working?

@GuillaumeDorschner
Author

@strongjz, I apologize for the late response; I had to wait for the error to reproduce. After a certain duration, nginx stops performing its duties properly. The pod is still running, but the logs offer no insight: I can't see any GET requests reaching the pod (tested using curl and a browser, from both inside and outside the cluster). For your information, I'm using flannel, metallb, and cert-manager with my self-signed certificate.

Here are the outcomes of my traceroute and curl tests:

Traceroute

From outside the cluster:

╭─root@Admin-Pc ~
╰─# traceroute test.labo.bi
traceroute to test.labo.bi (192.168.137.10), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  *^C

From inside the cluster:

╭─root@master ~
╰─# traceroute test.labo.bi
traceroute to test.labo.bi (192.168.137.10), 30 hops max, 60 byte packets
 1  worker1 (192.168.137.61)  0.498 ms  0.458 ms  0.432 ms
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  *^C

Curl

From outside the cluster:

╭─root@Admin-Pc ~
╰─# curl test.labo.bi
curl: (7) Failed to connect to test.labo.bi port 80: Connection timed out

From inside the cluster:

╭─root@Admin-Pc ~
╰─# curl test.labo.bi
<html>
<head><title>308 Permanent Redirect</title></head>
<body>
<center><h1>308 Permanent Redirect</h1></center>
<hr><center>nginx</center>
</body>
</html>

This gives me the following logs:

┌────────────────────────────────────────────────────────────────────────────── Logs(ingress-nginx/ingress-nginx-controller-5c778bffff-jn7hv:controller)[1m] ──────────────────────────────────────────────────────────────────────────────┐
│                                                                                     Autoscroll:On     FullScreen:Off     Timestamps:Off     Wrap:Off                                                                                     │
│                                                                                                                                                                                                                                          │
│ 10.244.0.0 - - [26/Jul/2023:07:20:01 +0000] "GET / HTTP/1.1" 308 164 "-" "curl/7.61.1" 76 0.000 [default-test-80] [] - - - - 1929917177420acbd522c89f9f034d20                                                                            │

However, the browser connection from outside the cluster is still problematic.

Strange

I conducted tests on two different machines and initially faced issues. However, upon retrying on a new machine (unknown to the server), I received an HTTP 200 status code, indicating a successful request. Alongside this, I also obtained some logs, which are provided below:

┌────────────────────────────────────────────────────────────────────────────── Logs(ingress-nginx/ingress-nginx-controller-5c778bffff-jn7hv:controller)[1m] ──────────────────────────────────────────────────────────────────────────────┐
│                                                                                     Autoscroll:On     FullScreen:Off     Timestamps:Off     Wrap:Off                                                                                     │
│                                                                                                                                                                                                                                          │
│ 10.244.1.1 - - [26/Jul/2023:07:57:46 +0000] "GET / HTTP/1.1" 308 164 "http://labo.bi/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36" 453 0.000 [default-test-80] []  │
│ 10.244.1.1 - - [26/Jul/2023:07:57:49 +0000] "GET / HTTP/2.0" 200 45 "http://labo.bi/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36" 456 0.002 [default-test-80] [] 1 │

What could be the cause of these issues?

Any assistance would be much appreciated.

@GuillaumeDorschner
Author

I've observed an intriguing behavior within the pods. It seems that the resolution of the host test.labo.bi fails intermittently. Here's what I experienced when executing the curl command repeatedly:

bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi
bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi
bash-4.4$  curl -k https://test.labo.bi
<html><body><h1>It works!</h1></body></html> # Successful request here ✅
bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi
bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi
bash-4.4$  curl -k https://test.labo.bi
<html><body><h1>It works!</h1></body></html> # Another successful request here ✅
bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi
bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi

The "Could not resolve host" error occurs frequently, but occasionally a request succeeds, so the service is functional at times. This intermittent behavior is puzzling. Does anyone know why?
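One pattern consistent with this behavior (purely an illustrative sketch, not a diagnosis of this cluster) is queries being spread across several upstream resolvers where only one actually answers: most lookups fail, but the attempts that happen to hit the healthy resolver succeed. All resolver IPs below are hypothetical:

```python
import random

# Sketch: if lookups rotate across several upstream resolvers and only one
# of them answers, resolution succeeds only on the attempts that happen to
# hit the healthy one -- matching the intermittent curl results above.
RESOLVERS = {
    "10.96.0.10": True,   # healthy in-cluster DNS service IP (hypothetical)
    "192.0.2.1": False,   # unreachable upstream (hypothetical)
    "192.0.2.2": False,   # unreachable upstream (hypothetical)
}

def resolve_once(rng: random.Random) -> bool:
    """Pick a resolver at random and report whether the lookup succeeds."""
    resolver = rng.choice(list(RESOLVERS))
    return RESOLVERS[resolver]

rng = random.Random(0)
results = [resolve_once(rng) for _ in range(9)]
print(["ok" if r else "could not resolve" for r in results])
```

With one healthy resolver out of three, roughly one lookup in three succeeds over many attempts, which resembles the curl pattern above; real glibc behavior depends on `options rotate`, timeouts, and retries, so this only illustrates the shape of the failure.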

@GuillaumeDorschner
Author

The ingress controller doesn't seem to be the problem, but I'm leaning towards an issue with CoreDNS. Given this, I'm closing the issue.
