
Troubleshooting Controller Freezes #10211

Closed
GuillaumeDorschner opened this issue Jul 18, 2023 · 9 comments
Labels
needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@GuillaumeDorschner

What happened:

The controller appears to crash sporadically, with the duration between crashes varying between a day and a week. During these crashes, the controller stops functioning entirely: it ceases to produce logs and fails to route. Restarting the controller usually remedies the issue; however, on some occasions, I need to reset the entire cluster to restore functionality.

What you expected to happen:

The controller should consistently and properly route to the service.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.8.1
  Build:         dc88dce9ea5e700f3301d16f971fa17c6cfe757d
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.21.6

-------------------------------------------------------------------------------

Kubernetes version:

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:40Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: on premise

  • OS (e.g. from /etc/os-release): almalinux 8.8

  • Kernel (e.g. uname -a): Linux master 4.18.0-477.13.1.el8_8.x86_64 #1 SMP Tue May 30 14:53:41 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:
    kubeadm
    flannel
    metallb
    longhorn
    ingress nginx

  • Basic cluster related info:

    • kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:40Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
  • kubectl get nodes -o wide
NAME     STATUS   ROLES           AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                           KERNEL-VERSION                 CONTAINER-RUNTIME
master   Ready    control-plane   31m   v1.27.2   192.168.137.60   <none>        AlmaLinux 8.8 (Sapphire Caracal)   4.18.0-477.13.1.el8_8.x86_64   containerd://1.6.2
  • How was the ingress-nginx-controller installed:

The controller was installed via one of the following commands (cloud or baremetal configuration):

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml

or

 kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/baremetal/deploy.yaml

and then changing the Service type to LoadBalancer.
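For reference, the Service change described above amounts to something like the following (a sketch; the Service name and namespace are taken from the default deploy.yaml manifests, and the comment reflects my setup with MetalLB):

```yaml
# Sketch: the relevant change to the controller Service after applying the
# baremetal manifest. Only spec.type is changed from the manifest default.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer   # changed from NodePort so MetalLB assigns an external IP
```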

  • Current State of the controller:
    • kubectl describe ingressclasses
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.8.1
Annotations:  <none>
Controller:   k8s.io/ingress-nginx
Events:       <none>

The Ingresses

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────── Ingresses(all)[4] ────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ NAMESPACE↑                      NAME                               CLASS                       HOSTS                                         ADDRESS                             PORTS                        AGE                        │
│ default                         keycloak                           nginx                       auth.labo.bi                                  192.168.137.60                      80, 443                      23m                        │
│ default                         minio                              <none>                      minio.labo.bi                                 192.168.137.60                      80, 443                      18m                        │
│ default                         minio-console                      <none>                      minio-console.labo.bi                         192.168.137.60                      80, 443                      18m                        │
│ default                         nginx-ingress                      nginx                       web.labo.bi,test.labo.bi                      192.168.137.60                      80, 443                      33m                        │
@GuillaumeDorschner GuillaumeDorschner added the kind/bug Categorizes issue or PR as related to a bug. label Jul 18, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 18, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

/remove-kind bug

You can look at other info like kubectl get events -A

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 18, 2023
@GuillaumeDorschner
Author

I'm getting this:

╭─root@master ~
╰─# kubectl get events -A
NAMESPACE        LAST SEEN   TYPE      REASON             OBJECT                               MESSAGE
kube-flannel     5m8s        Warning   DNSConfigForming   pod/kube-flannel-ds-zclhl            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      4m44s       Warning   DNSConfigForming   pod/coredns-5d78c9869d-cbdbf         Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      14s         Warning   DNSConfigForming   pod/coredns-5d78c9869d-lzmzs         Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      1s          Warning   DNSConfigForming   pod/etcd-master                      Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      5m42s       Warning   DNSConfigForming   pod/kube-apiserver-master            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      4m24s       Warning   DNSConfigForming   pod/kube-controller-manager-master   Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      5m17s       Warning   DNSConfigForming   pod/kube-proxy-d2tmd                 Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
kube-system      36s         Warning   DNSConfigForming   pod/kube-scheduler-master            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1
metallb-system   3m5s        Warning   DNSConfigForming   pod/speaker-v4vwl                    Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 192.168.1.1 1.1.1.1 198.168.137.1

@strongjz
Member

Are you making changes to the DNS settings? It looks like there are changes with 1.1.1.1 in the settings. Do you have more than 3 nameservers in the config?

Ingress-nginx uses whatever settings are available in the cluster.

There is a known issue when trying to apply more than 3: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues

Linux's libc (a.k.a. glibc) has a limit for the DNS nameserver records to 3 by default and Kubernetes needs to consume 1 nameserver record.
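Given that limit, a quick sanity check is to count the `nameserver` lines in a pod's `/etc/resolv.conf` (for example via `kubectl exec <pod> -- cat /etc/resolv.conf`). A minimal sketch of the check in Python; the sample file content below is hypothetical, modeled on the nameservers listed in the event messages above:

```python
# Sketch: count `nameserver` entries in a resolv.conf and flag when they
# exceed the 3-entry limit that glibc honors (extra entries are ignored,
# which is what triggers the DNSConfigForming warnings above).
GLIBC_NAMESERVER_LIMIT = 3

def count_nameservers(resolv_conf_text: str) -> int:
    """Return the number of `nameserver` lines, ignoring comments and blanks."""
    count = 0
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if line.startswith("nameserver"):
            count += 1
    return count

# Hypothetical resolv.conf content with one nameserver too many:
sample = """\
nameserver 192.168.1.1
nameserver 1.1.1.1
nameserver 198.168.137.1
nameserver 8.8.8.8
"""

n = count_nameservers(sample)
if n > GLIBC_NAMESERVER_LIMIT:
    print(f"{n} nameservers configured; glibc will only use the first {GLIBC_NAMESERVER_LIMIT}")
```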

@GuillaumeDorschner
Author

GuillaumeDorschner commented Jul 24, 2023

Thank you for the tips @strongjz; I changed the DNS settings, and now I've got this in my /etc/resolv.conf file:

/etc/resolv.conf

nameserver 192.168.1.1
nameserver 198.168.137.1

However, my ingress doesn't seem to work, and I'm not sure why. I also noticed that I have a self-signed certificate. Could that be interfering with the ingress functionality?

My logs

│ -------------------------------------------------------------------------------                                                                                                                                                          │
│ NGINX Ingress controller                                                                                                                                                                                                                 │
│   Release:       v1.8.1                                                                                                                                                                                                                  │
│   Build:         dc88dce9ea5e700f3301d16f971fa17c6cfe757d                                                                                                                                                                                │
│   Repository:    https://github.com/kubernetes/ingress-nginx                                                                                                                    │
│   nginx version: nginx/1.21.6                                                                                                                                                                                                            │
│ -------------------------------------------------------------------------------                                                                                                                                                          │
│ W0724 14:08:15.407860       7 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.                                                                                   │
│ I0724 14:08:15.407983       7 main.go:209] "Creating API client" host="https://10.96.0.1:443"                                                                                                                                            │
│ I0724 14:08:15.417764       7 main.go:253] "Running in Kubernetes cluster" major="1" minor="27" git="v1.27.4" state="clean" commit="fa3d7990104d7c1f16943a67f11b154b71f6a132" platform="linux/amd64"                                     │
│ I0724 14:08:15.642830       7 main.go:104] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"                                                                                                │
│ I0724 14:08:15.681703       7 ssl.go:533] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"                                                                                                │
│ I0724 14:08:15.696019       7 nginx.go:261] "Starting NGINX Ingress controller"                                                                                                                                                          │
│ I0724 14:08:15.711776       7 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"ingress-nginx-controller", UID:"3d7d77de-6d33-4a66-9381-e43e3167b929", APIVersion:"v1", ResourceVersion:"698", F │
│ ieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/ingress-nginx-controller                                                                                                                                          │
│ I0724 14:08:16.799729       7 store.go:432] "Found valid IngressClass" ingress="default/example" ingressclass="nginx"                                                                                                                    │
│ I0724 14:08:16.800230       7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"example", UID:"f684aaf3-b76b-4fb3-805d-1ee941e86127", APIVersion:"networking.k8s.io/v1", ResourceVersion:"2291", FieldPa │
│ th:""}): type: 'Normal' reason: 'Sync' Scheduled for sync                                                                                                                                                                                │
│ I0724 14:08:16.897374       7 nginx.go:304] "Starting NGINX process"                                                                                                                                                                     │
│ I0724 14:08:16.897465       7 leaderelection.go:248] attempting to acquire leader lease ingress-nginx/ingress-nginx-leader...                                                                                                            │
│ I0724 14:08:16.898026       7 nginx.go:324] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"                                                                  │
│ I0724 14:08:16.898703       7 controller.go:190] "Configuration changes detected, backend reload required"                                                                                                                               │
│ I0724 14:08:16.900720       7 status.go:84] "New leader elected" identity="ingress-nginx-controller-5c778bffff-hmhkm"                                                                                                                    │
│ I0724 14:08:16.994648       7 controller.go:207] "Backend successfully reloaded"                                                                                                                                                         │
│ I0724 14:08:16.994825       7 controller.go:218] "Initial sync, sleeping for 1 second"                                                                                                                                                   │
│ I0724 14:08:16.994929       7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"ingress-nginx-controller-5c778bffff-8p5d2", UID:"a25d1e79-78e0-4f78-a538-c26facd3cf21", APIVersion:"v1", ResourceVersi │
│ on:"2331", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration                                                                                                                       │
│ I0724 14:08:54.092960       7 leaderelection.go:258] successfully acquired lease ingress-nginx/ingress-nginx-leader                                                                                                                      │
│ I0724 14:08:54.093070       7 status.go:84] "New leader elected" identity="ingress-nginx-controller-5c778bffff-8p5d2"                                                                                                                    │
│ I0724 14:08:54.107062       7 status.go:300] "updating Ingress status" namespace="default" ingress="example" currentValue=null newValue=[{"ip":"192.168.137.61"}]                                                                        │
│ I0724 14:08:54.112575       7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"example", UID:"f684aaf3-b76b-4fb3-805d-1ee941e86127", APIVersion:"networking.k8s.io/v1", ResourceVersion:"2415", FieldPa │
│ th:""}): type: 'Normal' reason: 'Sync' Scheduled for sync 

@strongjz
Member

Those logs look fine for a controller startup. Can you be more precise about what's not working?

@GuillaumeDorschner
Author

@strongjz, I apologize for the late response; I had to wait for the error to reproduce. After a certain duration, nginx stops performing its duties properly. The pod is still running, but the logs offer no insight: I can't see any GET requests reaching the pod (tested using curl and a browser, from both inside and outside the cluster). For your information, I'm using flannel, metallb, and cert-manager with my self-signed certificate.

Here are the outcomes of my traceroute and curl tests:

Traceroute

From outside the cluster:

╭─root@Admin-Pc ~
╰─# traceroute test.labo.bi
traceroute to test.labo.bi (192.168.137.10), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  *^C

From inside the cluster:

╭─root@master ~
╰─# traceroute test.labo.bi
traceroute to test.labo.bi (192.168.137.10), 30 hops max, 60 byte packets
 1  worker1 (192.168.137.61)  0.498 ms  0.458 ms  0.432 ms
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  *^C

Curl

From outside the cluster:

╭─root@Admin-Pc ~
╰─# curl test.labo.bi
curl: (7) Failed to connect to test.labo.bi port 80: Connection timed out

From inside the cluster:

╭─root@Admin-Pc ~
╰─# curl test.labo.bi
<html>
<head><title>308 Permanent Redirect</title></head>
<body>
<center><h1>308 Permanent Redirect</h1></center>
<hr><center>nginx</center>
</body>
</html>

This gives me the following logs:

┌────────────────────────────────────────────────────────────────────────────── Logs(ingress-nginx/ingress-nginx-controller-5c778bffff-jn7hv:controller)[1m] ──────────────────────────────────────────────────────────────────────────────┐
│                                                                                     Autoscroll:On     FullScreen:Off     Timestamps:Off     Wrap:Off                                                                                     │
│                                                                                                                                                                                                                                          │
│ 10.244.0.0 - - [26/Jul/2023:07:20:01 +0000] "GET / HTTP/1.1" 308 164 "-" "curl/7.61.1" 76 0.000 [default-test-80] [] - - - - 1929917177420acbd522c89f9f034d20                                                                            │

However, the browser connection from outside the cluster is still problematic.

Strange

I conducted tests on two different machines and initially faced issues. However, upon retrying on a new machine (unknown to the server), I received an HTTP 200 status code, indicating a successful request. Alongside this, I also obtained some logs, which are provided below:

┌────────────────────────────────────────────────────────────────────────────── Logs(ingress-nginx/ingress-nginx-controller-5c778bffff-jn7hv:controller)[1m] ──────────────────────────────────────────────────────────────────────────────┐
│                                                                                     Autoscroll:On     FullScreen:Off     Timestamps:Off     Wrap:Off                                                                                     │
│                                                                                                                                                                                                                                          │
│ 10.244.1.1 - - [26/Jul/2023:07:57:46 +0000] "GET / HTTP/1.1" 308 164 "http://labo.bi/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36" 453 0.000 [default-test-80] []  │
│ 10.244.1.1 - - [26/Jul/2023:07:57:49 +0000] "GET / HTTP/2.0" 200 45 "http://labo.bi/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36" 456 0.002 [default-test-80] [] 1 │

What could be the cause of these issues?

Any assistance would be much appreciated.

@GuillaumeDorschner
Author

I've observed an intriguing behavior within the pods. It seems that the resolution of the host test.labo.bi fails intermittently. Here's what I experienced when executing the curl command repeatedly:

bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi
bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi
bash-4.4$  curl -k https://test.labo.bi
<html><body><h1>It works!</h1></body></html> # Successful request here ✅
bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi
bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi
bash-4.4$  curl -k https://test.labo.bi
<html><body><h1>It works!</h1></body></html> # Another successful request here ✅
bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi
bash-4.4$  curl -k https://test.labo.bi
curl: (6) Could not resolve host: test.labo.bi

The "Could not resolve host" error occurs frequently, but occasionally a request succeeds, so the service is functional at times. This intermittent behavior is puzzling. Does anyone know why?
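One pattern consistent with this behavior (purely an illustrative sketch, not a diagnosis of this cluster) is queries being spread across several upstream resolvers where only one actually answers: most lookups fail, but the attempts that happen to hit the healthy resolver succeed. All resolver IPs below are hypothetical:

```python
import random

# Sketch: if lookups rotate across several upstream resolvers and only one
# of them answers, resolution succeeds only on the attempts that happen to
# hit the healthy one -- matching the intermittent curl results above.
RESOLVERS = {
    "10.96.0.10": True,   # healthy in-cluster DNS service IP (hypothetical)
    "192.0.2.1": False,   # unreachable upstream (hypothetical)
    "192.0.2.2": False,   # unreachable upstream (hypothetical)
}

def resolve_once(rng: random.Random) -> bool:
    """Pick a resolver at random and report whether the lookup succeeds."""
    resolver = rng.choice(list(RESOLVERS))
    return RESOLVERS[resolver]

rng = random.Random(0)
results = [resolve_once(rng) for _ in range(9)]
print(["ok" if r else "could not resolve" for r in results])
```

With one healthy resolver out of three, roughly one lookup in three succeeds over many attempts, which resembles the curl pattern above; real glibc behavior depends on `options rotate`, timeouts, and retries, so this only illustrates the shape of the failure.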

@GuillaumeDorschner
Author

The ingress controller doesn't seem to be the problem, but I'm leaning towards an issue with CoreDNS. Given this, I'm closing the issue.
