Ingress-nginx-controller v1.8.1 version will cause intermittent network requests to get stuck #10276

Closed
tony-liuliu opened this issue Aug 4, 2023 · 18 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@tony-liuliu

tony-liuliu commented Aug 4, 2023

Problem description:
After deploying the latest ingress-nginx-controller, requests to port 80 or 443 of the nginx-controller pod IP address intermittently get stuck. Even entering the ingress-nginx-controller container and running curl 127.0.0.1 shows the same hang. Please help me find out what the problem is.

Requests to all services other than the ingress-nginx-controller work normally, including the controller's own health-check port 10254.

Environmental information:
kubernetes version: 1.27.4
OS: CentOS Linux release 7.9.2009 (Core)
Linux kernel: Linux dong-k8s-90 4.20.13-1.el7.elrepo.x86_64 #1 SMP Wed Feb 27 10:02:05 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
runtime: containerd://1.7.2

Install tools:

[root@dong-k8s-90 ingress-nginx-controller]# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:14:49Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
[root@dong-k8s-90 ingress-nginx-controller]# kubectl get node -o wide
NAME              STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
dong-k8s-90   Ready    control-plane   15d   v1.27.4   10.206.60.90   <none>        CentOS Linux 7 (Core)   4.20.13-1.el7.elrepo.x86_64   containerd://1.7.2
dong-k8s-91   Ready    control-plane   15d   v1.27.4   10.206.60.91   <none>        CentOS Linux 7 (Core)   4.20.13-1.el7.elrepo.x86_64   containerd://1.7.2
dong-k8s-92   Ready    control-plane   15d   v1.27.4   10.206.60.92   <none>        CentOS Linux 7 (Core)   4.20.13-1.el7.elrepo.x86_64   containerd://1.7.2
dong-k8s-93   Ready    <none>          15d   v1.27.4   10.206.60.93   <none>        CentOS Linux 7 (Core)   4.20.13-1.el7.elrepo.x86_64   containerd://1.7.2
dong-k8s-95   Ready    <none>          15d   v1.27.4   10.206.60.95   <none>        CentOS Linux 7 (Core)   4.20.13-1.el7.elrepo.x86_64   containerd://1.7.2

CNI: calico-3.26.1 using IPIP mode, Deployment manifest used https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml

How was the ingress-nginx-controller installed:
ingress-nginx-controller version: v1.8.1 Deployment manifest used https://github.com/kubernetes/ingress-nginx/blob/main/deploy/static/provider/baremetal/deploy.yaml

Current State of the controller:

[root@dong-k8s-90 ingress-nginx-controller]# kubectl describe ingressclasses
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.8.1
Annotations:  <none>
Controller:   k8s.io/ingress-nginx
Events:       <none>
[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n ingress-nginx describe po ingress-nginx-controller-7898b9666d-7zwg6 
Name:             ingress-nginx-controller-7898b9666d-7zwg6
Namespace:        ingress-nginx
Priority:         0
Service Account:  ingress-nginx
Node:             dong-k8s-95/10.206.60.95
Start Time:       Sun, 06 Aug 2023 13:19:51 +0800
Labels:           app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=ingress-nginx
                  app.kubernetes.io/name=ingress-nginx
                  app.kubernetes.io/part-of=ingress-nginx
                  app.kubernetes.io/version=1.8.1
                  pod-template-hash=7898b9666d
Annotations:      cni.projectcalico.org/containerID: 298f9ee44d0a3ff61f7fad9ef8cdd1983a52c1b3b70780a5f7d27a1a6ecd7af4
                  cni.projectcalico.org/podIP: 10.244.158.227/32
                  cni.projectcalico.org/podIPs: 10.244.158.227/32
Status:           Running
IP:               10.244.158.227
IPs:
  IP:           10.244.158.227
Controlled By:  ReplicaSet/ingress-nginx-controller-7898b9666d
Containers:
  controller:
    Container ID:  containerd://09e4e4a164020e089e5fbd144b8d20493a545894b36f980c6c4b9311eb3c04fb
    Image:         docker.sre.com/ingress-nginx/controller:v1.8.1
    Image ID:      docker.sre.com/ingress-nginx/controller@sha256:bd54c330f73b17d0bf19f3ec3832b285d43a4c9fa5fe15f5a7accd3de706b438
    Ports:         80/TCP, 443/TCP, 8443/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --election-id=ingress-nginx-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
      --v=4
    State:          Running
      Started:      Sun, 06 Aug 2023 13:19:54 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   90Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       ingress-nginx-controller-7898b9666d-7zwg6 (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fqwfp (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-admission
    Optional:    false
  kube-api-access-fqwfp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                    From                      Message
  ----     ------       ----                   ----                      -------
  Normal   Scheduled    4m5s                   default-scheduler         Successfully assigned ingress-nginx/ingress-nginx-controller-7898b9666d-7zwg6 to dong-k8s-95
  Warning  FailedMount  3m54s (x2 over 3m55s)  kubelet                   MountVolume.SetUp failed for volume "webhook-cert" : secret "ingress-nginx-admission" not found
  Normal   Pulled       3m52s                  kubelet                   Container image "docker.sre.com/ingress-nginx/controller:v1.8.1" already present on machine
  Normal   Created      3m52s                  kubelet                   Created container controller
  Normal   Started      3m52s                  kubelet                   Started container controller
  Normal   RELOAD       3m51s                  nginx-ingress-controller  NGINX reload triggered due to a change in configuration
[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n ingress-nginx describe svc ingress-nginx-controller
Name:                     ingress-nginx-controller
Namespace:                ingress-nginx
Labels:                   app.kubernetes.io/component=controller
                          app.kubernetes.io/instance=ingress-nginx
                          app.kubernetes.io/name=ingress-nginx
                          app.kubernetes.io/part-of=ingress-nginx
                          app.kubernetes.io/version=1.8.1
Annotations:              <none>
Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.97.230.39
IPs:                      10.97.230.39
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  30882/TCP
Endpoints:                10.244.158.227:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  31057/TCP
Endpoints:                10.244.158.227:443
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

The following is the packet capture taken when the problem occurs:

The client initiates a curl request

[root@dong-k8s-90 ingress-nginx-controller]# curl 10.244.32.32 -v
* About to connect() to 10.244.32.32 port 80 (#0)
* Trying 10.244.32.32...
* Connected to 10.244.32.32 (10.244.32.32) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.244.32.32
> Accept: */*
>

The request stays stuck in this state and never returns.

PS: Because the pod has since been restarted, the pod IP shown here differs from the one in the describe output above, so the captured addresses are different.

Packets captured on the client side:

[root@dong-k8s-90 ingress-nginx-controller]# tcpdump -nn -n -i tunl0 host 10.244.32.32 and port 80 -e -v
tcpdump: listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
17:30:45.367189 ip: (tos 0x0, ttl 64, id 2003, offset 0, flags [DF], proto TCP (6), length 60)
     10.244.137.192.19066 > 10.244.32.32.80: Flags [S], cksum 0xbff6 (incorrect -> 0x7284), seq 1217195127, win 64800, options [mss 1440,sackOK,TS val 2772693908 ecr 0,nop,wscale 7], length 0
17:30:45.367699 ip: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
     10.244.32.32.80 > 10.244.137.192.19066: Flags [S.], cksum 0xa821 (correct), seq 402895697, ack 1217195128, win 64260, options [mss 1440,sackOK,TS val 78445676 ecr 2772693908,nop,wscale 7], length 0
17:30:45.367810 ip: (tos 0x0, ttl 64, id 2004, offset 0, flags [DF], proto TCP (6), length 52)
     10.244.137.192.19066 > 10.244.32.32.80: Flags [.], cksum 0xbfee (incorrect -> 0xcfe2), ack 1, win 507, options [nop,nop,TS val 2772693909 ecr 78445676], length 0
17:30:45.367949 ip: (tos 0x0, ttl 64, id 2005, offset 0, flags [DF], proto TCP (6), length 128)
     10.244.137.192.19066 > 10.244.32.32.80: Flags [P.], cksum 0xc03a (incorrect -> 0x806e), seq 1:77, ack 1, win 507, options [nop,nop,TS val 2772693909 ecr 78445676], length 76: HTTP, length: 76
         GET / HTTP/1.1
         User-Agent: curl/7.29.0
         Host: 10.244.32.32
         Accept: */*

17:30:45.368698 ip: (tos 0x0, ttl 63, id 33244, offset 0, flags [DF], proto TCP (6), length 52)
     10.244.32.32.80 > 10.244.137.192.19066: Flags [.], cksum 0xcf9a (correct), ack 77, win 502, options [nop,nop,TS val 78445677 ecr 2772693909], length 0
17:30:55.449188 ip: (tos 0x0, ttl 64, id 2006, offset 0, flags [DF], proto TCP (6), length 52)
     10.244.137.192.19066 > 10.244.32.32.80: Flags [F.], cksum 0xbfee (incorrect -> 0xa833), seq 77, ack 1, win 507, options [nop,nop,TS val 2772703990 ecr 78445677], length 0
17:30:55.490585 ip: (tos 0x0, ttl 63, id 33245, offset 0, flags [DF], proto TCP (6), length 52)
     10.244.32.32.80 > 10.244.137.192.19066: Flags [.], cksum 0x80ae (correct), ack 78, win 502, options [nop,nop,TS val 78455799 ecr 2772703990], length 0

Packet capture inside the ingress-nginx-controller container's network namespace:

[root@dong-k8s-93 ~]# ps -ef|grep nginx
101 15699 15227 0 16:51 ? 00:00:00 /usr/bin/dumb-init -- /nginx-ingress-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress-nginx/ingress-nginx-controller --validating-webhook=:8443 --validating-webhook-certificate=/usr/local/certificates/cert --validating-webhook-key=/usr/local/certificates/key
101 15833 15699 0 16:51 ? 00:00:03 /nginx-ingress-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress-nginx/ingress-nginx-controller --validating-webhook=:8443 --validating-webhook-certificate=/usr/local/certificates/cert --validating-webhook-key=/usr/local/certificates/key
101 16546 15833 0 16:51 ? 00:00:00 nginx: master process /usr/bin/nginx -c /etc/nginx/nginx.conf
[root@dong-k8s-93 ~]# nsenter -n -t 15699
[root@dong-k8s-93 ~]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1480
         inet 10.244.32.32 netmask 255.255.255.255 broadcast 0.0.0.0
         inet6 fe80::4c68:83ff:fe5d:687e prefixlen 64 scopeid 0x20<link>
         ether 4e:68:83:5d:68:7e txqueuelen 1000 (Ethernet)
         RX packets 12056 bytes 3037628 (2.8 MiB)
         RX errors 0 dropped 0 overruns 0 frame 0
         TX packets 10075 bytes 1263907 (1.2 MiB)
         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
         inet 127.0.0.1 netmask 255.0.0.0
         inet6 ::1 prefixlen 128 scopeid 0x10<host>
         loop txqueuelen 1000 (Local Loopback)
         RX packets 15365 bytes 1243138 (1.1 MiB)
         RX errors 0 dropped 0 overruns 0 frame 0
         TX packets 15365 bytes 1243138 (1.1 MiB)
         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@dong-k8s-93 ~]# tcpdump -nn -n port 80 -e -v
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:30:52.367684 ee:ee:ee:ee:ee:ee > 4e:68:83:5d:68:7e, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 2003, offset 0, flags [DF], proto TCP (6), length 60)
    10.244.137.192.19066 > 10.244.32.32.80: Flags [S], cksum 0x7284 (correct), seq 1217195127, win 64800, options [mss 1440,sackOK,TS val 2772693908 ecr 0,nop,wscale 7], length 0
17:30:52.367761 4e:68:83:5d:68:7e > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.244.32.32.80 > 10.244.137.192.19066: Flags [S.], cksum 0xbff6 (incorrect -> 0xa821), seq 402895697, ack 1217195128, win 64260, options [mss 1440,sackOK,TS val 78445676 ecr 2772693908,nop,wscale 7], length 0
17:30:52.368114 ee:ee:ee:ee:ee:ee > 4e:68:83:5d:68:7e, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 63, id 2004, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.137.192.19066 > 10.244.32.32.80: Flags [.], cksum 0xcfe2 (correct), ack 1, win 507, options [nop,nop,TS val 2772693909 ecr 78445676], length 0
17:30:52.368615 ee:ee:ee:ee:ee:ee > 4e:68:83:5d:68:7e, ethertype IPv4 (0x0800), length 142: (tos 0x0, ttl 63, id 2005, offset 0, flags [DF], proto TCP (6), length 128)
    10.244.137.192.19066 > 10.244.32.32.80: Flags [P.], cksum 0x806e (correct), seq 1:77, ack 1, win 507, options [nop,nop,TS val 2772693909 ecr 78445676], length 76: HTTP, length: 76
        GET / HTTP/1.1
        User-Agent: curl/7.29.0
        Host: 10.244.32.32
        Accept: */*
17:30:52.368641 4e:68:83:5d:68:7e > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 33244, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.32.32.80 > 10.244.137.192.19066: Flags [.], cksum 0xbfee (incorrect -> 0xcf9a), ack 77, win 502, options [nop,nop,TS val 78445677 ecr 2772693909], length 0
17:31:02.449630 ee:ee:ee:ee:ee:ee > 4e:68:83:5d:68:7e, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 63, id 2006, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.137.192.19066 > 10.244.32.32.80: Flags [F.], cksum 0xa833 (correct), seq 77, ack 1, win 507, options [nop,nop,TS val 2772703990 ecr 78445677], length 0
17:31:02.490541 4e:68:83:5d:68:7e > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 33245, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.32.32.80 > 10.244.137.192.19066: Flags [.], cksum 0xbfee (incorrect -> 0x80ae), ack 78, win 502, options [nop,nop,TS val 78455799 ecr 2772703990], length 0

Every time this happens, the client stays stuck like this, and it happens very frequently. Please help me find out what is causing the problem.

@tony-liuliu tony-liuliu added the kind/bug Categorizes issue or PR as related to a bug. label Aug 4, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Aug 4, 2023
@tao12345666333
Member

Have you tried testing the network in your cluster first? For example, without ingress-nginx?

@longwuyuan
Contributor

/remove-kind bug

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Aug 4, 2023
@tony-liuliu
Author

tony-liuliu commented Aug 5, 2023

Have you tried testing the network in your cluster first? For example, without ingress-nginx

Yes, I'm sure. Access to every service other than ingress-nginx is completely normal; there is no such network hang.

Test Results:

[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n kubernetes-dashboard get pod -o wide
NAME                                                    READY   STATUS    RESTARTS   AGE   IP             NODE              NOMINATED NODE   READINESS GATES
kubernetes-dashboard-api-949ddd7bb-6qzpp                1/1     Running   0          18h   10.244.32.42   dong-k8s-93   <none>           <none>
kubernetes-dashboard-metrics-scraper-6c6c7b7cf4-5fk8r   1/1     Running   0          18h   10.244.32.38   dong-k8s-93   <none>           <none>
kubernetes-dashboard-web-5476467fcc-vhcv7               1/1     Running   0          18h   10.244.32.36   dong-k8s-93   <none>           <none>
[root@dong-k8s-90 ingress-nginx-controller]# time for i in `seq 1 1000`;do echo $i;curl -I http://10.244.32.42:9000/api/;done
......
1000
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Sat, 05 Aug 2023 03:11:21 GMT
Content-Length: 19

real    0m6.018s
user    0m2.126s
sys     0m3.329s

@longwuyuan
Contributor

longwuyuan commented Aug 6, 2023

@tony-liuliu There are no answers to the questions asked in the issue template, so everything you are saying here assumes that your cluster and environment are in a 100% perfect, acceptable state. It also assumes that your installation of the ingress-nginx controller is 100% perfect. That does not work when a deep dive is required.

Please provide the details as asked in a new-issue template.

@allandegnan

allandegnan commented Aug 6, 2023

I have the same issue with the latest Helm chart. Everything else works besides ingress-nginx. It works sometimes; other times it holds the connection open and nothing happens.

I will respond with the issue-template answers later in the day.

@tao12345666333
Member

any logs?

@tony-liuliu
Author

tony-liuliu commented Aug 7, 2023

After testing today, I found that the reason the nginx-controller network gets stuck intermittently may be related to this:

The KVM virtual machine node running the nginx-controller pod has 16 CPU cores. The default worker-processes setting is auto, so normally 16 worker processes should be created, but only 13 are actually created here.

In other words, when worker-processes defaults to auto (16), the nginx-controller network hangs intermittently. Testing shows that only 13 worker processes are actually created, and this mismatch appears to be the main cause of the problem.

[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n ingress-nginx exec -it ingress-nginx-controller-7d6797bbcb-pgdj7 sh
/etc/nginx $ head 10 /etc/nginx/nginx.conf
head: 10: No such file or directory

==> /etc/nginx/nginx.conf <==

# Configuration checksum: 15638244883250834871

# setup custom paths that do not require root access
pid /tmp/nginx/nginx.pid;

daemon off;

worker_processes 16;
/etc/nginx $ ps -ef
PID   USER     TIME  COMMAND
    1 www-data  0:00 /usr/bin/dumb-init -- /nginx-ingress-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress-nginx/ingress-nginx-controller --validating-webhook=:8443 --validating-webhook-certificate=/usr/local/certificates/cert --vali
    7 www-data  0:01 /nginx-ingress-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress-nginx/ingress-nginx-controller --validating-webhook=:8443 --validating-webhook-certificate=/usr/local/certificates/cert --validating-webhook-key=/us
   33 www-data  0:00 nginx: master process /usr/bin/nginx -c /etc/nginx/nginx.conf
   38 www-data  0:00 nginx: worker process
   39 www-data  0:00 nginx: worker process
   40 www-data  0:00 nginx: worker process
   41 www-data  0:00 nginx: worker process
   42 www-data  0:00 nginx: worker process
   43 www-data  0:00 nginx: worker process
   44 www-data  0:00 nginx: worker process
   45 www-data  0:00 nginx: worker process
   46 www-data  0:00 nginx: worker process
   47 www-data  0:00 nginx: worker process
   48 www-data  0:00 nginx: worker process
   49 www-data  0:00 nginx: worker process
   50 www-data  0:00 nginx: worker process
   64 www-data  0:00 nginx: cache manager process
  517 www-data  0:00 sh
  536 www-data  0:00 ps -ef
/etc/nginx $ ps -ef|grep 'worker process'|grep -v grep|wc -l
13

When I manually set worker-processes to 13 or fewer, network requests through the controller behave normally:

[root@dong-k8s-90 ingress-nginx-controller]# vim ingress-nginx-controller-1.8.1.yaml
......
---
apiVersion: v1
data:
  allow-snippet-annotations: "true"
  worker-processes: "13"
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
    app.kubernetes.io/version: 1.8.1
  name: ingress-nginx-controller
  namespace: ingress-nginx
......

[root@dong-k8s-90 ingress-nginx-controller]# kubectl apply -f ingress-nginx-controller-1.8.1.yaml 
namespace/ingress-nginx unchanged
serviceaccount/ingress-nginx unchanged
serviceaccount/ingress-nginx-admission unchanged
role.rbac.authorization.k8s.io/ingress-nginx unchanged
role.rbac.authorization.k8s.io/ingress-nginx-admission unchanged
clusterrole.rbac.authorization.k8s.io/ingress-nginx unchanged
clusterrole.rbac.authorization.k8s.io/ingress-nginx-admission unchanged
rolebinding.rbac.authorization.k8s.io/ingress-nginx unchanged
rolebinding.rbac.authorization.k8s.io/ingress-nginx-admission unchanged
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx unchanged
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx-admission unchanged
configmap/ingress-nginx-controller configured
service/ingress-nginx-controller unchanged
service/ingress-nginx-controller-admission unchanged
deployment.apps/ingress-nginx-controller configured
job.batch/ingress-nginx-admission-create unchanged
job.batch/ingress-nginx-admission-patch unchanged
ingressclass.networking.k8s.io/nginx unchanged
validatingwebhookconfiguration.admissionregistration.k8s.io/ingress-nginx-admission configured

[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller 
deployment.apps/ingress-nginx-controller restarted

[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n ingress-nginx get pod -o wide
NAME                                        READY   STATUS      RESTARTS   AGE   IP               NODE              NOMINATED NODE   READINESS GATES
ingress-nginx-admission-create-58w7p        0/1     Completed   0          28h   10.244.158.225   dong-k8s-95   <none>           <none>
ingress-nginx-admission-patch-ctgjm         0/1     Completed   0          28h   10.244.158.226   dong-k8s-95   <none>           <none>
ingress-nginx-controller-74597567dd-njqzp   1/1     Running     0          16s   10.244.158.232   dong-k8s-95   <none>           <none>
[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n ingress-nginx exec -it ingress-nginx-controller-74597567dd-njqzp sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/etc/nginx $ ps -ef
PID   USER     TIME  COMMAND
    1 www-data  0:00 /usr/bin/dumb-init -- /nginx-ingress-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress-nginx/ingress-nginx-controller --validating-webhook=:8443 --validating-webhook-certificate=/usr/local/certificates/cert --vali
    7 www-data  0:01 /nginx-ingress-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress-nginx/ingress-nginx-controller --validating-webhook=:8443 --validating-webhook-certificate=/usr/local/certificates/cert --validating-webhook-key=/us
   32 www-data  0:00 nginx: master process /usr/bin/nginx -c /etc/nginx/nginx.conf
   37 www-data  0:00 nginx: worker process
   38 www-data  0:00 nginx: worker process
   39 www-data  0:00 nginx: worker process
   40 www-data  0:00 nginx: worker process
   41 www-data  0:00 nginx: worker process
   42 www-data  0:00 nginx: worker process
   43 www-data  0:00 nginx: worker process
   44 www-data  0:00 nginx: worker process
   45 www-data  0:00 nginx: worker process
   46 www-data  0:00 nginx: worker process
   47 www-data  0:00 nginx: worker process
   48 www-data  0:00 nginx: worker process
   49 www-data  0:00 nginx: worker process
   50 www-data  0:00 nginx: cache manager process
   53 www-data  0:00 nginx: cache loader process
  468 www-data  0:00 sh
  474 www-data  0:00 ps -ef
/etc/nginx $ ps -ef|grep 'worker process'|grep -v grep|wc -l
13

I kept adjusting the value of worker-processes and found that as long as the configured worker-processes value matches the number of worker processes actually created, the intermittent network hangs do not occur.
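
For anyone who wants to check for the same mismatch, here is a minimal sketch (assuming the stock controller image, which ships BusyBox ps and keeps its config at /etc/nginx/nginx.conf) that compares the configured worker count with the number of worker processes actually running:

# run inside the controller pod, e.g.:
#   kubectl -n ingress-nginx exec -it <controller-pod> -- sh
CONFIGURED=$(grep -E '^worker_processes' /etc/nginx/nginx.conf | tr -dc '0-9')
ACTUAL=$(ps -ef | grep -c '[w]orker process')   # [w] keeps grep from matching itself
echo "configured=$CONFIGURED actual=$ACTUAL"    # the two numbers should match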

@allandegnan

This comment was marked as outdated.

@allandegnan

I kept adjusting the value of worker-processes and found that as long as the configured worker-processes value matches the number of worker processes actually created, the intermittent network hangs do not occur.

Can confirm. My config had worker-processes at 16, but the container only had 8. Fixing the setting makes the issue go away.

  • My setup is in the comment above marked as outdated.
  • It's quite different from the OP's (Debian, Podman, kind, Cilium).
  • Rolling back just ingress-nginx did not fix the issue.

@tao12345666333
Member

If your issue can be solved by adjusting worker-processes, then you need to consider things such as load, the network card, interrupt handling, etc.
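
As a rough sketch of what such checks could look like on the node that hosts the controller pod (these are generic Linux commands, nothing specific to ingress-nginx; eth0 is just a placeholder for the actual NIC name):

nproc                              # CPUs the kernel exposes to the OS
uptime                             # load average on the node
head -5 /proc/softirqs             # how softirq work is spread across CPUs
grep -i eth0 /proc/interrupts      # whether NIC interrupts all land on one CPU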

@github-actions

github-actions bot commented Sep 7, 2023

This is stale, but we won't close it automatically, just bear in mind the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any question or request to prioritize this, please reach out to #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 7, 2023
@intonet

intonet commented Feb 1, 2024

I kept adjusting the value of worker-processes and found that as long as the configured worker-processes value matches the number of worker processes actually created, the intermittent network hangs do not occur.

Same here.
The problem occurred after bumping the nodes from 8 to 16 vCPUs. Setting worker-processes to 8 resolves the problem.

In other words, when worker-processes defaults to auto (16), the nginx-controller network hangs intermittently. Testing shows that only 13 worker processes are actually created, and this mismatch appears to be the main cause of the problem.

I had exactly 13 workers, as mentioned above.

my setup:
Proxmox 8.0.3
10x VM 16cpu 16GB ram, Ubuntu 22.04.3 LTS
K8S: v1.26.9+rke2r1 with cilium network plugin
ingress-controller installed by Helm chart v4.9.1 (nginx version: nginx/1.21.6)
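
For Helm installs like this one, the worker count can be pinned through the chart's controller.config map, which is rendered into the controller ConfigMap; a minimal sketch, assuming the official chart and using 8 only as an example value:

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  --set-string controller.config.worker-processes=8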

@ataut-pai

We are still hitting this, not sure why, but strangely the intermittent failures NEVER happen when we set the replica count to 1. Unfortunately, we cannot find any relevant explanation for this behaviour.

@longwuyuan
Contributor

Issue was solved so closing.

/close

@k8s-ci-robot
Contributor

@longwuyuan: Closing this issue.

In response to this:

Issue was solved so closing.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ataut-pai

ataut-pai commented Sep 16, 2024

@longwuyuan sorry, how come it was solved? Can you please point us to the PR fixing this? Thank you! 💯

@longwuyuan
Contributor

Adjusting workers as mentioned here #10276 (comment)

If that is not the case, then kindly re-open the issue after posting information that can be analyzed. Please use a kind cluster to reproduce the issue. Please use Helm to install the controller, and please provide the values file used to install it. You can also fork the project, create a branch and clone the branch locally; then, from the root of the local clone, you can run make dev-env to create a cluster automatically with the controller installed. Then you can run your tests locally and provide all the commands you executed and all the manifests you used, etc., so that a reader here can reproduce your test exactly. Thanks.
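
As a minimal sketch of that reproduction setup (the chart name and repo are the official ones; the values file content is only an illustrative example):

# 1. throwaway cluster
kind create cluster --name ingress-repro

# 2. values file to attach to the issue (example content only)
cat > values.yaml <<'EOF'
controller:
  config:
    worker-processes: "8"
EOF

# 3. install the controller with Helm using that values file
helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  -f values.yaml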
