
NGINX controller on kubernetes returns 404 "service not found" in stress-test scenario with ~200 ingress objects #9495

Closed
ctheodoropoulos opened this issue Jan 9, 2023 · 4 comments
Labels
needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@ctheodoropoulos

What happened:

I started receiving 404 "Not Found" responses from nginx when the number of ingress resources in the cluster rises during a stress test (~200 new ingress objects created within 2-3 minutes).
The ingresses are created by the deployment of separate applications, each of which contains an ingress resource.
In the normal scenario, nginx serves my applications correctly via the created DNS after 1-3 minutes (the time needed for DNS record creation).
In the stress-test scenario, requests return 404 "Not Found" for more than 40 minutes before all applications are finally served correctly.
After the stress test, with ~200 ingress objects in place, new ingress objects from subsequent deployments also return 404 responses for a significant period before they actually work.
I noticed from the logs that backend reloads take significant time (~15 minutes) when the cluster has that many active ingress resources.

Background on the issue:

  • Initially, the 404 responses appeared now and then when my nginx controller configuration had no pod requests/limits. After introducing requests/limits, the issue stopped appearing at normal load (~20-40 applications with ingresses).
  • Stress testing the cluster with more deployments in burst scenarios caused restarts and CrashLoopBackOffs for the nginx controller pods, mainly because of high pod memory needs. The solution was to use dedicated nodes and higher limits (4 vCPU, 12Gi) for the nginx pods.

Additional information:

  • The ingress objects are created/deleted dynamically based on the applications that need to be deployed on the cluster.
  • There are 3 replica pods of the nginx controller, deployed on 3 dedicated nodes.
  • Data from Prometheus/Grafana show that CPU/RAM usage of the pods is ~2 vCPU and 8-10Gi at maximum load.
  • ModSecurity (WAF) is enabled for each ingress object.
  • The ingress objects are created during the deployment of different applications using Helm. The ingresses are tested and don't contain errors.

Is this behaviour normal with the open-source version of nginx, or am I missing something that could help?

What you expected to happen:

After the DNS records are functional, NGINX should serve the applications without responding with 404 errors.
The 40+ minute wait is not viable in my use case.

NGINX Ingress controller version:

Kubernetes version:

  • Client Version: v1.25.0
  • Kustomize Version: v4.5.7
  • Server Version: v1.21.14-eks-fb459a0

Environment:

  • Cloud provider or hardware configuration: EKS

  • OS: Alpine Linux v3.14

  • Kernel: Linux nginx-controller-ingress-nginx-controller-85f55cdd6-n5s8r 5.4.209-116.367.amzn2.x86_64 #1 SMP Wed Aug 31 00:09:52 UTC 2022 x86_64 Linux

  • How was the ingress-nginx-controller installed: Using Helm

    • Helm deployment details:

      NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
      nginx-controller        nginx           2               2023-01-09 08:09:45.600814602 +0000 UTC deployed        ingress-nginx-4.0.19    1.1.3
      
    • Values:

      USER-SUPPLIED VALUES:
      controller:
        admissionWebhooks:
          enabled: false
          timeoutSeconds: 30
        config:
          max-worker-connections: "1024"
          proxy-body-size: 0m
          proxy-real-ip-cidr: 10.0.0.0/16
          server-snippet: |
            listen 8000;
            if ( $server_port = 80 ) {
                return 308 https://$host$request_uri;
            }
          ssl-redirect: "false"
          use-forwarded-headers: "true"
          worker-processes: "4"
        containerPort:
          http: 80
          https: 443
          special: 8000
        nodeSelector:
          purpose: nginx-workload
        replicaCount: 3
        resources:
          limits:
            cpu: 4000m
            memory: 12000Mi
          requests:
            cpu: 2000m
            memory: 6000Mi
        service:
          annotations:
            service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
            service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:<redacted>
            service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
            service.beta.kubernetes.io/aws-load-balancer-type: nlb
          loadBalancerSourceRanges:
          - <redacted IP addresses>
          targetPorts:
            http: http
            https: special
        tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: nginx-group
      
@ctheodoropoulos ctheodoropoulos added the kind/bug Categorizes issue or PR as related to a bug. label Jan 9, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 9, 2023
@k8s-ci-robot
Contributor

@ctheodoropoulos: This issue is currently awaiting triage.

If the Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

/remove-kind bug

  • Creating an ingress does not involve the ingress-controller code alone. The api-server, the network, and other factors matter too. No useful comments can be made without all of that data.
  • Creating ingresses, or any other Kubernetes objects, at breakneck speed (for example, in an infinite loop) is not tested by the project, precisely because of the factors above that are outside this project's control.
  • After looking at the monitoring data for the cluster, increase the network speed and the CPU/memory resources. The resources available to the components involved determine the speed at which those processes run, if speed is the only goal.
  • Kubectl has a wait feature with a default timeout of 30s; it checks that a given object reaches a given condition. Use the wait feature and increase that timeout so that each subsequent event occurs only after the previous one has reached the desired state. That would at least make the tests feasible, compared with an infinite-loop or non-GitOps workflow.
  • Provide data that hints at possible action items for the ingress-nginx controller.

Please reopen this issue after you have edited the original message and provided the information asked for by the template, when you create a new issue.
There are no resources here dedicated to support. This is better discussed on the Kubernetes Slack, where there are lots of users and experts.

/close

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jan 9, 2023
@k8s-ci-robot
Contributor

@longwuyuan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pawi

pawi commented May 29, 2024

@ctheodoropoulos we had a similar issue. During upscaling of the nginx-controller, we received HTTP 404 responses from the default backend.

The nginx-controller pod already reported readiness through the /healthz endpoint; however, around 2400 vHost entries were still being created, resulting in HTTP 404 responses.

We solved it by increasing the initialDelay of the readiness probe.
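As a hedged sketch, the fix described above could look like this in the ingress-nginx chart's Helm values (the chart exposes a controller.readinessProbe block; 120 seconds is an illustrative value, not necessarily the one used here):

```yaml
controller:
  readinessProbe:
    # Delay the first readiness check so the controller has time to
    # render and load an nginx.conf with thousands of vHost entries
    # before the Service starts routing traffic to this pod.
    initialDelaySeconds: 120  # illustrative; tune to your reload time
    periodSeconds: 10
```

The trade-off is slower pod startup during scale-up, in exchange for never marking a pod Ready while its configuration is still incomplete.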
