
NGINX controller on kubernetes returns 404 "service not found" in stress-test scenario with ~200 ingress objects #9495

Closed
ctheodoropoulos opened this issue Jan 9, 2023 · 4 comments
Labels
needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@ctheodoropoulos

What happened:

I started receiving 404 "Not Found" responses from nginx when the number of ingress resources in the cluster rises during a stress test (~200 new ingress objects created within 2-3 minutes).
The ingresses are created by the deployment of separate applications, each of which contains an ingress resource.
In the normal scenario, nginx serves my applications correctly via the created DNS after 1-3 minutes (the time needed for DNS record creation).
In the stress-test scenario, requests return 404 "Not Found" for more than 40 minutes before all applications are finally served correctly.
After the stress test, with ~200 ingress objects in place, new ingress objects from subsequent deployments also return 404 responses for a significant period before they actually work.
I noticed from the logs that backend reloads take significant time (~15 minutes) when the cluster has that many active ingress resources.

Background on the issue:

  • Initially, the 404 responses appeared now and then when my nginx controller configuration had no pod requests/limits. After introducing requests/limits, the issue stopped appearing at normal load (~20-40 applications with ingresses).
  • Stress testing the cluster with more deployments in burst scenarios caused restarts and CrashLoopBackOffs for the nginx controller pods, mainly because of high pod memory needs. The solution was to use dedicated nodes and higher limits (4 vCPU, 12Gi) for the nginx pods.

Additional information:

  • The ingress objects are created/deleted dynamically based on the applications that need to be deployed on the cluster.
  • There are 3 replica pods of the nginx controller, deployed on 3 dedicated nodes.
  • Data from Prometheus/Grafana show that CPU/RAM usage of the pods is ~2 vCPU and 8-10Gi at maximum load.
  • ModSecurity (WAF) is enabled for each ingress object.
  • The ingress objects are created during the deployment of different applications using Helm. The ingresses are tested and don't contain errors.

Is this behaviour normal with the open-source version of nginx, or am I missing something that could help?

What you expected to happen:

After the DNS records are functional, NGINX should serve the applications without responding with 404 errors.
The 40+ minute wait is not viable in my use case.

NGINX Ingress controller version:

Kubernetes version:

  • Client Version: v1.25.0
  • Kustomize Version: v4.5.7
  • Server Version: v1.21.14-eks-fb459a0

Environment:

  • Cloud provider or hardware configuration: EKS

  • OS: Alpine Linux v3.14

  • Kernel: Linux nginx-controller-ingress-nginx-controller-85f55cdd6-n5s8r 5.4.209-116.367.amzn2.x86_64 #1 SMP Wed Aug 31 00:09:52 UTC 2022 x86_64 Linux

  • How was the ingress-nginx-controller installed: Using Helm

    • Helm deployment details:

      NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
      nginx-controller        nginx           2               2023-01-09 08:09:45.600814602 +0000 UTC deployed        ingress-nginx-4.0.19    1.1.3
      
    • Values:

      USER-SUPPLIED VALUES:
      controller:
        admissionWebhooks:
          enabled: false
          timeoutSeconds: 30
        config:
          max-worker-connections: "1024"
          proxy-body-size: 0m
          proxy-real-ip-cidr: 10.0.0.0/16
          server-snippet: |
            listen 8000;
            if ( $server_port = 80 ) {
                return 308 https://$host$request_uri;
            }
          ssl-redirect: "false"
          use-forwarded-headers: "true"
          worker-processes: "4"
        containerPort:
          http: 80
          https: 443
          special: 8000
        nodeSelector:
          purpose: nginx-workload
        replicaCount: 3
        resources:
          limits:
            cpu: 4000m
            memory: 12000Mi
          requests:
            cpu: 2000m
            memory: 6000Mi
        service:
          annotations:
            service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
            service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:<redacted>
            service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
            service.beta.kubernetes.io/aws-load-balancer-type: nlb
          loadBalancerSourceRanges:
          - <redacted IP addresses>
          targetPorts:
            http: http
            https: special
        tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: nginx-group
      
@ctheodoropoulos ctheodoropoulos added the kind/bug Categorizes issue or PR as related to a bug. label Jan 9, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 9, 2023
@k8s-ci-robot
Contributor

@ctheodoropoulos: This issue is currently awaiting triage.

If the Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

/remove-kind bug

  • Creating an ingress does not involve the ingress-controller code alone. The api-server, the network, and other factors matter too. No useful comments can be made without all of that data.
  • Creating ingresses, or any other Kubernetes objects, at breakneck speed (for example, in an infinite loop) is not tested by the project, precisely because of the factors above that are outside this project's control.
  • After looking at the monitoring data for the cluster, increase the network speed and the CPU/memory resources. The resources available to the components involved determine the speed at which those processes run, if speed is the only goal.
  • Kubectl has a wait feature with a default timeout of 30s; it checks that a given object reaches a given condition. Use the wait feature and increase that timeout so that each subsequent event occurs only after the previous one has reached the desired state. That would at least make the tests feasible, compared with an infinite-loop or non-GitOps workflow.
  • Provide data that hints at possible action items for the ingress-nginx controller.

Please reopen this issue after you have edited the original message and provided the information asked for by the template, when you create a new issue.
There are no resources here dedicated to support. This is better discussed on the Kubernetes Slack, where there are lots of users and experts.

/close

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jan 9, 2023
@k8s-ci-robot
Contributor

@longwuyuan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pawi

pawi commented May 29, 2024

@ctheodoropoulos we had a similar issue. During upscaling of the nginx-controller, we received HTTP 404 responses from the default backend.

The nginx-controller pod already reported readiness through the /healthz endpoint; however, around 2400 vHost entries were still being created, resulting in HTTP 404 responses.

We solved it by increasing the initialDelay of the readiness probe.
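As a hedged sketch, the fix described above could look like this in the ingress-nginx chart's Helm values (the chart exposes a controller.readinessProbe block; 120 seconds is an illustrative value, not necessarily the one used here):

```yaml
controller:
  readinessProbe:
    # Delay the first readiness check so the controller has time to
    # render and load an nginx.conf with thousands of vHost entries
    # before the Service starts routing traffic to this pod.
    initialDelaySeconds: 120  # illustrative; tune to your reload time
    periodSeconds: 10
```

The trade-off is slower pod startup during scale-up, in exchange for never marking a pod Ready while its configuration is still incomplete.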
