
Upstream Prematurely Closed Connection While Reading Response Header From Upstream #11244

Closed
ChristianBieri1995 opened this issue Apr 10, 2024 · 15 comments
Labels
kind/support Categorizes issue or PR as a support question. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@ChristianBieri1995

ChristianBieri1995 commented Apr 10, 2024

What happened:

To route traffic into my Kubernetes cluster, I use an F5 load balancer, nginx ingress controllers (scaled by a horizontal pod autoscaler) and k8s ingress objects. With this setup, I receive 502 HTTP status codes on a regular basis. Although the percentage is low (0.003%), the errors cannot be neglected, as millions of requests are handled daily.

After spending hours narrowing down the issue, I stumbled upon the following error: 2024/04/10 09:13:00 [error] 3980#3980: *29815576 upstream prematurely closed connection while reading response header from upstream

With that information at hand, I skimmed countless webpages to identify the root cause. After checking the most obvious culprits, such as

  • proxy_read/send/connect_timeout
  • proxy_set_header Connection "";
  • proxy_max_temp_file_size
  • X-Real-IP, X-Forwarded-For
  • proxy_buffer_size

I made no progress. As a next step, I set error-log-level: debug to capture the corresponding logs (relevant parts below):

2024-03-26T16:21:04.076718973+01:00 stderr F 2024/03/26 15:21:04 [debug] 136#136: *943568 recv: fd:46 0 of 1048576
2024-03-26T16:21:04.076726784+01:00 stderr F 2024/03/26 15:21:04 [error] 136#136: *943568 upstream prematurely closed connection while reading response header from upstream (here, I left out some irrelevant parts)
2024-03-26T16:21:04.076750704+01:00 stderr F 2024/03/26 15:21:04 [debug] 136#136: *943568 http next upstream, 2
2024-03-26T16:21:04.076756887+01:00 stderr F 2024/03/26 15:21:04 [debug] 136#136: *943568 free keepalive peer
2024-03-26T16:21:04.076762948+01:00 stderr F 2024/03/26 15:21:04 [debug] 136#136: *943568 lua balancer free peer, tries: 2
2024-03-26T16:21:04.076768572+01:00 stderr F 2024/03/26 15:21:04 [debug] 136#136: *943568 finalize http upstream request: 502
2024-03-26T16:21:04.076774231+01:00 stderr F 2024/03/26 15:21:04 [debug] 136#136: *943568 finalize http proxy request
2024-03-26T16:21:04.07677987+01:00 stderr F 2024/03/26 15:21:04 [debug] 136#136: *943568 close http upstream connection: 46
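For completeness, a minimal sketch of how the debug logging above can be enabled through the controller ConfigMap, assuming the same Helm values layout shown further below (controller.config):

controller:
  config:
    # raise the nginx error log verbosity to capture the upstream/keepalive handling;
    # very noisy, so revert once the traces have been collected
    error-log-level: "debug"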

Unfortunately, this did not help either, but it indicated that there may be an issue with reused network connections (HTTP/1.1). Therefore, I added nginx.ingress.kubernetes.io/proxy-http-version: "1.0" to the relevant k8s ingress object, and behold: no more 502 HTTP status codes. I was able to replicate this behaviour not only in my test environment but also on more relevant stages.

In my view, there seems to be an issue with reusing established connections under HTTP/1.1 - probably related to my nginx.conf.
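For reference, a minimal sketch of how the workaround annotation is applied, using the ingress object shown further below (the host is a placeholder, as the real one is redacted):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: bs-simplesample-az-srv-tst-business-services
  namespace: kern-sample-bsfw-tst
  annotations:
    # workaround: talk HTTP/1.0 to the upstream so connections are not reused
    nginx.ingress.kubernetes.io/proxy-http-version: "1.0"
spec:
  rules:
  - host: example.invalid   # placeholder, real host redacted
    http:
      paths:
      - path: /sample-bsfw-simplesampleaz-ws/
        pathType: Prefix
        backend:
          service:
            name: bs-simplesample-az-srv-tst
            port:
              name: http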

NGINX Ingress version: nginx version: nginx/1.21.6

Kubernetes version:
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5+rke2r1

Environment:

  • Cloud provider or hardware configuration: Rancher, rke2

  • OS : NAME="SLES", VERSION="15-SP5", VERSION_ID="15.5"

  • Kernel: Linux 5.14.21-150500.55.52-default #1 SMP PREEMPT_DYNAMIC Tue Mar 5 16:53:41 UTC 2024 (a62851f) x86_64

  • Install tools:

    • Terraform
  • Basic cluster related info:

    • 20 k8s worker nodes, 8vCores/64GB RAM or 32vCores/256GB RAM
    • 3 k8s controlplane nodes
  • How was the ingress-nginx-controller installed:

    • The nginx ingress controllers are deployed by a deployment which makes use of a horizontal pod autoscaler
    • The installation of the nginx ingress controllers was conducted via helm/helmfile and the following values.yaml was used (not exhaustive):
controller:
  config:
    client-header-buffer-size: "1024k"
    large-client-header-buffers: "4 1024k"
    proxy-buffer-size: "1024k"
    proxy-send-timeout: "90"
    proxy-read-timeout: "200"
    use-proxy-protocol: "true"
    force-ssl-redirect: "true"
    log-format-escaping: json
    log-format-upstream: '{"time": "$time_iso8601", "remote_addr": "$remote_addr", "remote_user": "$remote_user", "request": "$request",
      "response_status": $status, "body_bytes_sent": $body_bytes_sent, "http_referer": "$http_referer", "http_user_agent": "$http_user_agent",
      "request_length": $request_length, "request_time": $request_time, "proxy_upstream_name": "[$proxy_upstream_name]",
      "proxy_alternative_upstream_name": "[$proxy_alternative_upstream_name]", "upstream_addr": "$upstream_addr", "upstream_response_length": $upstream_response_length,
      "upstream_response_time": $upstream_response_time, "upstream_status": $upstream_status, "req_id": "$req_id"}'

  extraArgs:
    enable-ssl-passthrough: ""

  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 4
    targetCPUUtilizationPercentage: 50
    targetMemoryUtilizationPercentage: 90

  resources:
    limits:
      cpu: 800m
      memory: 800Mi
    requests:
      cpu: 100m
      memory: 300Mi

  service:
    type: NodePort
    externalTrafficPolicy: Local
    nodePorts:
      http: 32080
      https: 32443
  • Current State of the controller:
    • ingressclasses.networking.k8s.io:
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.9.6
              helm.sh/chart=ingress-nginx-4.9.1
Annotations:  argocd.argoproj.io/tracking-id: infra-ingress.infra-ingress:networking.k8s.io/IngressClass:infra-ingress-tst/nginx
Controller:   k8s.io/ingress-nginx
Events:       <none>
  • kubectl -n <ingresscontrollernamespace> get all -A -o wide:
NAME                                                    READY   STATUS    RESTARTS      AGE
pod/ingress-nginx-tst-controller-5bdbb97d9-dv8kb        1/1     Running   1 (16d ago)   21d
pod/ingress-nginx-tst-defaultbackend-76d48c597b-265xt   1/1     Running   0             21d

NAME                                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/ingress-nginx-tst-controller           NodePort    x.x.x.x          <none>        80:32080/TCP,443:32443/TCP   176d
service/ingress-nginx-tst-controller-metrics   ClusterIP   x.x.x.x          <none>        10254/TCP                    176d
service/ingress-nginx-tst-defaultbackend       ClusterIP   x.x.x.x          <none>        80/TCP                       176d

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ingress-nginx-tst-controller       1/1     1            1           176d
deployment.apps/ingress-nginx-tst-defaultbackend   1/1     1            1           176d

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/ingress-nginx-tst-controller-5bdbb97d9        1         1         1       58d
replicaset.apps/ingress-nginx-tst-defaultbackend-76d48c597b   1         1         1       58d

NAME                                                               REFERENCE                                 TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/ingress-nginx-tst-controller   Deployment/ingress-nginx-tst-controller   62%/90%, 10%/50%   1         1         1          176d

--> Please note that the replicas/HPA values were set to 1 for easier identification of the issues described. The issue also arises with multiple replicas, however.


  • kubectl -n <ingresscontrollernamespace> describe pod <ingresscontrollerpodname>:
Name:             ingress-nginx-tst-controller-5bdbb97d9-dv8kb
Namespace:        infra-ingress-tst
Priority:         0
Service Account:  default
Node:             x.x.x.x
Start Time:       Wed, 03 Apr 2024 14:23:15 +0200
Labels:           app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=ingress
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=ingress-nginx
                  app.kubernetes.io/part-of=ingress-nginx
                  app.kubernetes.io/version=1.9.6
                  helm.sh/chart=ingress-nginx-4.9.1
                  pod-template-hash=5bdbb97d9
Annotations:      container.apparmor.security.beta.kubernetes.io/controller: localhost/k8s-ingress-controller
                  kubernetes.io/limit-ranger:
                    LimitRanger plugin set: cpu, memory request for init container opentelemetry; cpu, memory limit for init container opentelemetry
Status:           Running
IP:               x.x.x.x
IPs:
  IP:           x.x.x.x
Controlled By:  ReplicaSet/ingress-nginx-tst-controller-5bdbb97d9
Init Containers:
  opentelemetry:
    Container ID:    containerd://ab2e25842abdd00da17f3404fcea0842961be8301430fe70a7f7e9ff4ee9e2e4
    Image:           x/ingress-nginx/opentelemetry:v20230721-3e2062ee5@sha256:13bee3f5223883d3ca62fee7309ad02d22ec00ff0d7033e3e9aca7a9f60fd472
    Image ID:        x/ingress-nginx/opentelemetry@sha256:13bee3f5223883d3ca62fee7309ad02d22ec00ff0d7033e3e9aca7a9f60fd472
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      /init_module
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 03 Apr 2024 14:23:15 +0200
      Finished:     Wed, 03 Apr 2024 14:23:15 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     250m
      memory:  100Mi
    Requests:
      cpu:        100m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /modules_mount from modules (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-54wkm (ro)
Containers:
  controller:
    Container ID:    containerd://1461ce776d455eab59334df36faf03b0c19080e6efe727a4c940271429706bba
    Image:           x/ingress-nginx/controller:v1.9.6@sha256:1405cc613bd95b2c6edd8b2a152510ae91c7e62aea4698500d23b2145960ab9c
    Image ID:        x/ingress-nginx/controller@sha256:1405cc613bd95b2c6edd8b2a152510ae91c7e62aea4698500d23b2145960ab9c
    Ports:           80/TCP, 443/TCP, 10254/TCP
    Host Ports:      0/TCP, 0/TCP, 0/TCP
    SeccompProfile:  RuntimeDefault
    Args:
      /nginx-ingress-controller
      --default-backend-service=$(POD_NAMESPACE)/ingress-nginx-tst-defaultbackend
      --publish-service=$(POD_NAMESPACE)/ingress-nginx-tst-controller
      --election-id=ingress-nginx-tst-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/ingress-nginx-tst-controller
      --watch-ingress-without-class=true
      --default-ssl-certificate=infra-ingress-tst/tls-certificates-tst-wildcard
      --enable-ssl-passthrough
    State:          Running
      Started:      Tue, 09 Apr 2024 07:18:25 +0200
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 03 Apr 2024 14:23:16 +0200
      Finished:     Tue, 09 Apr 2024 07:18:25 +0200
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     800m
      memory:  800Mi
    Requests:
      cpu:      100m
      memory:   300Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       ingress-nginx-tst-controller-5bdbb97d9-dv8kb (v1:metadata.name)
      POD_NAMESPACE:  infra-ingress-tst (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /modules_mount from modules (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-54wkm (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  modules:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-54wkm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

  • kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>:
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2023-10-31T13:17:04Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
    app.kubernetes.io/version: 1.9.6
    helm.sh/chart: ingress-nginx-4.9.1
  name: ingress-nginx-tst-controller
  namespace: infra-ingress-tst
  resourceVersion: "175091575"
  uid: 14205f77-5c71-4e4b-ad3a-f258b01888b6
spec:
  clusterIP: x.x.x.x
  clusterIPs:
  - x.x.x.x
  externalTrafficPolicy: Local
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: http
    name: http
    nodePort: 32080
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    nodePort: 32443
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress
    app.kubernetes.io/name: ingress-nginx
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}

  • Current state of ingress object, if applicable:
    • kubectl -n <appnamespace> describe ing <ingressname>:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
  creationTimestamp: "2024-04-03T09:42:21Z"
  generation: 1
  labels:
    app: bs-simplesample-az-srv
    app.kubernetes.io/instance: bs-simplesample-az-srv
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: bs-simplesample-az-srv
    chart: generic-openliberty-2.3.4
    helm.sh/chart: generic-openliberty-2.3.4
    heritage: Helm
    release: bs-simplesample-az-srv
  name: bs-simplesample-az-srv-tst-business-services
  namespace: kern-sample-bsfw-tst
  resourceVersion: "181093092"
  uid: 0c38ad0b-d4a4-4ecc-9db5-183b63143b4e
spec:
  rules:
  - host: x
    http:
      paths:
      - backend:
          service:
            name: bs-simplesample-az-srv-tst
            port:
              name: http
        path: /sample-bsfw-simplesampleaz-ws/
        pathType: Prefix
  tls:
  - hosts:
    - x
    secretName: tls-certificates-tst-x
status:
  loadBalancer:
    ingress:
    - ip: x.x.x.x
  • Others:
    • nginx.conf
reset_timedout_connection on;
keepalive_timeout  75s;
keepalive_requests 1000;

Could there be an issue with timeouts related to established connections?
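For context, these are the knobs I understand to govern connection reuse in the ingress-nginx ConfigMap; a sketch with illustrative values (not necessarily what we currently run), in the same Helm values layout as above:

controller:
  config:
    # upstream side (controller -> backend) keepalive; illustrative values
    upstream-keepalive-connections: "320"
    upstream-keepalive-requests: "1000"
    upstream-keepalive-timeout: "60"
    # client side (F5 -> controller) keepalive
    keep-alive: "75"
    keep-alive-requests: "1000"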

How to reproduce this issue: simply access the backend via the F5 and the nginx ingress controller. To do so, I created a load tester.

@ChristianBieri1995 ChristianBieri1995 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 10, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 10, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

/remove-kind bug

  • Using an F5 LB and sending traffic manually to the ingress-controller over a NodePort is not tested in CI
  • The most common use-case where debugging of L7 nuances becomes viable is when a service of type LoadBalancer is created by the installation of the ingress-nginx controller, which in turn triggers an LB to be provisioned by the infra-provider, e.g. the cloud-provider or an alternative like MetalLB
  • Yes, the timeouts configured on the controller have an impact if the use-case matches, but troubleshooting such a use-case is not the same as referring to a CI test
  • Edit the issue description and provide the information asked for in the new-issue template. That way a reader at least gets the related info to analyse

/triage needs-information
/kind support

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 10, 2024
@ChristianBieri1995
Author

Thank you for the prompt reply.

  • Using an F5 LB and sending traffic manually to the ingress-controller over a NodePort is not tested in CI
  • The most common use-case where debugging of L7 nuances becomes viable is when a service of type LoadBalancer is created by the installation of the ingress-nginx controller, which in turn triggers an LB to be provisioned by the infra-provider, e.g. the cloud-provider or an alternative like MetalLB

Sorry for the confusion, I only added this information for people to better understand my cluster architecture. The issue arises in the nginx ingress controller.

  • Yes, the timeouts configured on the controller have an impact if the use-case matches, but troubleshooting such a use-case is not the same as referring to a CI test
  • Edit the issue description and provide the information asked for in the new-issue template. That way a reader at least gets the related info to analyse

You mention "provide information as asked", but I can't find which information was asked for. Also, where should I open a new issue, and with which template?

Thank you in advance for your reply.

@longwuyuan
Contributor

  • Do not create a new bug or issue. Just read the template

    • Click open new bug report
    • Look at the default template of the new bug report
    • Then edit your issue description and answer the questions asked in the template in this issue's description
  • The information about the F5 & controller-service-type-NodePort is important and it was required, so thank you. The implication, however, is this: if you install this project's ingress-nginx controller as per the documentation and thereby create a service of type LoadBalancer, a bare-metal environment is better served if you have deployed Metallb.org, as that is widely used and you can expect other MetalLB users to comment. If you create a service of type NodePort, then the connection is terminated on the F5 and not on the controller. That is a very different use-case and is not tested, so you are in control of how the L3/4 and L7 traffic behaves.

  • You can expect more involvement if you terminate the connections from external clients on the ingress-controller. As it stands, you have to wait for experts who know about terminating connections on an F5 and then establishing new connections from the F5 to the NodePort, and so on.

  • If the controller dropped/denied/interrupted connections and was the root cause of the 502 responses, then you have to provide detailed step-by-step instructions on how someone can reproduce the problem on a minikube or kind cluster. Otherwise it becomes difficult to know which part of the ingress-nginx controller code is causing the problem.

@ChristianBieri1995
Author

Thank you for the clarification, I will update it asap!

@ChristianBieri1995
Author

  • Do not create a new bug or issue. Just read the template

    • Click open new bug report
    • Look at the default template of the new bug report
    • Then edit your issue description and answer the questions asked in the template in this issue's description

Done, I have provided as much information as possible related to the issue described above.

  • The information about the F5 & controller-service-type-NodePort is important and it was required, so thank you. The implication, however, is this: if you install this project's ingress-nginx controller as per the documentation and thereby create a service of type LoadBalancer, a bare-metal environment is better served if you have deployed Metallb.org, as that is widely used and you can expect other MetalLB users to comment. If you create a service of type NodePort, then the connection is terminated on the F5 and not on the controller. That is a very different use-case and is not tested, so you are in control of how the L3/4 and L7 traffic behaves.

If I understand you correctly, you recommend switching to a service of type LoadBalancer, leveraging MetalLB? However, using a service of type NodePort seems to be supported as well, according to https://docs.nginx.com/nginx-ingress-controller/installation/installing-nic/installation-with-manifests/#option-1-create-a-nodeport-service. Is the problem that we use the F5 in combination with the service type NodePort?

  • You can expect more involvement if you terminate the connections from external clients on the ingress-controller. As it stands, you have to wait for experts who know about terminating connections on an F5 and then establishing new connections from the F5 to the NodePort, and so on.

I am not sure whether termination on the ingress controllers is possible; I have to check this internally with my team. Nonetheless, does it even matter what happens "before" the ingress controller, given that the issue seems to arise when connections between the ingress controllers and the backends are established/reused?

  • If the controller dropped/denied/interrupted connections and was the root cause of the 502 responses, then you have to provide detailed step-by-step instructions on how someone can reproduce the problem on a minikube or kind cluster. Otherwise it becomes difficult to know which part of the ingress-nginx controller code is causing the problem.

Unfortunately, I am not able to provide any further information beyond the above. However, the problem arises on a regular basis (0.003% of all requests), independent of the backend service. Therefore, with the setup described above it should be reproducible anyway.

@longwuyuan
Contributor

My point was that you terminate on the F5 and then establish a new connection from the F5 to ingress-nginx. This is not tested in the project, and the relevance here is that there is no official stance on how your F5 works at Layer 3/4 and Layer 7. Nobody in the project can or will test this for you or try to reproduce it for you.

Other users of an F5 in front of the NodePort service of the ingress-nginx controller may know better and comment here.

@ChristianBieri1995
Author

Got it, let's see if there is someone who can relate to the issue stated.

@ChristianBieri1995
Author

My point was that you terminate on the F5 and then establish a new connection from the F5 to ingress-nginx. This is not tested in the project, and the relevance here is that there is no official stance on how your F5 works at Layer 3/4 and Layer 7. Nobody in the project can or will test this for you or try to reproduce it for you.

Other users of an F5 in front of the NodePort service of the ingress-nginx controller may know better and comment here.

Just to add to that:

  • The TCP connection is terminated on the F5 and re-established in order to connect with the ingress controller(s), correct. However, TLS termination takes place on the nginx ingress controllers within the cluster (as that is also where the certificates, residing in k8s secrets, are accessible)
  • I still do not understand what role/relevance the F5 really has, as the issue seems to arise when establishing/reusing connections from the ingress controllers towards the backend(s). Based on this assumption, I do not see why a different load balancer (in this case MetalLB) should resolve the issue. Do you have any insights/further information on that?
  • Last but not least, I would again like to mention that this issue does not occur when using HTTP version 1.0. The main difference there, to my understanding at least, is that established connections are not reused - therefore, I strongly assume that the issue has something to do with reusing connections

@longwuyuan
Contributor

longwuyuan commented Apr 26, 2024

Sorry if my comments are not clear. The way I understand this is:

  • You use an F5 at the edge and a service of type NodePort for the controller. But the project CI does not test this kind of use-case, so how HTTP/HTTPS requests behave in this use-case is unknown. It's possible to make guesses, but comments based on guesses are not appropriate since you have a problem that needs to be solved.

  • It's great to see the multiple ways the controller is being used, but there is no data to look up in the CI logs related to this use-case and it's not possible to test or reproduce this in the project. Maybe other users of F5+NodePort can comment and thus add value.

  • It's not possible to discuss the connection to the controller on a port like 30000 coming from an F5, because how the F5 handles Layer 3/4/7 traffic and terminates HTTP/HTTPS on a host's NodePort like 30000 is not tested in CI and may or may not relate to errors/problems occurring beyond that NodePort 30000 of the host's TCP/IP stack.

  • A different LoadBalancer like MetalLB provisions an external IP for the service of type LoadBalancer created by the ingress-nginx controller installation. Without diving deeper, this is markedly different networking from your use-case, albeit with similarities in L3/4 routing. The point being: it's possible to comment on the MetalLB use-case, as how HTTP/HTTPS requests work there is known. But it's hard to comment on this F5 use-case, as I don't know how that works in detail for HTTP/HTTPS.

  • The basis of the investigation here is the error code being 502, so all of the above factors influence who sent the 502. The controller logs may have some 502 indications, and based on those a live check has to be done on the upstream that sent the 502. Once again, it's possible to guess that there was a momentary resource starvation behind a 502, but only you can check that live, when the 502 occurs. The resources involved could be CPU, memory, inodes, conntrack, etc. Or the backend pod could simply have been too busy processing and delayed its response. The point being: if there is a fault in the controller, then it should be reproducible on a minikube or kind cluster. I can't reproduce a 502 in a normal test and the CI does not show 502s in the e2e-tests, hence I am not certain where we go from here.

@ChristianBieri1995
Author

Sorry if my comments are not clear. The way I understand this is:

  • You use an F5 at the edge and a service of type NodePort for the controller. But the project CI does not test this kind of use-case, so how HTTP/HTTPS requests behave in this use-case is unknown. It's possible to make guesses, but comments based on guesses are not appropriate since you have a problem that needs to be solved.

Sounds legit and is totally understandable.

  • It's great to see the multiple ways the controller is being used, but there is no data to look up in the CI logs related to this use-case and it's not possible to test or reproduce this in the project. Maybe other users of F5+NodePort can comment and thus add value.

Let's see if there is anyone in the community using an F5, a NodePort service and the nginx ingress controller.

  • It's not possible to discuss the connection to the controller on a port like 30000 coming from an F5, because how the F5 handles Layer 3/4/7 traffic and terminates HTTP/HTTPS on a host's NodePort like 30000 is not tested in CI and may or may not relate to errors/problems occurring beyond that NodePort 30000 of the host's TCP/IP stack.
  • A different LoadBalancer like MetalLB provisions an external IP for the service of type LoadBalancer created by the ingress-nginx controller installation. Without diving deeper, this is markedly different networking from your use-case, albeit with similarities in L3/4 routing. The point being: it's possible to comment on the MetalLB use-case, as how HTTP/HTTPS requests work there is known. But it's hard to comment on this F5 use-case, as I don't know how that works in detail for HTTP/HTTPS.

Got it, thank you for the clarification!

  • The basis of the investigation here is the error code being 502, so all of the above factors influence who sent the 502. The controller logs may have some 502 indications, and based on those a live check has to be done on the upstream that sent the 502. Once again, it's possible to guess that there was a momentary resource starvation behind a 502, but only you can check that live, when the 502 occurs. The resources involved could be CPU, memory, inodes, conntrack, etc. Or the backend pod could simply have been too busy processing and delayed its response. The point being: if there is a fault in the controller, then it should be reproducible on a minikube or kind cluster. I can't reproduce a 502 in a normal test and the CI does not show 502s in the e2e-tests, hence I am not certain where we go from here.

I already checked resource starvation and backend pod processing. I fully understand your notes about CI and reproduction, though. I guess the only thing we can do right now is wait for someone who knows more about F5 + NodePort + the nginx ingress controller.

@longwuyuan
Contributor

@ChristianBieri1995 there is no data here that points at a problem to be solved in the ingress-nginx controller. Additionally, I know that 502s are reported by others in other issues, and they come from timeouts, backend-pod design, load, etc. The controller is helpless if the upstream responds with a 502.

Also, I think you will get more eyes from F5 users if you message in the Kubernetes Slack, in channels like ingress-nginx-users or plain old kubernetes-users.

We have 490-ish issues open and not all of them track a problem to be solved in the ingress-nginx controller per se. So I will close the issue for now, but feel free to re-open it when data is posted here that indicates work to be done on the controller itself. It will help us manage issues better.

/close

@k8s-ci-robot
Contributor

@longwuyuan: Closing this issue.

In response to this:

@ChristianBieri1995 there is no data here that points at a problem to be solved in the ingress-nginx controller. Additionally, I know that 502s are reported by others in other issues, and they come from timeouts, backend-pod design, load, etc. The controller is helpless if the upstream responds with a 502.

Also, I think you will get more eyes from F5 users if you message in the Kubernetes Slack, in channels like ingress-nginx-users or plain old kubernetes-users.

We have 490-ish issues open and not all of them track a problem to be solved in the ingress-nginx controller per se. So I will close the issue for now, but feel free to re-open it when data is posted here that indicates work to be done on the controller itself. It will help us manage issues better.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ChristianBieri1995
Author

Dear @longwuyuan,

We were finally able to solve the issue (we think so, at least). It arose because of a timeout mismatch (when reusing sessions, HTTP/1.1) between the nginx ingress controller and our OpenLiberty backend.

Within the nginx ingress controller, the following (relevant) parameters related to session reuse are set: keepalive_timeout 75s; keepalive_requests 1000;

In the OpenLiberty backend, the "persistTimeout" configuration was set to 60s.

As a result, whenever a session that has been idle for longer than "persistTimeout" is reused by the nginx ingress controller, the backend is no longer aware of it (the session was prematurely closed). Even though you indicated multiple times that this is not an issue of the nginx ingress controller, I am not totally sure about that. Should the nginx ingress controller not be informed when sessions are closed by the OpenLiberty backend, so that it does not reuse them? As far as I know, we are not the only victims of this (in my view faulty) behaviour.
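For anyone hitting the same mismatch, a sketch of the mitigation we are evaluating, in the Helm values layout shown above; the value 50 is illustrative, chosen only to stay below the backend's persistTimeout of 60s:

controller:
  config:
    # close idle upstream (controller -> backend) connections before the backend does,
    # so nginx never reuses a connection OpenLiberty has already dropped
    upstream-keepalive-timeout: "50"

The alternative would be to raise persistTimeout on the OpenLiberty side so that it comfortably exceeds the controller's upstream idle timeout.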

I am really looking forward to your feedback.

@zeppelinux

I'm getting a very similar error:
[error] 45#45: *12827043 upstream prematurely closed connection while reading response header from upstream

My nginx ingress controller communicates (sending POST requests) with a Wordpress/Printful WooCommerce Integration/3.0.0 backend.
