
proxy-next-upstream (including default on error and timeout) does not always pick a different upstream depending on load balancer and concurrent requests #11852

Open
marvin-roesch opened this issue Aug 23, 2024 · 17 comments

@marvin-roesch

marvin-roesch commented Aug 23, 2024

What happened:
When one backend pod fails under a condition covered by proxy_next_upstream (e.g. http_404 for easy testing) and there is a large volume of requests, any one request may reuse the same backend for all tries rather than actually using the "next" backend. This definitely happens with the default round-robin balancer, and most likely with all balancer implementations.

What you expected to happen:
If a backend request fails due to one of the proxy_next_upstream conditions, it should be retried with at least one of the other available backends, regardless of the configured load balancer or any concurrent requests.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.11.2

Kubernetes version (use kubectl version): 1.28.10

Environment:

  • Cloud provider or hardware configuration: MacBook Pro with Apple M2

  • OS (e.g. from /etc/os-release): Ubuntu 22.04.4 via Multipass on macOS 14.5

  • Kernel (e.g. uname -a): 5.15.0-119-generic

  • Install tools:

    • microk8s
  • Basic cluster related info:

    • kubectl version
      Client Version: v1.31.0
      Kustomize Version: v5.4.2
      Server Version: v1.28.10
      
    • kubectl get nodes -o wide
      NAME          STATUS   ROLES    AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
      microk8s-vm   Ready    <none>   24h   v1.28.10   192.168.64.9   <none>        Ubuntu 22.04.4 LTS   5.15.0-119-generic   containerd://1.6.28
      
  • How was the ingress-nginx-controller installed:

    • microk8s enable ingress
  • Current State of the controller:

    • kubectl describe ingressclasses
      Name:         public
      Labels:       <none>
      Annotations:  ingressclass.kubernetes.io/is-default-class: true
      Controller:   k8s.io/ingress-nginx
      Events:       <none>
      
      Name:         nginx
      Labels:       <none>
      Annotations:  <none>
      Controller:   k8s.io/ingress-nginx
      Events:       <none>
      
    • kubectl -n ingress get all -o wide
      NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
      pod/nginx-ingress-microk8s-controller-4hrss   1/1     Running   0          85m   10.1.254.88   microk8s-vm   <none>           <none>
      
      NAME                                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE   CONTAINERS               IMAGES                                             SELECTOR
      daemonset.apps/nginx-ingress-microk8s-controller   1         1         1       1            1           <none>          25h   nginx-ingress-microk8s   registry.k8s.io/ingress-nginx/controller:v1.11.2   name=nginx-ingress-microk8s
      
    • kubectl -n ingress describe po nginx-ingress-microk8s-controller-4hrss
      Name:             nginx-ingress-microk8s-controller-4hrss
      Namespace:        ingress
      Priority:         0
      Service Account:  nginx-ingress-microk8s-serviceaccount
      Node:             microk8s-vm/192.168.64.9
      Start Time:       Fri, 23 Aug 2024 09:16:22 +0200
      Labels:           controller-revision-hash=5489ccb55d
                        name=nginx-ingress-microk8s
                        pod-template-generation=3
      Annotations:      cni.projectcalico.org/containerID: 94904e61580ee1449befe245d5c84ce11f0b93fb3cda52f9a2a74e26ea81d17b
                        cni.projectcalico.org/podIP: 10.1.254.88/32
                        cni.projectcalico.org/podIPs: 10.1.254.88/32
      Status:           Running
      IP:               10.1.254.88
      IPs:
        IP:           10.1.254.88
      Controlled By:  DaemonSet/nginx-ingress-microk8s-controller
      Containers:
        nginx-ingress-microk8s:
          Container ID:  containerd://56f41296d707602f46a6f5429eb834e401fa9a884ed91106644c5e71f48c73aa
          Image:         registry.k8s.io/ingress-nginx/controller:v1.11.2
          Image ID:      registry.k8s.io/ingress-nginx/controller@sha256:d5f8217feeac4887cb1ed21f27c2674e58be06bd8f5184cacea2a69abaf78dce
          Ports:         80/TCP, 443/TCP, 10254/TCP
          Host Ports:    80/TCP, 443/TCP, 10254/TCP
          Args:
            /nginx-ingress-controller
            --configmap=$(POD_NAMESPACE)/nginx-load-balancer-microk8s-conf
            --tcp-services-configmap=$(POD_NAMESPACE)/nginx-ingress-tcp-microk8s-conf
            --udp-services-configmap=$(POD_NAMESPACE)/nginx-ingress-udp-microk8s-conf
            --ingress-class=public
             
            --publish-status-address=127.0.0.1
          State:          Running
            Started:      Fri, 23 Aug 2024 09:16:36 +0200
          Ready:          True
          Restart Count:  0
          Liveness:       http-get http://:10254/healthz delay=10s timeout=5s period=10s #success=1 #failure=3
          Readiness:      http-get http://:10254/healthz delay=0s timeout=5s period=10s #success=1 #failure=3
          Environment:
            POD_NAME:       nginx-ingress-microk8s-controller-4hrss (v1:metadata.name)
            POD_NAMESPACE:  ingress (v1:metadata.namespace)
          Mounts:
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rjbf6 (ro)
      Conditions:
        Type              Status
        Initialized       True 
        Ready             True 
        ContainersReady   True 
        PodScheduled      True 
      Volumes:
        lb-override:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      nginx-load-balancer-override
          Optional:  false
        kube-api-access-rjbf6:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
      QoS Class:                   BestEffort
      Node-Selectors:              <none>
      Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists
                                   node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/unreachable:NoExecute op=Exists
                                   node.kubernetes.io/unschedulable:NoSchedule op=Exists
      Events:                      <none>
      
  • Current state of ingress object, if applicable:

    • kubectl -n default get all,ing -o wide
      NAME                                       READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
      pod/next-upstream-repro-5d9bb8d6cc-zsckn   1/1     Running   0          84m   10.1.254.90   microk8s-vm   <none>           <none>
      pod/next-upstream-repro-5d9bb8d6cc-5bn7k   1/1     Running   0          84m   10.1.254.91   microk8s-vm   <none>           <none>
      
      NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE   SELECTOR
      service/next-upstream-repro   ClusterIP   10.152.183.21   <none>        80/TCP    86m   app=next-upstream-repro
      
      NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES   SELECTOR
      deployment.apps/next-upstream-repro   2/2     2            2           86m   nginx        nginx    app=next-upstream-repro
      
      NAME                                             DESIRED   CURRENT   READY   AGE   CONTAINERS   IMAGES   SELECTOR
      replicaset.apps/next-upstream-repro-5d9bb8d6cc   2         2         2       84m   nginx        nginx    app=next-upstream-repro,pod-template-hash=5d9bb8d6cc
      
      NAME                                            CLASS   HOSTS     ADDRESS     PORTS   AGE
      ingress.networking.k8s.io/next-upstream-repro   nginx   foo.bar   127.0.0.1   80      86m
      
    • kubectl -n <appnamespace> describe ing <ingressname>
      Name:             next-upstream-repro
      Labels:           <none>
      Namespace:        default
      Address:          127.0.0.1
      Ingress Class:    nginx
      Default backend:  <default>
      Rules:
        Host        Path  Backends
        ----        ----  --------
        foo.bar     
                    /   next-upstream-repro:http (10.1.254.90:80,10.1.254.91:80)
      Annotations:  nginx.ingress.kubernetes.io/proxy-next-upstream: error http_404 timeout
      Events:       <none>
      
  • Others:

    • The backend service is configured to always respond with status 404

How to reproduce this issue:

Install minikube/kind

Install the ingress controller

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/baremetal/deploy.yaml

Install an application with at least 2 pods that will always respond with status 404

echo '
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: next-upstream-repro
    namespace: default
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: next-upstream-repro
    template:
      metadata:
        labels:
          app: next-upstream-repro
      spec:
        containers:
        - image: nginx
          imagePullPolicy: IfNotPresent
          name: nginx
          ports:
          - containerPort: 80
          volumeMounts:
            - name: conf
              mountPath: /etc/nginx/conf.d
        volumes:
          - name: conf
            configMap:
              name: next-upstream-repro
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: next-upstream-repro
    namespace: default
  spec:
    ports:
      - name: http
        port: 80
        targetPort: 80
        protocol: TCP
    type: ClusterIP
    selector:
      app: next-upstream-repro
  ---
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: next-upstream-repro
    namespace: default
  data:
    default.conf: |
      server {
        listen       80;
        server_name  localhost;

        location = / {
          return 404 "$hostname\n";
        }
      }
' | kubectl apply -f -

Create an ingress which tries next upstream on 404

echo "
  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: next-upstream-repro
    annotations:
      nginx.ingress.kubernetes.io/proxy-next-upstream: 'error http_404 timeout'
  spec:
    ingressClassName: nginx
    rules:
    - host: foo.bar
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: next-upstream-repro
              port:
                name: http
" | kubectl apply -f -

Make many requests in parallel

POD_NAME=$(kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -o name)
kubectl exec -it -n ingress-nginx $POD_NAME -- bash -c "seq 1 200 | xargs -I{} -n1 -P10 curl -H 'Host: foo.bar' localhost"

Observe in the ingress controller's access logs (kubectl logs -n ingress-nginx $POD_NAME) that many requests will have the same upstream in succession in $upstream_addr, e.g.

::1 - - [23/Aug/2024:08:49:42 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.000 [default-next-upstream-repro-http] [] 10.1.254.92:80, 10.1.254.92:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 afa1e1e8964286bd7d1b7664f606bb2f
::1 - - [23/Aug/2024:08:53:21 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.001 [default-next-upstream-repro-http] [] 10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 b753b1828cc200d3c95d6ecbc6ba80e6

Anything else we need to know:
The problem is exacerbated by few (like 2 in the repro case) backend pods being hit by a large request volume concurrently. There is basically a conflict between global load-balancing behaviour and per-request retries at play here. With the default round-robin load balancer, for example, the balancer instance is shared by all requests (on an nginx worker) for a particular backend.

Assuming a system with 2 backend endpoints for the sake of simplicity, the flow of events can be as follows:

  1. Request 1 reaches ingress-nginx, gets routed to endpoint A by the round-robin balancer, waits for the response from the backend
  2. Round-robin balancer state: next endpoint is endpoint B
  3. Request 2 reaches ingress-nginx, gets routed to endpoint B by the round-robin balancer, waits for the response from the backend
  4. Round-robin balancer state: next endpoint is endpoint A
  5. The response from endpoint A fails for request 1; the proxy_next_upstream config requests another endpoint from the load-balancing system, and the round-robin balancer routes the retry to endpoint A again
  6. Round-robin balancer state: next endpoint is endpoint B
  7. Request 3 reaches ingress-nginx, gets routed to endpoint B by the round-robin balancer, waits for the response from the backend
  8. Round-robin balancer state: next endpoint is endpoint A
  9. The response from endpoint B fails for request 2; the proxy_next_upstream config requests another endpoint from the load-balancing system, and the round-robin balancer routes the retry to endpoint A
  10. Responses from all endpoints for requests 1, 2, and 3 succeed

As you can see, this means request 1 is only ever handled by endpoint A despite the proxy_next_upstream directive. Depending on the actual rate and order of requests etc., request 2 could have faced a similar fate, but request 3 came in before the initial response failed, so it happens to work out in that case.

This makes proxy-next-upstream extremely unreliable and causes it to behave in unexpected ways. One approach to fixing this would be to make the Lua-based load balancing aware of which endpoints have already been tried. The semantics are hard to nail down exactly, however, since this might break the guarantees that some load-balancing strategies aim to provide. On the other hand, having the next-upstream choice work reliably at all is invaluable for bridging over requests in a failure scenario. A backend endpoint might become unreachable, which should eventually result in it being removed from the load balancing once probes have caught up to the fact. In the meantime, the default error timeout strategy would try the "next" available upstream for any requests hitting that endpoint, but if everything aligns just right, the load balancer will always return the same endpoint, resulting in a 502 despite the system at large being perfectly capable of handling the request.
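
To make the shared balancer state concrete, here is a minimal, self-contained Lua sketch (a hypothetical illustration only, not the actual ingress-nginx code; the endpoint names and the pick method are invented) of how a round-robin instance shared by all requests on a worker can hand a retrying request the very endpoint it just failed on:

-- Hypothetical shared round-robin state: the pointer advances on every pick,
-- including picks made on behalf of other in-flight requests.
local round_robin = { endpoints = { "A", "B" }, index = 0 }

function round_robin:pick()
  self.index = self.index % #self.endpoints + 1
  return self.endpoints[self.index]
end

print(round_robin:pick()) -- request 1, first attempt        -> A
print(round_robin:pick()) -- request 2, first attempt        -> B
print(round_robin:pick()) -- request 1, retry after A failed -> A again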

@marvin-roesch marvin-roesch added the kind/bug Categorizes issue or PR as related to a bug. label Aug 23, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Aug 23, 2024
@longwuyuan
Contributor

/move-kind bug
/kind support
/triage needs-information

I understand why your chosen reproduction example is to have pods with nginx configured to return 404 on location /.

But that cannot be considered a real-world use case, as real-world workloads don't have all pods configured to return 404.

If you want, you can change the test to a real-world use case where first the pod or pods are returning 200, then introduce an event causing 4XX or 5XX, and so on and so forth. But unless you can post the data like kubectl describe outputs, kubectl logs outputs, curl outputs and responses, etc., others will have to make efforts over and beyond the normal to even triage this issue.

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. triage/needs-information Indicates an issue needs more information in order to work on it. labels Aug 23, 2024
@longwuyuan
Contributor

/remove-kind bug

@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Aug 23, 2024
@marvin-roesch
Author

@longwuyuan While I agree the reproduction example is a bit contrived and not particularly reflective of any real world use case, it is the easiest way to reliably reproduce this issue without overly complicating the test case. The sporadic nature of this issue is why I have opted for such a simplistic approach for reproducing it. If the backend service is acting reliably at all (preventing proxy_next_upstream from having to do anything), the probability of encountering this issue goes down drastically. It doesn't particularly matter that in the end the response still is 404, the access logs I have included clearly demonstrate the issue.

To maybe point you more directly to where the issue lies, as can be seen from my reproduction example, note this access log line that one of my curls produced in the ingress controller's nginx:

::1 - - [23/Aug/2024:08:53:21 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.001 [default-next-upstream-repro-http] [] 10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 b753b1828cc200d3c95d6ecbc6ba80e6

As you can see, the $upstream_addr value is 10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80, so the same endpoint gets used thrice over despite the proxy_next_upstream config. I have omitted the surrounding access log lines that look similar (just with a few different $upstream_addr values) since they just add noise to the fairly random nature of this issue.

I have amended the command for getting the logs from the ingress controller and will happily provide more information, but I think the example I have provided is the minimal reproducible one. The problem happens entirely in the ingress nginx and for any error case that proxy_next_upstream can handle, a 404 is just much simpler to produce than a connection error.

@longwuyuan
Contributor

ok. I am on a learning curve here so please help out with some questions.

Replicas is 2 and both are causing a lookup for the next upstream. Do I have to reproduce this on my own to figure out what happens when there is at least one replica that is returning 200 instead of 404? Does that one not get picked?

@marvin-roesch
Author

marvin-roesch commented Aug 23, 2024

If any of the upstreams that get picked return a non-error response, nginx behaves as expected and ends the retry chain there. Since for any one request, another attempt is only performed in the case there is an error according to the proxy_next_upstream config, the problem lies solely with how the next upstream gets picked by the Lua implementation of load balancing as shipped with and used by the ingress controller by default (https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/balancer.lua being the entry point for this).

The default template configures proxy_next_upstream for up to 3 attempts, which is where the 3 occurrences in $upstream_addr in the access logs come from. Leaving everything else at the default as well (i.e. using round-robin load balancing), manually firing an occasional request is not going to show the issue of the same upstream being used twice in a row, because the global load-balancer state isn't being mutated by other concurrent requests in between attempts. That's why my repro has a command that performs several requests in parallel.

@ZJfans

ZJfans commented Aug 26, 2024

In nginx, fail_timeout and max_fails remove a failed backend for a certain period of time, but the Lua balancer does not have this capability.

proxy_next_upstream_tries 3;
upstream test {
    server 127.0.0.1:8080 fail_timeout=30s max_fails=1; # Server A
    server 127.0.0.1:8081 fail_timeout=30s max_fails=1; # Server B
}

If A fails once, it will be removed for 30 seconds

@marvin-roesch
Author

While experimenting with the upstream-hash-by annotation and accordingly the chash load balancer, I have noticed that the issue is even worse there. If you have proxy_next_upstream configured (as is the default), and an ingress utilizing that load balancer, a different upstream will never be attempted for a single request because the parameters always hash to the same upstream.

This can easily be verified by adding e.g. the nginx.ingress.kubernetes.io/upstream-hash-by: "$request_uri" annotation to my example upstream from the initial issue description. Even for single requests, you can then observe that the $upstream_addr will list the same upstream multiple times over.

The OpenResty load balancers have support for selecting a "next" upstream (although the round-robin implementation is lackluster, as it is the same as the regular balancing logic, so it's subject to the same concurrency problem), e.g. with this implementation for the consistent hashing balancer:
https://github.com/openresty/lua-resty-balancer/blob/1cd4363c0a239afe4765ec607dcfbbb4e5900eea/lib/resty/chash.lua#L319-L324

This has no effect for ingress-nginx, as only the find method will ever be used when load balancing - retry or not - e.g.

function _M.balance(self)
  local key = util.generate_var_value(self.hash_by)
  return self.instance:find(key)
end

I think what needs to happen at the very least is that the usage of these load balancers in ingress-nginx's Lua code retains the index of the last used upstream and then uses the next method rather than find when a retry is being attempted. Otherwise, the default proxy_next_upstream config is just confusing and misleading.
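
As a rough sketch of that idea (assumption-heavy: it presumes the chash instance's find and next both return an endpoint plus its index, and the ngx.ctx field name is invented for illustration), the balance function could remember the index of the endpoint it handed out and advance from it on retries:

-- Sketch only, not the real ingress-nginx implementation.
local util = require("util") -- assumed to be the same helper module used above

function _M.balance(self)
  local ctx = ngx.ctx

  if ctx.chash_last_index then
    -- proxy_next_upstream retry on the same request: skip past the endpoint
    -- that was already tried instead of hashing to the same one again
    local endpoint, index = self.instance:next(ctx.chash_last_index)
    ctx.chash_last_index = index
    return endpoint
  end

  local key = util.generate_var_value(self.hash_by)
  local endpoint, index = self.instance:find(key)
  ctx.chash_last_index = index
  return endpoint
end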

@longwuyuan
Contributor

@marvin-roesch There are no resources like developer time available to allocate for research and triaging. Particularly so for features that are further away from the core Ingress-API specs.

Secondly, you have only 2 pods in your test, both pods are configured to return 404 for ALL requests, and you are expecting the controller to route to the next_upstream and get a 200. If I understand your test correctly, I can't guess how/where/why the controller should route to a new next_upstream (IP address of a pod) when there is no 3rd pod that is healthy or capable of responding with 200.

@marvin-roesch
Author

Again, @longwuyuan, the fact that all pods return 404 should not matter. There might well be a real-world case where a system is in a bad state and thus all upstreams return an error - that is fine for that case and "expected". The problem is that ingress-nginx does not attempt any other upstream at all, it just tries the same one over and over again (in the case of consistent hashing) or whatever the round robin state dictates right now (causing issues with many concurrent requests as exemplified in my initial report). The expected behavior here coming purely from an nginx perspective without any fancy Lua-based load balancing is that a different upstream would be picked.

If a pod is unreachable for long enough that readiness probes start to fail etc, the behavior will be "correct" in a real-life system since the affected pods will just be taken out of the rotation, but for the duration between the requests failing and the readiness probe registering that, the proxy_next_upstream nginx config should work to bridge this gap. That does not apply here, however. As shown by my last comment, it is in fact completely broken for the consistent hashing case.

If you really want me to, I can come up with a reproduction case that does sometimes return a 200, but again, that will only obscure the actual problem and make it harder to reproduce reliably.

The real-world case where we first encountered this was when a high-traffic service had one of its pods crash and we had tons of requests failing despite several other pods for that service still being healthy. Our root cause analysis showed that the ingress was retrying the same upstream IP for the failing requests, but behaved "correctly" (according to proxy_next_upstream) for others. I have opted for a much more simplified minimal reproducible example because getting the circumstances just right otherwise (high enough traffic, having a pod crash at the right time, etc.) is quite difficult.

@longwuyuan
Contributor

Thanks @marvin-roesch, at least this discussion is adding info here that could complete the triaging.

My request now to you is to help me clarify what I understand from your comments. So I am copy/pasting a select set of your words below:

The problem is that ingress-nginx does not attempt any other upstream at all,

Based on those words, I want to draw your attention one more time to my understanding of your test. Both pods of the backend workload return 404. Or, in the real-world scene, all pods do, because the "system is in a bad state and thus all upstreams return an error". The bad-response pods are removed from the list of upstreams. This means that there is no pod that qualifies for or stays in the list of healthy, available upstreams.

This makes me assume that the controller trying the same last upstream 3 times instead of a new upstream is not abnormal. There is no other upstream that the controller can try.

Your expectation is in the words below:

The problem is that ingress-nginx does not attempt any other upstream at all,

I am bothering you so much because I don't see how the controller can attempt any other upstream if there are none in a Ready state.

You can choose to ignore my comments if that is better, because I am not a developer. To summarize: if I can run a 3-replica workload and manually trick 2 of them into failing, we can get proof that there was at least one healthy replica and that the controller should have routed to that healthy replica as the next_upstream. And also, there is a shortage of resources as I mentioned, so community contributors here would be a help.

@longwuyuan
Contributor

Also @marvin-roesch, just so my repeated questions don't come across as odd, I wanted to also clarify that I acknowledge the fact below from you:

but for the duration between the requests failing and the readiness probe registering that, the proxy_next_upstream nginx config should work to bridge this gap

And I am asking my question after reading this fact, as there is no clear smoking-gun data that shows the state of retrying the same upstream pre & post this window of time.

@marvin-roesch
Author

@longwuyuan Thanks for trying to understand this! I'll try my best to answer your questions. I'll be picking out the sentences that I think need addressing, too.

The bad response pods are removed from the list of upstreams. This means that there is no pod that qualifies or stays in the list of healthy available upstreams.

This makes me assume the controller trying the same last upstream 3 time instead of a new upstream is not abnormal. There is no other upstream that the controller can try.

I think you've addressed this yourself in your following comment, but just to reiterate and confirm: Yes, there is no issue if Kubernetes has marked the affected pods as non-Ready and that change has propagated through to the nginx configuration. This can take several seconds though, which can be quite critical for a very high traffic system.

As there is not a clear smoking gun data that shows the state of retrying same upstream pre & post this window of time.

Yep, as outlined above, this is solely an issue for the short amount of time when the pods are in a bad state but the rest of the system has not been made aware of this yet. The ingress controller is working perfectly fine for us once that's the case.

To summarize if I can run a 3 replica workload and manually trick 2 of them to fail, we can get proof that there was at least one replica healthy and the controller should have routed to that healthy replica as next_upstream.

Yep, that would be the way to replicate this that's a little closer to the real-life example we encountered. Always responding with 404s (or any other error that proxy_next_upstream is configured to retry on) is functionally the same, however. The behavior is entirely defined by the configuration of the nginx instance that the ingress controller is running; the parts that are concerned with the Ingress API are not involved.

And also there is shortage of resources as I mentioned, so community contributors here would be a help.

I've been trying my best to point out the pertinent pieces of the code that cause this. The load balancing logic would somehow need to be made aware of which upstreams it has already tried. If using nginx's upstream directive with server instances directly, nginx takes care of all of that, but the Lua implementation completely circumvents that.

I'd be happy to contribute a fix for this, but I'm unsure about what approach to take. It might be easiest to pull what the EWMA load balancer is doing into the others, see the following snippets. I was surprised to find it already takes care of this. It seems this was addressed as part of solving #6632, but the issue still affects all other load balancers.

local tried_endpoints
if not ngx.ctx.balancer_ewma_tried_endpoints then
  tried_endpoints = {}
  ngx.ctx.balancer_ewma_tried_endpoints = tried_endpoints
else
  tried_endpoints = ngx.ctx.balancer_ewma_tried_endpoints
end

tried_endpoints[get_upstream_name(endpoint)] = true
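
For comparison, a minimal sketch of carrying that same per-request bookkeeping over to the round-robin balancer might look like the following (pick_endpoint and the ngx.ctx field name are placeholders, not the actual ingress-nginx API):

-- Sketch only: keep a per-request set of already-tried endpoints in ngx.ctx,
-- mirroring the EWMA approach above, and skip them when retrying.
function _M.balance(self)
  local tried = ngx.ctx.balancer_rr_tried_endpoints
  if not tried then
    tried = {}
    ngx.ctx.balancer_rr_tried_endpoints = tried
  end

  local endpoint = self:pick_endpoint() -- plain round-robin pick (placeholder)
  for _ = 1, #self.endpoints do
    if not tried[get_upstream_name(endpoint)] then
      break
    end
    endpoint = self:pick_endpoint() -- already tried, advance to the next one
  end
  -- if every endpoint has already been tried, the last pick is used as a fallback

  tried[get_upstream_name(endpoint)] = true
  return endpoint
end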

@longwuyuan
Contributor

It seems the summary is:

  • for a high-traffic use case, there is a variable-length window of time between a pod going away and the endpointSlice being updated.

  • this window breaks ESTABLISHED connections, and no new connections are possible during this window, because the pod configured with the IP address in the endpointSlice is not returning 200s anymore

  • this only happens with round-robin load balancing

  • this does not happen with ewma

  • you are interested in contributing a fix

  • one helpful way forward is to try a discussion in the ingress-nginx-dev channel of Slack

  • request that you join the next community meeting, as maintainers are there on the call

PS - Please express your thoughts on just using ewma for now

@marvin-roesch
Author

@longwuyuan That summary seems mostly correct to me, except for one thing: Looking at all the load balancer implementations, this affects the round-robin balancer as well as both consistent hashing ones (chash and chashsubset which will be used when using the upstream-hash-by annotation). EWMA and the sticky session balancers are not affected, but they use different techniques for avoiding previously tried upstreams.

I'll check if EWMA works for our purposes in the meantime. Thanks for pointing me to the Slack, I have joined it and will create a thread to start discussion of potential ways forward. I think aligning all load balancer implementations is the way to go.

@longwuyuan
Contributor

Thanks a lot @marvin-roesch. This is an action-items overview that is getting tracked here, so it helps a lot.

I will copy/paste the summary and add the note you made.

Summary

  • for a high-traffic use case, there is a variable-length window of time between a pod going away and the endpointSlice being updated.

  • this window breaks ESTABLISHED connections, and no new connections are possible during this window, because the pod configured with the IP address in the endpointSlice is not returning 200s anymore

  • this only happens with round-robin load balancing

  • this affects the round-robin balancer as well as both consistent hashing ones (chash and chashsubset, which will be used when using the upstream-hash-by annotation)

  • this does not happen with ewma. EWMA and the sticky session balancers are not affected, but they use different techniques for avoiding previously tried upstreams

  • you are interested in contributing a fix

  • one helpful way forward is to try a discussion in the ingress-nginx-dev channel of Slack

  • request that you join the next community meeting, as maintainers are there on the call

  • you will update if temporary use of ewma works for you

cc @rikatz @tao12345666333 @strongjz @Gacko

@longwuyuan
Contributor

/remove-triage needs-information
/remove-kind support
/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed triage/needs-information Indicates an issue needs more information in order to work on it. kind/support Categorizes issue or PR as a support question. labels Sep 16, 2024