Server scaling causes ingress TCP retransmissions, and the client cannot access the server properly. #11508
This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Currently, I'm using …
@ZhengXinwei-F the information you have provided is useful for awareness, but it is not enough to do much analysis or debugging.
Here is my simple demo code: server code:
client:
server.yaml:
hosts:
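As a hypothetical stand-in for the demo (not the reporter's actual code), a minimal Go server that streams a chunked response over a long-lived connection might look like the sketch below; the handler path and port are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// watch streams data to the client indefinitely; because no Content-Length is
// set, net/http sends the response with chunked transfer encoding.
func watch(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	for i := 0; ; i++ {
		select {
		case <-r.Context().Done(): // client went away
			return
		case <-time.After(2 * time.Second):
			fmt.Fprintf(w, "event %d\n", i)
			flusher.Flush()
		}
	}
}

func main() {
	http.HandleFunc("/watch", watch)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```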
Thank you so much for your response. I will pay attention to my information format in future issues. However, I'm particularly curious about why the ingress continues to send TCP requests to the removed pod even after all pods have been deleted.
At a glance, you did NOT test with a non-golang HTTP client such as curl, so there is no data about established, un-terminated connections to a backend. Secondly, I can't find the annotation you mentioned. This means the data and information you are presenting here are not as reliable as you would like them to be. I think you are experiencing a genuine problem, so it's fair to assume you want to ask about the persistent connection and the continued requests here. But you are not putting together the impact of factors like these in a way that other readers would need for analysis. EndpointSlices are used to maintain the list of backend pods, so a lot of intricate debugging steps are involved in answering your question. Please wait for comments from others as well; I have no data to comment on your question.
/remove-kind bug
I think your response is hasty.

/remove-kind support
It's important to note that both the connections between the ingress and the client, and between the ingress and the server, are long-lived HTTP connections using chunked transfer encoding.

The core issue is as follows: when there are only 1-2 instances, the ingress does not attempt to access deleted pods; the problem only appears with 3 or more instances.
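Since the thread revolves around long-lived, keep-alive connections from the client through the ingress, here is a minimal sketch of what such a Go client could look like; the host name, path, and timeout values are illustrative assumptions, not taken from the reporter's setup.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Reuse keep-alive connections to the ingress, but cap how long an idle
	// connection may be reused so that stale connections get re-dialed.
	client := &http.Client{
		Transport: &http.Transport{
			MaxIdleConnsPerHost: 2,
			IdleConnTimeout:     30 * time.Second,
		},
	}

	resp, err := client.Get("http://server.example.com/watch") // hypothetical ingress host
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Read the chunked stream line by line for as long as the server keeps it open.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		log.Println("stream ended with error:", err)
	}
}
```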
OK, sorry. Wait for comments from others who can reproduce the problem.
Thank you again for your prompt response.
I tried to reproduce with the nginx:alpine image and I am not able to reproduce that error message, so I don't think this can be classified as a bug, because there is no reproduction procedure.
I apologize for the incomplete reproduction information. I will provide a more detailed reproduction scenario shortly. Currently, I suspect it might be related to endpointSlice informer delays. I am also attempting to read through the ingress-controller code, but it is not a simple task.
Tagging this issue as a bug is invalid because the basic requirements of a bug report are unmet; there is no data for a developer to accept the triage that scaling is broken in the controller.
I will upgrade the ingress-controller to the latest version to see if this resolves the issue. Additionally, since I am using HTTP chunked transfer over TCP, the scenario is a little complex.
I wanted to get to this later, if the data showed any relevance. But since you have repeatedly mentioned chunking, I have to make it clear that setting chunking in app code is a big problem. That is because v1.10.1 of the controller enables chunking out of the box, as per the latest nginx design. Another user removed the code that set the chunking header from his app (#11162). But let's not go there aimlessly. Just try to scale down from 5 to 1, with a vanilla nginx:alpine image, and gather the same debugging data from controller v1.10.1. Do it on a kind cluster or a minikube cluster so others can replicate your test and reproduce the problem you are reporting, because the controller code will work the same regardless of whether it's your golang app or a simple curl client.
I've been researching this issue for the past few days. During a deployment rollout of the backend server, the ingress indeed continues to route traffic to deleted pods. Code analysis suggests this is related to the ingress list-watching configmaps, ingresses, endpointSlices, and other resources: each event triggers a complete upstream refresh, and this process is serial. Reducing the …

Additionally, there are other issues I'm encountering. The ingress's TCP keep-alive seems to use the kernel configuration inside the container, causing it to wait a long time without sending TCP keep-alive packets. Only the server's TCP keep-alive is functioning, which is ineffective when the server restarts. Furthermore, Kubernetes's client-go watchHandler can hang because of this, and even after waiting well beyond the watchHandler timeout, it still doesn't trigger a relist.

This issue is quite complex and cannot be easily reproduced. It's evidently not just a problem within the … Thanks to @longwuyuan for the assistance. I will change this issue with:

/remove-kind bug
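As a client-side illustration of the keep-alive point above, the following minimal Go sketch configures TCP keep-alive probes and a response-header timeout explicitly instead of relying on kernel defaults; the host and all values are assumptions, not part of the original report.

```go
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

// newClient returns an HTTP client that sets TCP keep-alive probes and a
// response-header timeout explicitly, rather than relying on the kernel's
// (often very long) default keep-alive interval.
func newClient() *http.Client {
	dialer := &net.Dialer{
		Timeout:   5 * time.Second,
		KeepAlive: 15 * time.Second, // send TCP keep-alive probes every 15s
	}
	return &http.Client{
		Transport: &http.Transport{
			DialContext:           dialer.DialContext,
			ResponseHeaderTimeout: 10 * time.Second, // fail fast if the peer stops answering
		},
	}
}

func main() {
	client := newClient()
	// Hypothetical endpoint, used only to exercise the transport settings above.
	resp, err := client.Get("http://server.example.com/watch")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```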
Thank you for the update @ZhengXinwei-F. Slowly, the requirement is shifting toward apps being completely agnostic to platform, infra & network, simply because app code needs to be portable/flexible. You can close the issue, as there are no more questions for the project, and continue the discussion in the closed issue. Open issues just add to the tally. You can reopen it if an action item for the project comes up.
My issue has not been fully resolved, so I need to keep this issue open for a while. I plan to close it in about a week.
OK, thanks for updating that. Cheers.
/close
@ZhengXinwei-F: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What happened:
I have a server providing list-watch capabilities, exposed externally via an ingress-controller. When there are only 1-2 instances of the server, scaling operations cause client requests to fail, but they recover quickly. However, when there are three instances, the client (using client-go) does not break its TCP connection with the ingress (this is issue 1). Additionally, the ingress continues to send TCP SYN packets to the already-deleted pod, resulting in continuous TCP retransmissions. As a result, the client cannot properly access the ingress.
For example:
Ingress-controller logs:
Service "default/server" does not have any active Endpoint.
This indicates that the ingress recognizes there are no active endpoints after a failed request. However, it still attempts to access the already-deleted pod. Packet captures show TCP retransmissions, with the retransmission backoff timer kicking in.
What you expected to happen:
The client can properly access the ingress.
This issue seems to be caused by some form of endpoint caching in the ingress. It does not occur when there are only 1-2 server instances, but it does occur with 3 or more. Could you please help me analyze this issue? Thank you very much.
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
NGINX Ingress controller
Release: v1.6.3
Build: 7ae9ca2
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.21.6
&
NGINX Ingress controller
Release: v0.49.3
Build: git-0a2ec01eb
Repository: https://github.com/kubernetes/ingress-nginx.git
nginx version: nginx/1.20.1
Kubernetes version (use kubectl version):
Environment:
OS: ubuntu
Kernel (uname -a): Linux CN0314000682W 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Please mention how/where the cluster was created (kubeadm/kops/minikube/kind etc.):
kind
kubectl version
kubectl get nodes -o wide
kubectl describe ingressclasses
kubectl -n <ingresscontrollernamespace> get all -A -o wide
kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>