Connection refused randomly for pairs of pods #12681
Comments
Hi @zack-littke-smith! I'd recommend looking at the full client proxy logs, beyond just those two log lines. The Linkerd proxy will log when addresses are added to its load balancers, so the first thing I'd look into is whether the correct addresses for …
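To follow this suggestion, a minimal sketch of pulling the full client proxy logs and filtering for balancer/failfast activity might look like the following. The pod name and the sample log lines are placeholders, not real Linkerd output; only the `kubectl logs -c linkerd-proxy` invocation shape is from the standard sidecar setup.

```shell
# In a live cluster you would capture the whole proxy container log,
# not just the error lines, with something like:
#   kubectl logs "$CLIENT_POD" -c linkerd-proxy --timestamps > proxy.log
# Then filter for the load-balancer/failfast activity the maintainer
# mentions. Demonstrated here on placeholder sample lines:
printf '%s\n' \
  'WARN outbound: failfast (placeholder sample line)' \
  'INFO outbound: connected (placeholder sample line)' \
  | grep -Ei 'failfast|fail-fast|balancer|endpoint|discover'
```

The broad `grep` pattern is intentional: the exact message wording varies between proxy versions, so matching on the component names is more robust than matching full lines.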
Before we see errors, we have the following client logs:
We also see the following additional failures which I didn't notice before and didn't link above:
Ah, the proxy logging that I referred to was added after stable-2.14.1. If you upgrade to a recent edge release, you'll get more informative proxy logging about the state of the load balancer and why the service is entering fail-fast.
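For reference, a hedged sketch of moving to a recent edge release, based on Linkerd's documented install/upgrade flow (verify the exact commands against the current docs for your target release before running them):

```shell
# These steps need a cluster and network access, so they are shown as
# comments rather than executed here:
#   curl -sL https://run.linkerd.io/install-edge | sh   # install the edge CLI
#   linkerd version --client                            # confirm the CLI version
#   linkerd upgrade | kubectl apply --prune -f -        # upgrade the control plane
# Existing pods keep their old proxy until restarted, for example:
echo 'kubectl rollout restart deployment -n <app-namespace>'
```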
I'll look into getting us more up-to-date and come back around. Is there anything else about our setup that stands out to you as problematic, or anything else I should turn on in the meantime? This issue is quite rare for us, so cycling back here might be slow.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
What is the issue?
I am running into a really difficult-to-reproduce issue where our k8s pod somehow decides it will not serve certain clients, producing the following logs in the client proxy:
And:
However, during this time the service does successfully connect to other clients and serve their requests; the failures are client-specific. Restarting the clients has no effect, and restarting the service sometimes helps, resulting in reconnection to some clients but failure to reconnect to others.
The only 'solution' we've had success with is restarting every single Linkerd container and every meshed service, which is not ideal, to say the least.
While I have no solid repro, I'm hoping to at least take away some debugging tips for the next time this happens to us.
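One debugging knob worth preparing in advance is raising the proxy log level so the next occurrence captures discovery/balancer detail. The `config.linkerd.io/proxy-log-level` annotation and the `warn,linkerd=debug` value come from Linkerd's documentation; the deployment name below is hypothetical.

```shell
# In a cluster you would apply the annotation like this (pods pick it up
# on restart, and debug logging is verbose, so scope it to the affected
# workload):
#   kubectl annotate deployment/my-client \
#     config.linkerd.io/proxy-log-level='warn,linkerd=debug' --overwrite
# Build and print the annotation string so it is easy to copy:
ANNOTATION='config.linkerd.io/proxy-log-level'
LEVEL='warn,linkerd=debug'
echo "${ANNOTATION}=${LEVEL}"
```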
How can it be reproduced?
Unfortunately, I have not been able to reproduce this reliably in our own environments.
Logs, error output, etc
Proxy logs from the service:
Logs from the client proxy are included above.
Output of `linkerd check -o short`:
Environment
linkerd_controller: stable-2.14.1
linkerd_debug: stable-2.14.1
linkerd_grafana: stable-2.11.1
linkerd_metrics_api: stable-2.14.1
linkerd_policy_controller: stable-2.14.1
linkerd_proxy: stable-2.14.1
linkerd_proxy_init: v2.2.3
linkerd_tap: stable-2.14.1
linkerd_web: stable-2.14.1
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None