Connection refused randomly for pairs of pods #12681

Closed
zack-littke-smith-ai opened this issue Jun 4, 2024 · 5 comments


zack-littke-smith-ai commented Jun 4, 2024

What is the issue?

I am running into a difficult-to-reproduce issue where one of our k8s pods stops serving certain clients. The client proxy logs the following:

WARN ThreadId(01) linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

And:

INFO ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}:rescue{client.addr=172.28.187.94:55562}: linkerd_app_core::errors::respond: gRPC request failed error=logical service service-name.namespace.svc.cluster.local:10079: service unavailable error.sources=[service unavailable]

However, during this time the service does successfully connect to other clients and serves their requests without issue. Restarting the clients has no effect, and restarting the service only sometimes helps: it reconnects to some clients but still fails to reconnect to others.

The only 'solution' we've had success with is restarting every single linkerd container and every service that has a proxy, which is not ideal to say the least.

While I have no solid repro, I'm hoping to at least take away some debugging tips for the next time this happens to us.

How can it be reproduced?

Unfortunately, I have not been able to reliably reproduce it in our own environments.

Logs, error output, etc

Proxy logs from the service:

[ 0.001766s] INFO ThreadId(01) linkerd2_proxy: release 2.210.0 (85db2fc) by linkerd on 2023-09-21T21:24:58Z
[ 0.002498s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.003107s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.003116s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.003118s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.003121s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.003122s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.003124s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.003126s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.019669s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.001800s] INFO ThreadId(01) linkerd2_proxy: release 2.210.0 (85db2fc) by linkerd on 2023-09-21T21:24:58Z
[ 0.002498s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.003148s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.003164s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.003166s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.003168s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.003171s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.003173s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.003175s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.012067s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=default.namespace.serviceaccount.identity.linkerd.cluster.local

Logs from the client proxy included above

output of linkerd check -o short

---------------
‼ cli is up-to-date
    unsupported version channel: stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    unsupported version channel: stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-6954bdcf79-6p7z5 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-destination-6954bdcf79-df9f2 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-destination-6954bdcf79-jnncs (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-identity-5958cdbd64-gc2qp (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-identity-5958cdbd64-ph8v8 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-identity-5958cdbd64-qsh5m (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-proxy-injector-7664c7cf84-77vl9 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-proxy-injector-7664c7cf84-khhfp (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-proxy-injector-7664c7cf84-xzz9x (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-6954bdcf79-6p7z5 running 3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* grafana-6c4c8b997d-ptswf (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* metrics-api-7d685f8896-f4d52 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* prometheus-dd8b5b7f4-2rsgn (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* tap-59769cd568-7t92z (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* tap-injector-6f987fddf9-f9fs5 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* web-7c6ff5b7d-7tdb6 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    grafana-6c4c8b997d-ptswf running 3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

linkerd_controller: stable-2.14.1
linkerd_debug: stable-2.14.1
linkerd_grafana: stable-2.11.1
linkerd_metrics_api: stable-2.14.1
linkerd_policy_controller: stable-2.14.1
linkerd_proxy: stable-2.14.1
linkerd_proxy_init: v2.2.3
linkerd_tap: stable-2.14.1
linkerd_web: stable-2.14.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

adleong (Member) commented Jun 5, 2024

Hi @zack-littke-smith! I'd recommend looking at the full client proxy logs, beyond those two log lines in particular. The Linkerd proxy logs when addresses are added to its load balancers, so the first thing I'd look into is whether the correct addresses for service-name.namespace.svc.cluster.local:10079 have been added to the client proxy's load balancer.
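For anyone following along, here is a minimal sketch of one way to sift through a full proxy log dump rather than eyeballing it. It is not part of Linkerd. The line layout (optional `[ <uptime>s]` prefix, level, `ThreadId(NN)`, then the rest) is an assumption inferred only from the log lines quoted in this thread, and `parse_line` and `interesting` are hypothetical helper names.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class ProxyLogLine(NamedTuple):
    uptime: Optional[float]  # seconds since proxy start, None if not present
    level: str
    thread: str
    rest: str                # span context and message, left unparsed


# Assumption: layout inferred from the lines pasted in this issue, e.g.
# "[ 6.059523s] WARN ThreadId(01) outbound:proxy{...}: ...".
# The uptime prefix is optional because some quoted lines omit it.
LINE_RE = re.compile(
    r"^(?:\[\s*(?P<uptime>\d+\.\d+)s\]\s+)?"
    r"(?P<level>TRACE|DEBUG|INFO|WARN|ERROR)\s+"
    r"ThreadId\((?P<thread>\d+)\)\s+"
    r"(?P<rest>.*)$"
)


def parse_line(line: str) -> Optional[ProxyLogLine]:
    """Parse one proxy log line; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    uptime = match.group("uptime")
    return ProxyLogLine(
        uptime=float(uptime) if uptime is not None else None,
        level=match.group("level"),
        thread=match.group("thread"),
        rest=match.group("rest"),
    )


def interesting(lines: Iterable[str]) -> List[ProxyLogLine]:
    """Keep WARN/ERROR lines and any line that carries an `error=` field."""
    kept = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        if record.level in ("WARN", "ERROR") or "error=" in record.rest:
            kept.append(record)
    return kept
```

The input would typically be the saved output of the client pod's `linkerd-proxy` container logs, passed in as an iterable of lines. Keeping the INFO lines that carry `error=` matters here because the gRPC failures quoted above are logged at INFO, not WARN.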

zack-littke-smith-ai (Author) commented Jun 5, 2024

Before we see errors, we have the following client logs:

[ 0.001866s] INFO ThreadId(01) linkerd2_proxy: release 2.210.0 (85db2fc) by linkerd on 2023-09-21T21:24:58Z
[ 0.002681s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.003389s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.003426s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.003430s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.003432s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.003434s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.003436s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.003438s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.015661s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 6.059523s] WARN ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}: linkerd_stack::failfast: Service entering failfast after 3s

// First error here:
[ 6.059608s] INFO ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}:rescue{client.addr=172.28.157.182:50400}: linkerd_app_core::errors::respond: gRPC request failed error=logical service simian-config.namespace.svc.cluster.local:10079: service in fail-fast error.sources=[service in fail-fast]

We also see the following additional failure, which I didn't notice before and didn't include above:

[ 89.602769s] WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed
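
Since the failures only hit some client/service pairs, it may help to tally failure lines per outbound target when comparing a healthy client with a broken one. The following is a rough sketch, not a Linkerd tool. It assumes the `outbound:proxy{addr=...}` span format seen in the lines above, the marker strings are copied from the failures quoted in this thread, and `failures_by_addr` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: outbound spans look like `outbound:proxy{addr=IP:PORT}`,
# as in the log lines quoted in this thread.
ADDR_RE = re.compile(r"outbound:proxy\{addr=(?P<addr>[^}]+)\}")

# Substrings taken from the failures quoted in this thread.
FAILURE_MARKERS = (
    "failfast",
    "fail-fast",
    "service unavailable",
    "Connection refused",
    "channel closed",
)

NO_SPAN = "(no outbound span)"


def failures_by_addr(lines: Iterable[str]) -> Counter:
    """Count failure-looking log lines, keyed by outbound target address."""
    counts: Counter = Counter()
    for line in lines:
        if not any(marker in line for marker in FAILURE_MARKERS):
            continue
        match = ADDR_RE.search(line)
        # Lines such as `linkerd_reconnect: ...` carry no outbound span.
        counts[match.group("addr") if match else NO_SPAN] += 1
    return counts
```

Running this over the logs of a failing client and a working client of the same service would show whether both are targeting the same address (here `10.100.32.3:10079`) and how many failures each one records against it.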

adleong (Member) commented Jun 5, 2024

Ah, the proxy logging that I referred to was added after stable-2.14.1. If you upgrade to a recent edge release, you'll have more informative proxy logging about the state of the load balancer and why the service is entering fail-fast.

zack-littke-smith-ai (Author) commented

I'll look into getting us more up to date and circle back. Is there anything else about our setup that stands out to you as problematic, or anything else I should turn on in the meantime? This issue is quite rare for us, so reporting back here might be slow.


stale bot commented Sep 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Sep 10, 2024
stale bot closed this as completed Oct 12, 2024