Connection refused randomly for pairs of pods #12681

zack-littke-smith-ai · 2024-06-04T23:31:05Z

What is the issue?

I am running into a difficult-to-reproduce issue where one of our k8s pods stops serving certain clients, producing the following logs in the client proxy:

WARN ThreadId(01) linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

And:

INFO ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}:rescue{client.addr=172.28.187.94:55562}: linkerd_app_core::errors::respond: gRPC request failed error=logical service service-name.namespace.svc.cluster.local:10079: service unavailable error.sources=[service unavailable]

However, during this time the service does successfully connect to other clients and serve their requests without issue. Restarting the clients has no effect, and restarting the service only 'sometimes' helps: it reconnects to some clients but still fails to reconnect to others.

The only 'solution' we've had success with is restarting every single linkerd container and every service that has a proxy, which is not ideal, to say the least.

While I have no solid repro, I'm hoping to at least take away some debugging tips for the next time this happens to us.

How can it be reproduced?

Unfortunately, I have not been able to reliably reproduce it in our own environments.

Logs, error output, etc

Proxy logs from the service:

[ 0.001766s] INFO ThreadId(01) linkerd2_proxy: release 2.210.0 (85db2fc) by linkerd on 2023-09-21T21:24:58Z
[ 0.002498s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.003107s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.003116s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.003118s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.003121s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.003122s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.003124s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.003126s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.019669s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.001800s] INFO ThreadId(01) linkerd2_proxy: release 2.210.0 (85db2fc) by linkerd on 2023-09-21T21:24:58Z
[ 0.002498s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.003148s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.003164s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.003166s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.003168s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.003171s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.003173s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.003175s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.012067s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=default.namespace.serviceaccount.identity.linkerd.cluster.local

Logs from the client proxy are included above.

output of `linkerd check -o short`

---------------
‼ cli is up-to-date
    unsupported version channel: stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    unsupported version channel: stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-6954bdcf79-6p7z5 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-destination-6954bdcf79-df9f2 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-destination-6954bdcf79-jnncs (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-identity-5958cdbd64-gc2qp (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-identity-5958cdbd64-ph8v8 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-identity-5958cdbd64-qsh5m (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-proxy-injector-7664c7cf84-77vl9 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-proxy-injector-7664c7cf84-khhfp (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-proxy-injector-7664c7cf84-xzz9x (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-6954bdcf79-6p7z5 running 3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* grafana-6c4c8b997d-ptswf (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* metrics-api-7d685f8896-f4d52 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* prometheus-dd8b5b7f4-2rsgn (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* tap-59769cd568-7t92z (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* tap-injector-6f987fddf9-f9fs5 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* web-7c6ff5b7d-7tdb6 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    grafana-6c4c8b997d-ptswf running 3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

linkerd_controller: stable-2.14.1
linkerd_debug: stable-2.14.1
linkerd_grafana: stable-2.11.1
linkerd_metrics_api: stable-2.14.1
linkerd_policy_controller: stable-2.14.1
linkerd_proxy: stable-2.14.1
linkerd_proxy_init: v2.2.3
linkerd_tap: stable-2.14.1
linkerd_web: stable-2.14.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None


adleong · 2024-06-05T22:23:47Z

Hi @zack-littke-smith! I'd recommend looking at the full client proxy logs, beyond those two log lines in particular. The Linkerd proxy logs when addresses are added to its load balancers, so the first thing I'd check is whether the correct addresses for service-name.namespace.svc.cluster.local:10079 have been added to the client proxy's load balancer.
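One way to sift a full client proxy log for a single destination is a small filter script. The following is a minimal sketch, assuming log lines have the `[ <seconds>s] LEVEL ThreadId(NN) ...` shape quoted in this thread; the names `parse_lines` and `mentioning` are hypothetical helpers, not part of Linkerd, and lines without a leading timestamp are skipped.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple

# Matches the line shape quoted in this thread, for example:
# "[ 6.059523s] WARN ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}: ..."
LINE_RE = re.compile(
    r"^\[\s*(?P<secs>\d+\.\d+)s\]\s+(?P<level>[A-Z]+)\s+ThreadId\(\d+\)\s+(?P<rest>.*)$"
)


class ProxyLogLine(NamedTuple):
    secs: float  # seconds since the proxy started
    level: str   # INFO, WARN, ...
    rest: str    # span context and message, unparsed


def parse_lines(lines: Iterable[str]) -> Iterator[ProxyLogLine]:
    """Yield parsed lines; anything that does not match the shape is skipped."""
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if m:
            yield ProxyLogLine(float(m["secs"]), m["level"], m["rest"])


def mentioning(lines: Iterable[str], needle: str) -> List[ProxyLogLine]:
    """Return parsed lines whose message mentions `needle` (e.g. an addr)."""
    return [line for line in parse_lines(lines) if needle in line.rest]
```

Feeding it the saved output of the client's proxy container and searching for `10.100.32.3:10079` would narrow the log to lines about that destination.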

zack-littke-smith-ai · 2024-06-05T23:13:11Z

Before we see errors, we have the following client logs:

[ 0.001866s] INFO ThreadId(01) linkerd2_proxy: release 2.210.0 (85db2fc) by linkerd on 2023-09-21T21:24:58Z
[ 0.002681s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.003389s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.003426s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.003430s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.003432s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.003434s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.003436s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.003438s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.015661s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 6.059523s] WARN ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}: linkerd_stack::failfast: Service entering failfast after 3s

// First error here:
[ 6.059608s] INFO ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}:rescue{client.addr=172.28.157.182:50400}: linkerd_app_core::errors::respond: gRPC request failed error=logical service simian-config.namespace.svc.cluster.local:10079: service in fail-fast error.sources=[service in fail-fast]

We also see the following additional failure, which I hadn't noticed before and didn't include above:

[ 89.602769s] WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed
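To compare affected and unaffected clients, it may help to count how often each failure signature appears in a proxy log. This is a rough sketch: the substrings are copied from the log lines quoted in this thread, while the category names and the `classify` and `summarize` helpers are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Optional

# Substrings taken from log lines quoted in this thread.
# The category names on the left are hypothetical labels.
SIGNATURES = {
    "connection_refused": "Connection refused (os error 111)",
    "service_unavailable": "error.sources=[service unavailable]",
    "failfast_entered": "Service entering failfast",
    "failfast_rejected": "error.sources=[service in fail-fast]",
    "channel_closed": "Service failed error=channel closed",
}


def classify(line: str) -> Optional[str]:
    """Return the first matching category for a log line, or None."""
    for name, needle in SIGNATURES.items():
        if needle in line:
            return name
    return None


def summarize(lines: Iterable[str]) -> Counter:
    """Count occurrences of each known failure signature."""
    return Counter(c for c in map(classify, lines) if c is not None)
```

Running `summarize` over the logs of one failing client and one healthy client gives a quick side-by-side view of which failure types are specific to the failing pair.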

adleong · 2024-06-05T23:28:25Z

Ah, the proxy logging that I referred to was added after stable-2.14.1. If you upgrade to a recent edge release, you'll have more informative proxy logging about the state of the load balancer and why the service is entering fail-fast.

zack-littke-smith-ai · 2024-06-05T23:32:40Z

I'll look into getting us more up-to-date and come back around. Is there anything else about our setup that stands out to you as being problematic, or anything else I should turn on in the meantime? This issue is quite rare for us, so circling back here might be slow.

stale · 2024-09-10T02:13:38Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

zack-littke-smith-ai added the bug label Jun 4, 2024

adleong added support and removed bug labels Jun 5, 2024

stale bot added the wontfix label Sep 10, 2024

stale bot closed this as completed Oct 12, 2024
