Linkerd-destination crashing #8235

Anthony-Bible · 2022-04-10T23:03:46Z

What is the issue?

Linkerd-destination is crash looping on a bare-metal cluster, when I tried turning on debug logging there doesn't appear to be much information (attached below). No other pods in linkerd or other namespaces are crashing so it seems isolated with linkerd-destination

How can it be reproduced?

I created a bare metal cluster on Hetzner through Kubespray both terraform and ansible-playbook. And then get the error with both a helm install and using linkerd install | kubectl apply -f

my all.yaml: https://gist.github.com/Anthony-Bible/3e60e1d53d740d181569c82812e6d271

Logs, error output, etc

https://gist.github.com/Anthony-Bible/92d2fc4565081c0469d2f5302c3eb048

output of `linkerd check -o short`

Linkerd core checks
===================

linkerd-existence
-----------------
- pod/linkerd-destination-789cdbc9fd-bstdq container sp-validator is not ready 
Status check results are √
\ pod/linkerd-destination-789cdbc9fd-bstdq container sp-validator is not ready

When the pod is ready, everything returns all green.

Environment

Kubernetes: 1.21.11
ubuntu 20.04
linkerd: 2.11.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

The text was updated successfully, but these errors were encountered:

olix0r · 2022-04-11T19:21:36Z

kubectl describe on the pod or deployment should help make it clearer which container is crashing and potentially why. There's nothing that looks like a crash in the destination controller's log. You might check the policy container.

Anthony-Bible · 2022-04-12T01:46:46Z

kubectl describe on the pod or deployment should help make it clearer which container is crashing and potentially why. There's nothing that looks like a crash in the destination controller's log. You might check the policy container.

I've restarted it a few times and it's not one container that's crashing, sometimes it's destination, other times it's sp-validator and even times where its the majority of policy container crashing. None of them look like they're erroring out:

○ kubectl logs -p -n linkerd linkerd-destination-c8c548fc4-6pnfx policy
2022-04-12T01:31:18.489524Z  INFO serve{addr=0.0.0.0:9990}: linkerd_policy_controller::admin: HTTP admin server listening addr=0.0.0.0:9990
2022-04-12T01:31:18.489550Z  INFO grpc{addr=0.0.0.0:8090 cluster_networks=[10.0.0.0/8, 100.64.0.0/10, 172.16.0.0/12, 192.168.0.0/16]}: linkerd_policy_controller: gRPC server listening addr=0.0.0.0:8090
2022-04-12T01:31:18.489637Z  INFO linkerd_policy_controller: Admission controller server listening addr=0.0.0.0:9443
2022-04-12T01:31:18.506189Z DEBUG linkerd_policy_controller_k8s_index: Ready
2022-04-12T01:37:18.383084Z DEBUG linkerd_policy_controller: Received SIGTERM
2022-04-12T01:37:18.383108Z  INFO linkerd_policy_controller: Shutting down

SP-validator:

○ kubectl logs -p -n linkerd linkerd-destination-c8c548fc4-6pnfx sp-validator 
time="2022-04-12T01:30:08Z" level=info msg="running version stable-2.11.1"
I0412 01:30:08.524393       1 merged_client_builder.go:121] Using in-cluster configuration
time="2022-04-12T01:30:08Z" level=info msg="waiting for caches to sync"
I0412 01:30:08.527949       1 shared_informer.go:270] caches populated
time="2022-04-12T01:30:08Z" level=info msg="caches synced"
time="2022-04-12T01:30:08Z" level=info msg="starting admin server on :9997"
time="2022-04-12T01:30:08Z" level=info msg="listening at :8443"
time="2022-04-12T01:41:48Z" level=info msg="shutting down webhook server"

However I did notice in the linkerd-proxy logs I get logs like this

INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=TLS detection timed out
# And sometimes this:
linkerd-destination-c8c548fc4-6pnfx default-worker-1 linkerd-proxy [   598.954535s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)

I checked the secrets and the tls ones for linked are there.

Anthony-Bible · 2022-04-12T01:48:45Z

Oh and for completeness, the liveness and readiness probes:

  Warning  Unhealthy  15m                    kubelet            Readiness probe failed: Get "http://10.2.96.54:9996/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  15m                    kubelet            Liveness probe failed: Get "http://10.2.96.54:9997/ping": dial tcp 10.2.96.54:9997: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  15m                    kubelet            Readiness probe failed: Get "http://10.2.96.54:9990/ready": dial tcp 10.2.96.54:9990: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  14m (x4 over 15m)      kubelet            Readiness probe failed: Get "http://10.2.96.54:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  10m (x7 over 14m)      kubelet            Liveness probe failed: Get "http://10.2.96.54:9997/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  5m24s (x2 over 5m34s)  kubelet            Liveness probe failed: Get "http://10.2.96.54:9996/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  14s (x31 over 15m)     kubelet            Liveness probe failed: Get "http://10.2.96.54:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

olix0r · 2022-04-12T15:35:48Z

OK, so the controller's probes are timing out. Does this really only happen on the destination controller and no other pods?

You could try running the destination controller's proxy with debug logging... you'll have to reinstall linkerd with the proxy log level set--or you can kubectl edit -n linkerd deploy linkerd-destinationand set the LINKERD2_PROXY_LOG--to linkerd=debug,info. I'd expect more obvious errors in the logs, though, if the proxy was failing in some obvious way, though... I guess the debug logs can at least help us know whether the connections are making it into/out of the proxy.

This could be some interaction with calico? But if that was the case I'd expect errors to other pods.

Perhaps there's a node-level issue? The conntrack table filling up? Are other injected components running on the node successfully?

Anthony-Bible · 2022-04-17T17:44:37Z

I took some time and looked into it. It doesn't look like anything related to Calico or the node. Everything appears to be running on that node with no problems. And the Conntrack table isn't even close to filling up (178 entries), along with everything on the same node appears to be running. The linkerd-proxy logs don't show much except connections are getting closed from the client (?) .

Should linkerd play well with ipvs kube-proxy? I know I was also facing this issue which was due to ipvs and how it adds (external) IPs, but I'm not entirely convinced it would be a similar issue here.

olix0r · 2022-04-19T16:46:57Z

@Anthony-Bible Hm, I'm not sure. I don't think we've tested this configuration explicitly. It's possible the IPVS kube proxy isn't working well with Linkerd's proxy-init iptables rules?

I think your assessment is correct. Looking at the logs we see:

linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005111s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_tls::server: Peeked bytes from TCP stream sz=107
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005186s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_detect: DetectResult protocol=Some(Http1) elapsed=9.097µs
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005224s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_proxy_http::server: Creating HTTP service version=Http1
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005239s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_proxy_http::server: Handling as HTTP version=Http1
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005276s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_app_inbound::policy::authorize::http: Request authorized permit=Permit { dst: OrigDstAddr(0.0.0.0:4191), protocol: Detect { timeout: 10s }, labels: AuthzLabels { server: ServerLabel("default:all-unauthenticated"), authz: "default:all-unauthenticated" } } tls=None(NoClientHello) client=5.161.97.33:60002
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005352s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005370s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_app_core::serve: Connection closed

That is, the admin server gets the probe request and (presumably) when it tries to write the response, the connection has already been closed. Very weird!

Anthony-Bible · 2022-06-02T16:24:11Z

So an update, I had this start happening at work with our cluster on GKE so it's not specific to a provider. Also I did try switching to kube-proxy iptables mode and that didn't help.

VladoPortos · 2022-08-08T18:23:32Z

Same here on fresh installation on k3s, I'm getting:

Readiness probe failed: Get "http://10.42.2.122:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

mkrupczak3 · 2022-10-25T16:17:17Z

Had the same issue during a Linkerd workshop. Using a Civo cluster with 3 nodes

gavinclarkeuk · 2022-11-22T09:29:59Z

We're seeing the same issue on our EKS clusters. Not enough to cause the pods to shut down, but we do get a fairly steady level of noise coming from probe failures

Liveness probe failed: Get "http://10.26.110.84:9996/ping": read tcp 10.26.108.70:58692->10.26.110.84:9996: read: connection reset by peer
Readiness probe failed: Get "http://10.26.110.84:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Liveness probe failed: Get "http://10.26.110.84:9996/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

stale · 2023-02-20T21:05:31Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

Anthony-Bible · 2023-02-21T13:59:28Z

Just tried installing it again with 2.12.4 and still getting the same problem:

  Warning  Unhealthy  79s (x2 over 89s)  kubelet            Liveness probe failed: Get "http://10.2.93.61:9990/live": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  79s                kubelet            Liveness probe failed: Get "http://10.2.93.61:9996/ping": dial tcp 10.2.93.61:9996: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  79s                kubelet            Readiness probe failed: Get "http://10.2.93.61:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  69s (x3 over 89s)  kubelet            Readiness probe failed: Get "http://10.2.93.61:9996/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Anthony-Bible · 2023-02-21T14:07:43Z

@olix0r looks like you guys have pprof, would getting a goroutine output help here, maybe something is hanging so it can't mark itself live?

olix0r · 2023-02-21T15:28:18Z

@Anthony-Bible Thanks for trying a recent version. You may also want to try with an edge release, as there are some CNI/proxy-init related changes that have not yet made it to stable.

I'm skeptical that the controller pprof is going to have anything helpful -- there aren't any indicates from the controller logs that the controller isn't healthy.

I reviewed the logs you had shared and I'm unable to spot anything that looks like the timeouts being reported by kubelet.

To resolve this we might need:

ideally, steps to repro this locally in k3d (which we use for integration testing);
- otherwise, specific configuration/steps to reproduce this in a cloud environment;
proxy logs (with log level linkerd=debug,warn) from a pod that is failing probe requests.
- the output of linkerd diagnostics proxy-metrics -n linkerd <pod> on the same pod
TCP dump output from a pod -- the linkerd debug container may help with this.

jeremychase · 2023-02-23T17:54:44Z

Possibly related to #9798

alpeb · 2023-03-06T21:05:34Z

Also tagging #9602 as related.

alpeb · 2023-03-08T21:11:17Z

@Anthony-Bible @mkrupczak3 @VladoPortos @gavinclarkeuk any chance to provide more detailed info as pointed out by @olix0r above?

Besides that, it seems probe timeouts have been a persistent issue for a long time in k8s for folks under some scenarios. kubernetes/kubernetes#89898 was just recently closed, and that might provide a resolution in an upcoming k8s release.

In the meantime, it appears that by simply setting a higher probe timeout, say 5 seconds (the default is 1), resolves the issue. So if you're still experiencing this, you can try editing the linkerd-destination Deployment and explicitly setting timeoutSeconds: 5 for all the containers probes (also don't forget to use linkerd 2.12.4).

Please let us know if this helps, so we can consider adding that timeout in future releases.

stale · 2023-09-27T05:54:20Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

Anthony-Bible added the bug label Apr 10, 2022

adleong removed the bug label Apr 12, 2022

stale bot added the wontfix label Feb 20, 2023

stale bot removed the wontfix label Feb 21, 2023

olix0r added priority/triage support labels Feb 21, 2023

jeremychase added this to the stable-2.13.0 milestone Feb 23, 2023

jeremychase added priority/P1 Planned for Release bug area/cni and removed priority/triage support area/cni labels Feb 23, 2023

alpeb self-assigned this Mar 6, 2023

kvk24680 mentioned this issue Mar 20, 2023

unable to create linkerd destination pod #10581

Closed

jakub-moravec mentioned this issue Mar 29, 2023

linkerd pods fail to start when following the get started guide #10643

Closed

risingspiral modified the milestones: stable-2.13.0, stable-2.13.x-patches Apr 1, 2023

risingspiral removed this from the stable-2.13.x-patches milestone Jun 28, 2023

stale bot added the wontfix label Sep 27, 2023

stale bot closed this as completed Oct 11, 2023

github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Linkerd-destination crashing #8235

Linkerd-destination crashing #8235

Anthony-Bible commented Apr 10, 2022 •

edited

Loading

olix0r commented Apr 11, 2022

Anthony-Bible commented Apr 12, 2022 •

edited

Loading

Anthony-Bible commented Apr 12, 2022

olix0r commented Apr 12, 2022

Anthony-Bible commented Apr 17, 2022 •

edited

Loading

olix0r commented Apr 19, 2022

Anthony-Bible commented Jun 2, 2022

VladoPortos commented Aug 8, 2022

mkrupczak3 commented Oct 25, 2022

gavinclarkeuk commented Nov 22, 2022

stale bot commented Feb 20, 2023

Anthony-Bible commented Feb 21, 2023

Anthony-Bible commented Feb 21, 2023

olix0r commented Feb 21, 2023

jeremychase commented Feb 23, 2023

alpeb commented Mar 6, 2023

alpeb commented Mar 8, 2023

stale bot commented Sep 27, 2023

Linkerd-destination crashing #8235

Linkerd-destination crashing #8235

Comments

Anthony-Bible commented Apr 10, 2022 • edited Loading

What is the issue?

How can it be reproduced?

Logs, error output, etc

output of linkerd check -o short

Environment

Possible solution

Additional context

Would you like to work on fixing this bug?

olix0r commented Apr 11, 2022

Anthony-Bible commented Apr 12, 2022 • edited Loading

Anthony-Bible commented Apr 12, 2022

olix0r commented Apr 12, 2022

Anthony-Bible commented Apr 17, 2022 • edited Loading

olix0r commented Apr 19, 2022

Anthony-Bible commented Jun 2, 2022

VladoPortos commented Aug 8, 2022

mkrupczak3 commented Oct 25, 2022

gavinclarkeuk commented Nov 22, 2022

stale bot commented Feb 20, 2023

Anthony-Bible commented Feb 21, 2023

Anthony-Bible commented Feb 21, 2023

olix0r commented Feb 21, 2023

jeremychase commented Feb 23, 2023

alpeb commented Mar 6, 2023

alpeb commented Mar 8, 2023

stale bot commented Sep 27, 2023

Anthony-Bible commented Apr 10, 2022 •

edited

Loading

output of `linkerd check -o short`

Anthony-Bible commented Apr 12, 2022 •

edited

Loading

Anthony-Bible commented Apr 17, 2022 •

edited

Loading