This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Liveness probe makes debugging very hard without apparent benefit #3417

Closed
annismckenzie opened this issue Sep 27, 2018 · 6 comments

Comments

@annismckenzie

The Kubernetes DaemonSet uses a liveness probe to restart failed Weave pods, which makes debugging issues with Weave almost impossible. We've since switched it over to a readiness probe: the Weave pod simply stays out of the ready state, and an operator can then go into the Weave container and see what's wrong (weave --local status and all of those).
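
For illustration, this is roughly what debugging looks like once the pod is only marked not ready. The label, pod name, and path to the weave script below are examples from our setup and may differ in yours:

$ kubectl -n kube-system get pods -l name=weave-net
$ kubectl -n kube-system exec -it weave-net-xxxxx -c weave -- /home/weave/weave --local status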

My question is: is there something I'm missing about why a liveness probe was chosen?

What happened?

Failing Weave containers are continuously being reaped by the Kubelet, making it impossible to debug a problem.

Anything else we need to know?

No, this isn't a question specific to any cloud provider or hardware, just Weave on K8s using the DaemonSet.

Versions:

$ weave version
2.4.1

$ docker version
Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e32
 Built:        Fri Mar 24 00:39:57 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e32
 Built:        Fri Mar 24 00:39:57 2017
 OS/Arch:      linux/amd64
 Experimental: false

$ uname -a
Linux 4.9.0-7-amd64 #1 SMP Debian 4.9.110-1 (2018-07-05) x86_64 GNU/Linux

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:17:28Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
@murali-reddy
Contributor

Liveness and readiness probes serve different purposes, so it's not a question of one over the other.

The Kubernetes DaemonSet uses a liveness probe to restart failed Weave pods which makes debugging issues with Weave almost impossible

Could you please elaborate on what issue you are running into? Are you not able to get to the logs of the previous pods and check for any errors related to why the liveness check failed?
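
For example, something along these lines (the pod name is illustrative) should still show the logs of the container instance that the liveness check killed:

$ kubectl -n kube-system logs weave-net-xxxxx -c weave --previous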

@bboreham
Contributor

is there something I'm missing about why a liveness probe was chosen?

Probably not.

Conceptually there are corner cases where the process can get wedged and re-running the whole startup will un-wedge it. But I can’t recall seeing one of those in the years this thing has been live.

FWIW we had to raise the liveness timeout just last week to debug one install. We should remove it.

Since we don’t have a Service I’m not aware of any practical benefit of a readiness probe, but you’re right it could be useful in telling an operator that something needs looking into.
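
Until it's removed from the manifest, something like this – untested, and assuming the weave container is the first container of a DaemonSet called weave-net in kube-system – should strip the probe from a running install:

$ kubectl -n kube-system patch daemonset weave-net --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'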

@annismckenzie
Author

Hey Brian. We got to talk at KubeCon in May this year. The issue I was seeing (@murali-reddy) was that the Weave container couldn't resolve the API server via the kubernetes service at 10.96.0.1 – you and Brian also weighed in on that issue: #3363 (#3363 (comment) and the end of the thread). Debugging that problem was hellish because of the liveness probe. We're actually running our clusters with a DaemonSet where the liveness probe is switched to a readiness probe via a sed replacement, roughly as sketched below… 🙈
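
Something along these lines, against a locally saved copy of the manifest (the file name is just an example, and the probe fields may need more adjustment than a plain key rename):

$ sed 's/livenessProbe:/readinessProbe:/' weave-daemonset.yaml | kubectl apply -f -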

@annismckenzie
Author

Also see my search: https://github.com/weaveworks/weave/search?q=10.96.0.1&type=Issues. These are all impacted by the liveness probe and can't really be debugged with it on.

@bboreham
Contributor

Hello again!

Sorry about that. Just one of these things you (re-)learn from experience.

@dlespiau
Contributor

Ah! We also stumbled upon this recently; details are in the PR: #3421.

I like the idea of morphing the livenessProbe into a readiness one to still surface that something is wrong. Will change the PR accordingly.
