
The liveness probe will fail when the machine’s memory usage is high #10138

Closed
x-coder-L opened this issue May 23, 2024 · 1 comment

x-coder-L commented May 23, 2024

Environmental Info:
K3s Version:

k3s version v1.29.2+k3s1 (86f1021)
go version go1.21.7
Node(s) CPU architecture, OS, and Version:

Linux 5.17.15-1.el8.x86_64 #1 SMP PREEMPT Wed Jun 15 02:07:24 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:

1 server
Describe the bug:

When the machine’s memory usage exceeds 85% (even though there is still sufficient memory for k3s to allocate), pods may fail their liveness probes with the error message ‘Get “http:xxx”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)’. However, testing the same service with curl at the same time returns the correct response.
Steps To Reproduce:

  • Installed K3s:
    k3s.io/node-args: ["server","--kubelet-arg","kube-reserved=memory=2Gi","--kubelet-arg","system-reserved=memory=32Gi","--kubelet-arg","sync-frequency=1s","--kube-apiserver-arg","event-ttl=48h0m0s","--flannel-backend","none","--node-name","localhost","--disable-helm-controller"]
    Reproduce:
    When a pod with a QoS class of BestEffort consumes a large amount of memory, pushing the machine’s memory usage above 85% without triggering the k3s eviction conditions or reaching the k3s OOM limit, pods using a liveness probe fail with the error ‘Get “http:xxx”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)’.
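The reproduction above can be sketched as two minimal manifests (hypothetical names; the image, allocation size, and probe values are assumptions, not taken from the report): a BestEffort pod with no requests or limits that steadily allocates memory, next to an ordinary probed workload.

```yaml
# Hypothetical reproduction sketch -- names, image, and sizes are assumptions.
# 1) A BestEffort memory hog: no requests/limits, so nothing caps its usage
#    until node-level eviction thresholds or the OOM killer kick in.
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
  - name: stress
    image: polinux/stress          # any image that ships the `stress` tool
    command: ["stress", "--vm", "1", "--vm-bytes", "48G", "--vm-hang", "0"]
---
# 2) A probed workload. Once node memory pressure builds, the kubelet's HTTP
#    probe client can time out even though the service still answers curl.
apiVersion: v1
kind: Pod
metadata:
  name: probed-app
spec:
  containers:
  - name: app
    image: nginx
    livenessProbe:
      httpGet:
        path: /
        port: 80
      timeoutSeconds: 1            # the default; short timeouts surface first
      periodSeconds: 10
```

With this in place, `kubectl get events --field-selector reason=Unhealthy -w` should show probe timeouts like the ones quoted below once node memory usage climbs past ~85%.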
Expected behavior:

The error ‘Get “http:xxx”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)’ should not appear when the machine’s memory usage exceeds 85% but the k3s eviction conditions are not triggered and the k3s OOM limit is not reached.
Additional context / logs:

3h25m       Warning   Unhealthy                pod/metrics-server-67c658944b-rt25v                              Readiness probe failed: Get "https://10.42.0.20:10250/readyz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3h24m       Warning   Unhealthy                pod/metrics-server-67c658944b-rt25v                              Liveness probe failed: Get "https://10.42.0.20:10250/livez": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
3h17m       Warning   Unhealthy                pod/metrics-server-67c658944b-rt25v                              Liveness probe failed: Get "https://10.42.0.20:10250/livez": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
3h17m       Warning   Unhealthy                pod/metrics-server-67c658944b-rt25v                              Readiness probe failed: Get "https://10.42.0.20:10250/readyz": context deadline exceeded
3h17m       Warning   Unhealthy                pod/metrics-server-67c658944b-rt25v                              Readiness probe failed: Get "https://10.42.0.20:10250/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
3h17m       Warning   Unhealthy                pod/metrics-server-67c658944b-rt25v                              Readiness probe failed: Get "https://10.42.0.20:10250/readyz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
3h17m       Warning   Unhealthy                pod/coredns-5f4f9b8989-gxk68                                     Liveness probe failed: Get "http://10.42.0.2:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
3h17m       Warning   Unhealthy                pod/coredns-5f4f9b8989-gxk68                                     Readiness probe failed: Get "http://10.42.0.2:8181/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
@brandond
Contributor

brandond commented May 23, 2024

I don't see that this is an issue with k3s itself, or something that we can fix in this project. I'm not sure what we're supposed to do if the node lacks sufficient resources such that the workload becomes unresponsive, or the kubelet is unable to complete the request in a timely manner due to resource contention with other processes.

Do you have swap enabled on your node? Is it perhaps thrashing on swap, making it look like there's more memory available than you actually have?
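A quick way to check the swap question is to read procfs directly (standard Linux interfaces; the exact fields assume a typical `/proc/meminfo` layout):

```shell
# Is swap configured, and how much is free? Zero SwapTotal means no swap.
grep -E 'SwapTotal|SwapFree|MemAvailable' /proc/meminfo

# List active swap devices, if any (empty body means swap is off).
cat /proc/swaps

# Look for kernel memory-pressure / OOM activity around the probe failures.
# (dmesg may require privileges; suppress errors if unavailable.)
dmesg 2>/dev/null | grep -iE 'oom|out of memory' | tail -n 20 || true
```

If `SwapTotal` is nonzero and `SwapFree` is shrinking while probes fail, the node is likely thrashing on swap, which would explain memory appearing "available" while HTTP probes still time out.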
