
v1.7.0-beta.0 talosctl health fails due to temporary error connect: connection refused #8552

Closed
Tracked by #8549
rgl opened this issue Apr 5, 2024 · 2 comments · Fixed by #8560 or #8605
rgl (Contributor) commented Apr 5, 2024

Bug Report

Description

While trying the new v1.7.0-beta.0 release, I noticed that the talosctl health command seems to have a regression relative to v1.6.7. I think it should ignore temporary errors and only return once the cluster is healthy.

I'm not sure if this is related to spurious 'Connection closing' errors in integration tests mentioned in #8549.
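
To illustrate what I mean, here's a minimal sketch in Go (hypothetical names like checkAPIDReady and waitForAPID; this is not the actual Talos health-check code) of the behavior I'd expect: treat a gRPC Unavailable error as a temporary condition and keep retrying until the context deadline, instead of failing the whole health run:

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkAPIDReady stands in for a single readiness probe against one node
// (hypothetical; the real check would dial apid on port 50000).
func checkAPIDReady(ctx context.Context, node string) error {
	// stub: pretend the node is still booting
	return status.Error(codes.Unavailable, "connect: connection refused")
}

// waitForAPID retries the probe, treating Unavailable as temporary, and only
// gives up when the context expires or a non-temporary error occurs.
func waitForAPID(ctx context.Context, node string) error {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for {
		err := checkAPIDReady(ctx, node)
		if err == nil {
			return nil
		}
		if status.Code(err) != codes.Unavailable {
			return err // not a temporary error: fail the health check
		}
		fmt.Printf("waiting for apid to be ready: %v\n", err)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
	defer cancel()

	if err := waitForAPID(ctx, "10.17.3.20"); err != nil {
		fmt.Println("healthcheck error:", err)
	}
}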

Logs

Immediately after launching the cluster with terraform, calling talosctl health consistently fails with the following error:

# talosctl -e 10.17.3.10 -n 10.17.3.10 health --control-plane-nodes 10.17.3.10 --worker-nodes 10.17.3.20
discovered nodes: ["10.17.3.10" "10.17.3.20"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: 1 error occurred:
	* 10.17.3.10: service "etcd" not in expected state "Running": current state [Preparing] Running pre state
waiting for etcd to be healthy: 1 error occurred:
	* 10.17.3.10: service is not healthy: etcd
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: 1 error occurred:
	* 10.17.3.20: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.17.3.20:50000: connect: connection refused"
healthcheck error: rpc error: code = Canceled desc = grpc: the client connection is closing

FWIW, after manually waiting for the cluster to be actually healthy, calling talosctl health works as expected.

For comparison, here's the output of v1.6.7, which also shows that error but ignores it:

discovered nodes: ["10.17.3.10" "10.17.3.20"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: 1 error occurred:
	* 10.17.3.10: service "etcd" not in expected state "Running": current state [Preparing] Running pre state
waiting for etcd to be healthy: 1 error occurred:
	* 10.17.3.10: service is not healthy: etcd
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: 1 error occurred:
	* 10.17.3.20: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.17.3.20:50000: connect: connection refused"
waiting for apid to be ready: OK
waiting for all nodes memory sizes: ...
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: ...
waiting for all nodes disk sizes: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: ...
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: ...
waiting for all k8s nodes to report: can't find expected node with IPs ["10.17.3.10"]
waiting for all k8s nodes to report: OK
waiting for all k8s nodes to report ready: ...
waiting for all k8s nodes to report ready: some nodes are not ready: [c0 w0]
waiting for all k8s nodes to report ready: some nodes are not ready: [w0]
waiting for all k8s nodes to report ready: OK
waiting for all control plane static pods to be running: ...
waiting for all control plane static pods to be running: OK
waiting for all control plane components to be ready: ...
waiting for all control plane components to be ready: expected number of pods for kube-apiserver to be 1, got 0
waiting for all control plane components to be ready: OK
waiting for kube-proxy to report ready: ...
waiting for kube-proxy to report ready: SKIP
waiting for coredns to report ready: ...
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: ...
waiting for all k8s nodes to report schedulable: OK

Environment

  • Talos version:
Client:
	Tag:         v1.7.0-beta.0
	SHA:         78f97137
	Built:       
	Go version:  go1.22.2
	OS/Arch:     linux/amd64
Server:
	NODE:        10.17.3.10
	Tag:         v1.7.0-beta.0
	SHA:         78f97137
	Built:       
	Go version:  go1.22.2
	OS/Arch:     linux/amd64
	Enabled:     RBAC
  • Kubernetes version:
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3
  • Platform: nocloud in libvirt

The full terraform program is at https://github.com/rgl/terraform-libvirt-talos/tree/upgrade-to-talos-1.7.0-beta.0.

smira self-assigned this Apr 8, 2024
smira added a commit to smira/talos that referenced this issue Apr 12, 2024
Fixes siderolabs#8552

When `apid` notices an update in the PKI, it flushes its client connections
to other machines (used for proxying), as it might need to use a new
client certificate.

While flushing, just calling `Close` might abort already running
connections.

So instead, try to close gracefully with a timeout when the connection
is idle.

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit 336e611)
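
For context, a minimal sketch (assuming the standard grpc-go connectivity APIs; the package and function names are hypothetical, and this is not the actual Talos code) of the graceful-close idea described in the commit: wait, with a timeout, for the client connection to go idle before closing it.

package apidproxy

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// gracefulClose waits up to timeout for conn to become idle before closing it,
// so in-flight proxied RPCs are not aborted. If the connection never goes
// idle, it is closed anyway once the timeout expires.
func gracefulClose(conn *grpc.ClientConn, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	for {
		state := conn.GetState()
		if state == connectivity.Idle || state == connectivity.Shutdown {
			break
		}

		// WaitForStateChange returns false when ctx expires;
		// in that case stop waiting and close regardless.
		if !conn.WaitForStateChange(ctx, state) {
			break
		}
	}

	return conn.Close()
}
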
rgl (Contributor, Author) commented Apr 13, 2024

@smira, I've just tried 1.7.0-beta.1 and the error reported in this issue is still happening.

smira (Member) commented Apr 15, 2024

yep, I saw that in the integration tests as well, probably the fix is not complete.

smira reopened this Apr 15, 2024
smira added a commit to smira/talos that referenced this issue Apr 16, 2024
Fixes siderolabs#8552

This fixes up the previous fix, where the `for` condition was inverted, and
also updates the idle timeout so that the transition to idle happens
before the timeout expires.

Signed-off-by: Andrey Smirnov <[email protected]>
smira added a commit to smira/talos that referenced this issue Apr 19, 2024
Fixes siderolabs#8552

This fixes up the previous fix, where the `for` condition was inverted, and
also updates the idle timeout so that the transition to idle happens
before the timeout expires.

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit 5d07ac5)
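
To make the follow-up concrete, here's a before/after sketch of an inverted wait-for-idle loop versus the corrected one (an illustration of the kind of bug the commit describes, with hypothetical names, not the actual Talos diff). The other half of the fix is presumably ensuring the gRPC idle timeout is shorter than the close timeout, so the connection can actually reach Idle before the graceful close gives up.

package apidproxy

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// Buggy: the condition is inverted, so the loop exits immediately whenever the
// connection is busy, and the caller closes it right away, aborting in-flight RPCs.
func waitUntilIdleBuggy(ctx context.Context, conn *grpc.ClientConn) {
	for conn.GetState() == connectivity.Idle { // should be !=
		if !conn.WaitForStateChange(ctx, conn.GetState()) {
			return
		}
	}
}

// Fixed: keep waiting while the connection is NOT idle (or until ctx expires),
// then let the caller close it.
func waitUntilIdleFixed(ctx context.Context, conn *grpc.ClientConn) {
	for state := conn.GetState(); state != connectivity.Idle && state != connectivity.Shutdown; state = conn.GetState() {
		if !conn.WaitForStateChange(ctx, state) {
			return // timed out waiting: close anyway
		}
	}
}
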
github-actions bot locked as resolved and limited conversation to collaborators Jun 16, 2024