
v1.7.0-beta.0 talosctl health fails due to temporary error connect: connection refused #8552

Closed
Tracked by #8549
rgl opened this issue Apr 5, 2024 · 2 comments · Fixed by #8560 or #8605
rgl (Contributor) commented Apr 5, 2024

Bug Report

Description

While trying the new v1.7.0-beta.0 release, I noticed that the talosctl health command seems to have a regression relative to v1.6.7. I think it should ignore temporary errors and only return once the cluster is healthy.

I'm not sure if this is related to spurious 'Connection closing' errors in integration tests mentioned in #8549.
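
To illustrate what I mean, here's a minimal sketch in Go (hypothetical names like checkAPIDReady and waitForAPID; this is not the actual Talos health-check code) of the behavior I'd expect: treat a gRPC Unavailable error as a temporary condition and keep retrying until the context deadline, instead of failing the whole health run:

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkAPIDReady stands in for a single readiness probe against one node
// (hypothetical; the real check would dial apid on port 50000).
func checkAPIDReady(ctx context.Context, node string) error {
	// stub: pretend the node is still booting
	return status.Error(codes.Unavailable, "connect: connection refused")
}

// waitForAPID retries the probe, treating Unavailable as temporary, and only
// gives up when the context expires or a non-temporary error occurs.
func waitForAPID(ctx context.Context, node string) error {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for {
		err := checkAPIDReady(ctx, node)
		if err == nil {
			return nil
		}
		if status.Code(err) != codes.Unavailable {
			return err // not a temporary error: fail the health check
		}
		fmt.Printf("waiting for apid to be ready: %v\n", err)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
	defer cancel()

	if err := waitForAPID(ctx, "10.17.3.20"); err != nil {
		fmt.Println("healthcheck error:", err)
	}
}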

Logs

Immediately after launching the cluster with terraform, calling talosctl health consistently fails with the following error:

# talosctl -e 10.17.3.10 -n 10.17.3.10 health --control-plane-nodes 10.17.3.10 --worker-nodes 10.17.3.20
discovered nodes: ["10.17.3.10" "10.17.3.20"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: 1 error occurred:
	* 10.17.3.10: service "etcd" not in expected state "Running": current state [Preparing] Running pre state
waiting for etcd to be healthy: 1 error occurred:
	* 10.17.3.10: service is not healthy: etcd
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: 1 error occurred:
	* 10.17.3.20: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.17.3.20:50000: connect: connection refused"
healthcheck error: rpc error: code = Canceled desc = grpc: the client connection is closing

FWIW, after manually waiting for the cluster to be actually healthy, calling talosctl health works as expected.

For comparison, here's the output of v1.6.7, which also shows that error but ignores it:

discovered nodes: ["10.17.3.10" "10.17.3.20"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: 1 error occurred:
	* 10.17.3.10: service "etcd" not in expected state "Running": current state [Preparing] Running pre state
waiting for etcd to be healthy: 1 error occurred:
	* 10.17.3.10: service is not healthy: etcd
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: 1 error occurred:
	* 10.17.3.20: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.17.3.20:50000: connect: connection refused"
waiting for apid to be ready: OK
waiting for all nodes memory sizes: ...
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: ...
waiting for all nodes disk sizes: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: ...
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: ...
waiting for all k8s nodes to report: can't find expected node with IPs ["10.17.3.10"]
waiting for all k8s nodes to report: OK
waiting for all k8s nodes to report ready: ...
waiting for all k8s nodes to report ready: some nodes are not ready: [c0 w0]
waiting for all k8s nodes to report ready: some nodes are not ready: [w0]
waiting for all k8s nodes to report ready: OK
waiting for all control plane static pods to be running: ...
waiting for all control plane static pods to be running: OK
waiting for all control plane components to be ready: ...
waiting for all control plane components to be ready: expected number of pods for kube-apiserver to be 1, got 0
waiting for all control plane components to be ready: OK
waiting for kube-proxy to report ready: ...
waiting for kube-proxy to report ready: SKIP
waiting for coredns to report ready: ...
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: ...
waiting for all k8s nodes to report schedulable: OK

Environment

  • Talos version:
Client:
	Tag:         v1.7.0-beta.0
	SHA:         78f97137
	Built:       
	Go version:  go1.22.2
	OS/Arch:     linux/amd64
Server:
	NODE:        10.17.3.10
	Tag:         v1.7.0-beta.0
	SHA:         78f97137
	Built:       
	Go version:  go1.22.2
	OS/Arch:     linux/amd64
	Enabled:     RBAC
  • Kubernetes version:
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3
  • Platform: nocloud in libvirt

The full terraform program is at https://github.com/rgl/terraform-libvirt-talos/tree/upgrade-to-talos-1.7.0-beta.0.

smira self-assigned this Apr 8, 2024
smira added a commit to smira/talos that referenced this issue Apr 12, 2024
Fixes siderolabs#8552

When `apid` notices an update in the PKI, it flushes its client connections
to other machines (used for proxying), as it might need to use a new
client certificate.

While flushing, just calling `Close` might abort already running
connections.

So instead, try to close gracefully with a timeout when the connection
is idle.

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit 336e611)
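
For context, a minimal sketch (assuming the standard grpc-go connectivity APIs; the package and function names are hypothetical, and this is not the actual Talos code) of the graceful-close idea described in the commit: wait, with a timeout, for the client connection to go idle before closing it.

package apidproxy

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// gracefulClose waits up to timeout for conn to become idle before closing it,
// so in-flight proxied RPCs are not aborted. If the connection never goes
// idle, it is closed anyway once the timeout expires.
func gracefulClose(conn *grpc.ClientConn, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	for {
		state := conn.GetState()
		if state == connectivity.Idle || state == connectivity.Shutdown {
			break
		}

		// WaitForStateChange returns false when ctx expires;
		// in that case stop waiting and close regardless.
		if !conn.WaitForStateChange(ctx, state) {
			break
		}
	}

	return conn.Close()
}
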
rgl (Contributor, Author) commented Apr 13, 2024

@smira, I've just tried 1.7.0-beta.1 and the error reported in this issue is still happening.

smira (Member) commented Apr 15, 2024

yep, I saw that in the integration tests as well, probably the fix is not complete.

smira reopened this Apr 15, 2024
smira added a commit to smira/talos that referenced this issue Apr 16, 2024
Fixes siderolabs#8552

This fixes up the previous fix, where the `for` condition was inverted, and
also updates the idle timeout so that the transition to idle happens
before the timeout expires.

Signed-off-by: Andrey Smirnov <[email protected]>
smira added a commit to smira/talos that referenced this issue Apr 19, 2024
Fixes siderolabs#8552

This fixes up the previous fix, where the `for` condition was inverted, and
also updates the idle timeout so that the transition to idle happens
before the timeout expires.

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit 5d07ac5)
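
To make the follow-up concrete, here's a before/after sketch of an inverted wait-for-idle loop versus the corrected one (an illustration of the kind of bug the commit describes, with hypothetical names, not the actual Talos diff). The other half of the fix is presumably ensuring the gRPC idle timeout is shorter than the close timeout, so the connection can actually reach Idle before the graceful close gives up.

package apidproxy

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// Buggy: the condition is inverted, so the loop exits immediately whenever the
// connection is busy, and the caller closes it right away, aborting in-flight RPCs.
func waitUntilIdleBuggy(ctx context.Context, conn *grpc.ClientConn) {
	for conn.GetState() == connectivity.Idle { // should be !=
		if !conn.WaitForStateChange(ctx, conn.GetState()) {
			return
		}
	}
}

// Fixed: keep waiting while the connection is NOT idle (or until ctx expires),
// then let the caller close it.
func waitUntilIdleFixed(ctx context.Context, conn *grpc.ClientConn) {
	for state := conn.GetState(); state != connectivity.Idle && state != connectivity.Shutdown; state = conn.GetState() {
		if !conn.WaitForStateChange(ctx, state) {
			return // timed out waiting: close anyway
		}
	}
}
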
github-actions bot locked as resolved and limited conversation to collaborators Jun 16, 2024