
talos_cluster_health apparently requires control_plane_nodes to be IP addresses #143

Open
michaelbeaumont opened this issue Dec 27, 2023 · 5 comments

Comments


michaelbeaumont commented Dec 27, 2023

I see an error like the following when using talos_cluster_health:

rpc error: code = Unknown desc = ParseAddr("services-cp-0"): unable to parse IP

The error message isn't clear, but a little testing shows it comes from the control_plane_nodes value. This requirement is in contrast to other data sources that take nodes, like talos_client_configuration or talos_cluster_kubeconfig.

Additionally, I think examples of exactly how to use talos_cluster_health to do what talos_cluster_kubeconfig.wait did would be helpful. Replacing an argument with a data source deserves explanation.
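Something like the following sketch is what I'd expect such an example to show (untested; the IPs and resource names are placeholders, not from this thread): the health check data source gates the kubeconfig resource via depends_on, reproducing the old wait behavior.

```hcl
# Hedged sketch: wait for cluster health before fetching the kubeconfig.
# All addresses below are placeholders.
data "talos_cluster_health" "this" {
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = ["10.0.0.10"]
  control_plane_nodes  = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
  worker_nodes         = ["10.0.0.20"]
}

resource "talos_cluster_kubeconfig" "this" {
  depends_on           = [data.talos_cluster_health.this]
  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = "10.0.0.10"
}
```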


zargony commented Dec 29, 2023

I ran into the same issue and was wondering about best practice for the endpoint and nodes arguments.

In my talosconfig, I'm using FQDNs (at home) or IPs (in the cloud) as endpoints, and hostnames as nodes. It's easier to distinguish nodes by name than to remember IPs, e.g. with talosctl -n or when reading outputs. So I did the same in my Terraform configuration, but now that this issue came up, I wonder what the best practice is for specifying endpoints and nodes.


zargony commented Dec 29, 2023

I just found out that the IP addresses given in control_plane_nodes are not only used for connecting but also to verify whether they're etcd members.

I can imagine that this might be a bit too strict in some cases. E.g. when control plane nodes fail but etcd still has quorum, this health check will fail and potentially block applying the changes that would fix the situation. In my case, the health check took forever and failed because I used the control plane's IPv4 addresses while the etcd members use IPv6 addresses. Wouldn't it be more appropriate to relax the check and just verify that etcd has quorum? (This also wouldn't require knowing the exact etcd member IPs.) Or maybe make it possible to optionally check for quorum only (e.g. by not giving control plane IPs, or by setting an option)?

frezbo (Member) commented Dec 29, 2023

> I just found out that the IP addresses given in control_plane_nodes are not only used for connecting but also to verify whether they're etcd members.
>
> I can imagine that this might be a bit too strict in some cases. E.g. when control plane nodes fail but etcd still has quorum, this health check will fail and potentially block applying the changes that would fix the situation. In my case, the health check took forever and failed because I used the control plane's IPv4 addresses while the etcd members use IPv6 addresses. Wouldn't it be more appropriate to relax the check and just verify that etcd has quorum? (This also wouldn't require knowing the exact etcd member IPs.) Or maybe make it possible to optionally check for quorum only (e.g. by not giving control plane IPs, or by setting an option)?

The checks are currently designed for full cluster-wide health ("cluster" here doesn't mean Kubernetes, but the whole Talos cluster).

The etcd advertise subnets can be user-defined to specify which addresses to listen on, so it's entirely customizable; otherwise Talos just tries to pick a default.

smira (Member) commented Dec 29, 2023

@frezbo I think the problem is that the underlying Talos health check is not flexible enough for multi-homed clusters: it assumes a single IP per node. This could be fixed, of course.


zargony commented Jan 22, 2024

FYI, I was able to work around my issue (control plane using IPv6 addresses not known to Terraform) by using talos_cluster_health with control_plane_nodes = [], which still seems to do some checks (at least it waits for the cluster to become ready when creating the control plane).

(Yes, etcd advertise subnets can be configured. I intentionally set it to 2000::/3 on my cloud servers, since IPv4 might not always be available on all of them.)
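For reference, the workaround roughly looks like this (an untested sketch; the endpoint and resource names are placeholders): an empty control_plane_nodes list sidesteps the per-node etcd-member IP comparison while the data source still blocks until the endpoints report a ready cluster.

```hcl
# Hedged workaround sketch: empty control_plane_nodes avoids the
# "unable to parse IP" / etcd member IP mismatch, but still waits
# for cluster readiness via the endpoints. Placeholder values.
data "talos_cluster_health" "wait" {
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = ["cp.example.com"]
  control_plane_nodes  = []
}
```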
