
talos_cluster_health apparently requires control_plane_nodes to be IP addresses #143

Open
michaelbeaumont opened this issue Dec 27, 2023 · 5 comments

Comments


michaelbeaumont commented Dec 27, 2023

I see an error like the following when using talos_cluster_health:

rpc error: code = Unknown desc = ParseAddr("services-cp-0"): unable to parse IP

The error message isn't clear, but a little testing shows it comes from the control_plane_nodes value. This requirement is in contrast to other data sources that take nodes, like talos_client_configuration or talos_cluster_kubeconfig.

Additionally, I think examples of exactly how to use talos_cluster_health to do what talos_cluster_kubeconfig.wait did would be helpful. Replacing an argument with a data source deserves explanation.
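Something like the following sketch is what I'd expect such an example to show (untested; the IPs and resource names are placeholders, not from this thread): the health check data source gates the kubeconfig resource via depends_on, reproducing the old wait behavior.

```hcl
# Hedged sketch: wait for cluster health before fetching the kubeconfig.
# All addresses below are placeholders.
data "talos_cluster_health" "this" {
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = ["10.0.0.10"]
  control_plane_nodes  = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
  worker_nodes         = ["10.0.0.20"]
}

resource "talos_cluster_kubeconfig" "this" {
  depends_on           = [data.talos_cluster_health.this]
  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = "10.0.0.10"
}
```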


zargony commented Dec 29, 2023

I ran into the same issue and was wondering about best practice for the endpoint and nodes arguments.

In my talosconfig, I'm using FQDNs (at home) or IPs (in the cloud) as endpoints, and hostnames as nodes. It's easier to distinguish nodes by name than to remember IPs, e.g. with talosctl -n or when reading outputs. So I did the same in my Terraform configuration, but now that this issue came up, I wonder what the best practice is for specifying endpoints and nodes.


zargony commented Dec 29, 2023

I just found out that the IP addresses given in control_plane_nodes are not only used for connecting but also to verify whether they're etcd members.

I can imagine that this might be a bit too strict in some cases. E.g. when control plane nodes fail but etcd still has quorum, this health check will fail and potentially block applying the changes that would fix the situation. In my case, the health check took forever and failed because I used the control plane's IPv4 addresses while the etcd members use IPv6 addresses. Wouldn't it be more appropriate to relax the check and just verify that etcd has quorum? (This also wouldn't require knowing the exact etcd member IPs.) Or maybe make it possible to optionally check for quorum only (e.g. by not giving control plane IPs, or by setting an option)?

frezbo (Member) commented Dec 29, 2023

> I just found out that the IP addresses given in control_plane_nodes are not only used for connecting but also to verify whether they're etcd members.
>
> I can imagine that this might be a bit too strict in some cases. E.g. when control plane nodes fail but etcd still has quorum, this health check will fail and potentially block applying the changes that would fix the situation. In my case, the health check took forever and failed because I used the control plane's IPv4 addresses while the etcd members use IPv6 addresses. Wouldn't it be more appropriate to relax the check and just verify that etcd has quorum? (This also wouldn't require knowing the exact etcd member IPs.) Or maybe make it possible to optionally check for quorum only (e.g. by not giving control plane IPs, or by setting an option)?

The checks are currently designed for full cluster-wide health ("cluster" here doesn't mean Kubernetes, but the whole Talos cluster).

The etcd advertise subnets can be user-defined to specify which addresses to listen on, so it's entirely customizable; otherwise Talos just tries to pick a default.

smira (Member) commented Dec 29, 2023

@frezbo I think the problem is that the underlying Talos health check is not flexible enough for multi-homed clusters: it assumes a single IP per node. This could be fixed, of course.


zargony commented Jan 22, 2024

FYI, I was able to work around my issue (control plane using IPv6 addresses not known to Terraform) by using talos_cluster_health with control_plane_nodes = [], which still seems to do some checks (at least it waits for the cluster to become ready when creating the control plane).

(Yes, etcd advertise subnets can be configured. I intentionally set it to 2000::/3 on my cloud servers, since IPv4 might not always be available on all of them.)
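For reference, the workaround roughly looks like this (an untested sketch; the endpoint and resource names are placeholders): an empty control_plane_nodes list sidesteps the per-node etcd-member IP comparison while the data source still blocks until the endpoints report a ready cluster.

```hcl
# Hedged workaround sketch: empty control_plane_nodes avoids the
# "unable to parse IP" / etcd member IP mismatch, but still waits
# for cluster readiness via the endpoints. Placeholder values.
data "talos_cluster_health" "wait" {
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = ["cp.example.com"]
  control_plane_nodes  = []
}
```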
