Race condition while looping over hosts in template #191

Open
ksparavec opened this issue Sep 20, 2023 · 2 comments

Comments

@ksparavec

There is a race condition triggered in the template loop:

{% for host in ansible_play_hosts %}
{%   if host != inventory_hostname %}

[Peer]
# {{ host }}
PublicKey = {{ hostvars[host].wireguard__fact_public_key }}
...
{%   endif %}
{% endfor %}

What happens here is that the loop body needs wireguard__fact_public_key, a fact that is dynamically generated on the other hosts, but there is no wait to ensure that this fact is actually defined at the point of execution. One normally notices this when the inventory is geographically spread, as in my case where some hosts are on the local network and some are in the cloud. Hosts on the local network are typically faster and reach the template loop before the cloud hosts have had a chance to set the fact.

A simple workaround is to put something like this in main.yml just before the template task:

- name: Sleep for 10 seconds and continue with play
  ansible.builtin.wait_for:
    timeout: 10

Of course, a 10-second wait is arbitrary and this is no real solution. A solution may involve more flexible waiting before the template loop, periodically checking that the facts exist before using them, or, even better, removing the race condition altogether, so that all WireGuard keys and configs are generated on the Ansible controller and distributed to the clients afterwards. The second solution may raise some security concerns regarding private keys; however, one should bear in mind that if the Ansible controller somehow gets compromised, one probably has much bigger problems than WireGuard keys anyway.
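A minimal sketch of the first idea, assuming the fact is published by an earlier task in the role; the retry values are arbitrary, and whether hostvars picks up late-arriving facts between retries depends on the play strategy:

# Illustrative guard, not part of the role today.
- name: Wait until all peers have published their public key
  ansible.builtin.assert:
    that: hostvars[item].wireguard__fact_public_key is defined
    fail_msg: "wireguard__fact_public_key not yet set on {{ item }}"
    quiet: true
  loop: "{{ ansible_play_hosts | difference([inventory_hostname]) }}"
  retries: 30
  delay: 2
  until: hostvars[item].wireguard__fact_public_key is defined

At worst this fails fast with a clear error message instead of rendering a broken config from the template.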

Notice also that serializing the play won't cut it either: each serial batch runs the whole play, including the template task, before the next batch has had a chance to set its facts, so the play will always fail, because none of the dynamically generated facts from hosts in later batches are available in the template loop. Similarly, when executing the play with a big number of hosts, say 1000, with serial set to say 20, the loop will fail too, even if all hosts are on the local network, i.e. equally fast.

Explicit looping over the Ansible inventory is generally discouraged anyway, but I can see that in this particular case it lends itself nicely to creating the WireGuard config in a pretty elegant way. Unfortunately, elegance comes at a price.

@githubixx
Owner

Thanks for your comment. I actually never had this problem, despite also having hosts in the cloud and locally. But yeah, with a slow line or lots of hosts it might happen. Making the timeout for ansible.builtin.wait_for a variable would at least give people the possibility to configure it. But as you said, it's not really reliable, so checking the facts would be way nicer. If I find some time I'll have a look, but currently I'm pretty busy with other stuff. It's also not that easy to reproduce: you can basically only assume that it works after the change if the role itself produces the same output 😉
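A minimal sketch of that variable idea, assuming a new role default; the name wireguard_fact_wait_timeout is made up for illustration:

# defaults/main.yml (hypothetical new default)
wireguard_fact_wait_timeout: 10

# tasks/main.yml, just before the template task
- name: Give slower hosts time to publish their facts
  ansible.builtin.wait_for:
    timeout: "{{ wireguard_fact_wait_timeout }}"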

@ksparavec
Author

Well, yes, triggering race conditions is really use-case dependent. Here it will happen as soon as you have hosts in two separate locations, because the network delay between locations is usually at least one order of magnitude higher than delays within the local network or on the local host used as the controller.

I am currently working on a project that needs full-mesh VPN connectivity between several locations, the cloud being one of them. I hit the issue immediately upon the first role execution, so it is actually not that difficult to reproduce. For test purposes, you can just set serial: 1 in the playbook and run it for your use case. It will fail on the first host, because that host does not have the facts from the other hosts yet. Of course, the wait workaround won't help to fix it in this case, but delegating key generation to the Ansible controller would. If I find some time, I might provide a full patch. It would certainly be less work than writing my own role from scratch ;-)
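For the record, a rough sketch of what controller-side key generation could look like; this is not how the role currently works, the task layout and register names are made up, and it assumes the wg tool is installed on the controller:

# Hypothetical tasks, not part of the role today. Runs once per host,
# but all commands execute on the controller via delegate_to.
- name: Generate a WireGuard private key on the controller
  ansible.builtin.command: wg genkey
  register: wireguard__genkey
  delegate_to: localhost
  changed_when: true
  no_log: true

- name: Derive the matching public key on the controller
  ansible.builtin.command: wg pubkey
  args:
    stdin: "{{ wireguard__genkey.stdout }}"
  register: wireguard__pubkey
  delegate_to: localhost
  changed_when: false
  no_log: true

- name: Publish both keys as facts for this host
  ansible.builtin.set_fact:
    wireguard__fact_private_key: "{{ wireguard__genkey.stdout }}"
    wireguard__fact_public_key: "{{ wireguard__pubkey.stdout }}"

Since the keys would then live in hostvars before any host renders its config, the template loop would no longer depend on the timing of the individual hosts.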
