Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lock is not released if reboot of node is blocked #792

Closed
dhedberg opened this issue Jul 17, 2023 · 3 comments · Fixed by #819
Closed

Lock is not released if reboot of node is blocked #792

dhedberg opened this issue Jul 17, 2023 · 3 comments · Fixed by #819
Milestone

Comments

@dhedberg
Copy link

We want to prevent automatic reboots of any nodes where a particular set of pods are running. To that end we have configured a --blocking-pod-selector. The idea being that while that particular node will not be rebooted at the moment, any other nodes will still be allowed to reboot if they can.

The issue with that seems to be that the lock is taken before the check for blocking pods, but not released when a blocking pod is found. Since the lock is still being held, this prevents any other nodes that could reboot from doing so.

Have I misunderstood anything? Is this a bug or the expected behavior?

@ckotzbauer
Copy link
Member

Yeah, when reading the code, it seems that the lock is acquired and not released in case that the reboot is blocked.
There are two possibilities for beeing blocked:

  • --alert-filter-regexp which queries prometheus and is independent of the current processed node (reboots are blocked in the whole cluster)
  • --blocking-pod-selector which is looking for specific pods on the processed node.

In case of the prometheus-query it is okay that the lock is held for longer and not released again as the reboots are blocked anyway. For the pod-selector it seems better to change the behaviour to release the lock again.

WDYT @jackfrancis?

@jackfrancis
Copy link
Collaborator

After re-reading the code a few times, I think the appropriate thing to do in this case is to simply check for blocking conditions prior to acquiring the lock. I've opened up a PR here that does that:

#819

Happy to let @ckotzbauer weigh in on this, as we now have two competing solutions for this issue. :)

@ckotzbauer
Copy link
Member

I merged your PR @jackfrancis, as I really like the approach of first checking for blockers before acquiring. Thanks for your thoughts!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants