Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for excluding/including nodes from external load balancers before/after reboot #378

Closed
wants to merge 5 commits into from

Conversation

amorey
Copy link
Contributor

@amorey amorey commented May 22, 2021

This PR enables nodes that are registered to external load balancers to be restarted gracefully by removing them from the load balancers before reboot and adding them back after. This is accomplished by adding the "node.kubernetes.io/exclude-from-external-load-balancers" label to the node before reboot which triggers de-registration by the kubernetes control plane. Then after reboot the node label is removed which triggers re-registration by the control plane. In addition, a hard-coded reboot delay of 105 seconds is added to allow time for node instances to finish de-registration.

This functionality was added in a bare-bones fashion and should not be merged as-is. Here are some questions:

  • Should the node label addition/removal code go in cmd/kured/main.go?
  • Is there a better way to add/remove node labels than using Patch()?
  • Do you have any recommendations in terms of error handling / notification / retries?
  • How should we handle the case of nodes that already have the label present pre-reboot?
  • What should the command line option for external-load-balancer handling be named?
  • What should the command line option for the reboot delay be named?
  • How should this code be tested?

Thanks for your consideration! Looking forward to hearing your feedback!

@amorey
Copy link
Contributor Author

amorey commented Jun 25, 2021

@evrardjp I added a reboot-delay command line flag to allow users to delay the reboot until the instance finishes de-registering from its external load balancers. I also made some other code improvements and rebased with main. Can you take a look at the PR and let me know what else needs to be done to get this merged into main?

@amorey
Copy link
Contributor Author

amorey commented Jun 30, 2021

@evrardjp All checks have passed. What is the next step?

@evrardjp
Copy link
Collaborator

evrardjp commented Jul 28, 2021

Hello! I will have a look :)
If you are interested, there should be a community meeting today (see details in https://github.com/weaveworks/kured#getting-help), we can probably discuss this PR there.

… before/after restart, added delay to wait for de-registration to finish before rebooting
Copy link
Collaborator

@evrardjp evrardjp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it would be better if you could split better your commits, and explain WHY you're doing those changes, in your commit messages.

Here is a good video on commit messages: https://www.youtube.com/watch?v=pU-VasVPNAs

drain(client, node)

if rebootDelay > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next to that, I would probably use the existing throttle function, maybe rename and refactor it so that it's clearer.

I am worried the users might not understand the difference between all those options we have, tbh, so I think it might be worth clarifying in the CLI arguments.

@@ -40,6 +40,7 @@ var (
// Command line flags
forceReboot bool
drainTimeout time.Duration
rebootDelay time.Duration
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best toremove this from this PR, as it's effectively not linked to the PR title.

Maybe we need to fix the current ELB feature to not require the addition of this delay. This way, the intent of the rebootDelay would would be more clear.

If we don't add this feature in a different PR, we should at least make the boundary very clear by having only 2 commits: One for ELB support, one for the reboot-delay. (You will need to squash your changes appropriately).

@@ -104,6 +114,8 @@ func main() {
"when seconds is greater than zero, skip waiting for the pods whose deletion timestamp is older than N seconds while draining a node (default: 0)")
rootCmd.PersistentFlags().DurationVar(&drainTimeout, "drain-timeout", 0,
"timeout after which the drain is aborted (default: 0, infinite time)")
rootCmd.PersistentFlags().DurationVar(&rebootDelay, "reboot-delay", 0,
"delay reboot for this duration (default: 0, disabled)")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"delay reboot for this duration after drain and exclusion from any ELB"? I feel this will be more understandable by someone not reading the code.

@@ -489,6 +547,7 @@ func rebootAsRequired(nodeID string, rebootCommand []string, sentinelCommand []s
if err != nil {
log.Fatalf("Error retrieving node object via k8s API: %v", err)
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not add those spaces changes in this commit. If you want to change the formatting to look better for you, please do it in a different commit/PR.

@evrardjp
Copy link
Collaborator

It's a nice contribution, I like it! I can totally see it making its way to the code base. We can improve the support to other cloud vendors in the future.

Thank you!

@amorey
Copy link
Contributor Author

amorey commented Jul 28, 2021

Thanks for taking a look!

It's a nice contribution, I like it! I can totally see it making its way to the code base. We can improve the support to other cloud vendors in the future.

Using the "node.kubernetes.io/exclude-from-external-load-balancers" node label offloads external load balancer de-registration to the K8S control plane so the code should already be able to handle other cloud vendors.

Maybe we need to fix the current ELB feature to not require the addition of this delay. This way, the intent of the rebootDelay would would be more clear.

As far as I know, there isn't a way to retrieve the external load balancer status from the K8S API which means that to delay the restart automatically would require writing vendor-specific code. Adding a reboot-delay is more straight-forward and easier to maintain. Plus I could imagine it being useful for other things like allowing logging utilities to finish sending messages before restart.

@amorey
Copy link
Contributor Author

amorey commented Aug 4, 2021

I split this up into two separate PRs which address your comments above:
#418
#419

Let me know if there are any issues with the PRs.

@amorey amorey closed this Aug 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants