Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add refreshes and retries to server-acl-init job #3137

Merged
merged 1 commit into from
Nov 21, 2023

Conversation

curtbushko
Copy link
Contributor

@curtbushko curtbushko commented Oct 26, 2023

Changes proposed in this PR:

  • When the server-acl-init job runs it boostraps ACL tokens and sets a bunch of things on the server up
  • It gets the static IPs for the servers, creates a boostrap token and sets a policy using that IP
  • Then consul-server-connection-manager runs and other ACLs get set up using the consul client
  • Usually this is fine.
  • But during upgrades the servers statefulsets can change and thus the server IPs change.
  • Our code was set to retry over and over again and would eventually timeout after 10 minutes
  • This does not work as it forces customers to run the upgrades several times in hopes that the server IPs do not change.
  • I've created a wrapper around the consul client called DynamicClient that allows you to refresh the IPs using consul-server-connection-manager
  • The DynamicClient will be useful in other place also.
  • A customer reported a single error but in my testing I discovered about 5 other places where the IPs would be incorrect and we needed to refresh.
  • The key areas are around 'untilSucceeds()' as these consul calls would fail forever

How I've tested this PR:

  • Added a unit test to DynamicClient that simulates a bad server ip, a refresh with a good ip and a client call.
  • The race condition does not occur all the time but I have been able to reproduce it manually by upgrading over and over again, which sometimes forces the servers to change IPs.

How I expect reviewers to test this PR:

👀

Checklist:

@curtbushko curtbushko force-pushed the NET-5991-acl-init-upgrade branch 2 times, most recently from 27ec9f7 to fb5d03e Compare November 17, 2023 19:08
@curtbushko curtbushko self-assigned this Nov 17, 2023
@curtbushko curtbushko added backport/1.1.x Backport to release/1.1.x branch backport/1.2.x This release branch is no longer active. backport/1.3.x labels Nov 17, 2023
@curtbushko curtbushko marked this pull request as ready for review November 17, 2023 19:16
Add refreshes and retries to server-acl-init job
@curtbushko curtbushko requested review from DanStough and removed request for thisisnotashwin November 20, 2023 15:48
Copy link
Contributor

@DanStough DanStough left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how I feel about the nest backoff/retry, but otherwise looks great! Thank you!

control-plane/consul/dynamic.go Show resolved Hide resolved
control-plane/consul/dynamic.go Show resolved Hide resolved
Comment on lines +1063 to +1074
if err = backoff.Retry(func() error {
ipAddrs, err = netaddrs.IPAddrs(c.ctx, c.consulFlags.Addresses, c.log)
if err != nil {
c.log.Error("Error resolving IP Address", "err", err)
return err
}
c.log.Info("Refreshing server IP addresses", "addresses", ipAddrs)
return nil
}, exponentialBackoffWithMaxInterval()); err != nil {
return nil, err
}
return ipAddrs, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the nested backoff/retry? There may be a case for resetting the parent loop, but it might be easier to reason about with one-level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of the style of it but we are using a similar retry style during the bootstrap (line 330) and I went with that for consistency.

@curtbushko curtbushko merged commit 6638261 into main Nov 21, 2023
44 of 48 checks passed
@curtbushko curtbushko deleted the NET-5991-acl-init-upgrade branch November 21, 2023 20:39
@curtbushko curtbushko restored the NET-5991-acl-init-upgrade branch November 21, 2023 20:41
@curtbushko curtbushko deleted the NET-5991-acl-init-upgrade branch November 21, 2023 20:42
curtbushko added a commit that referenced this pull request Nov 28, 2023
…to release/1.2.x (#3251)

Add refreshes and retries to server-acl-init job
curtbushko added a commit that referenced this pull request Nov 28, 2023
…to release/1.1.x (#3252)

Add refreshes and retries to server-acl-init job
sarahalsmiller pushed a commit that referenced this pull request Jan 5, 2024
Add refreshes and retries to server-acl-init job
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.1.x Backport to release/1.1.x branch backport/1.2.x This release branch is no longer active. backport/1.3.x
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants