[NET-10544] bugfix: catalog sync fails repeatedly to deregister a service in some scenarios #4266
Conversation
Personal review
Note: I recommend reviewing in the "split diff" view since much of this change is restructuring code to make it more readable and to make the similarities between two codepaths more apparent.
```go
err = backoff.Retry(func() error {
	services, meta, err = consulClient.Catalog().NodeServiceList(s.ConsulNodeName, opts)
	return err
}, backoff.WithContext(backoff.NewExponentialBackOff(), ctx))
```
This is one of the two major problems addressed in this PR. Because these backoffs had no maximum, they would retry forever, until the sync process was terminated.

Combined with the use of blocking queries, which returned errors when nothing happened during the wait period, this meant the services were never actually returned in clusters with no config changes, so expected deregistrations would never happen once a k8s `Service` was deleted.
```go
WaitIndex: 1,
WaitTime:  1 * time.Minute,
```
The use of blocking queries here is the second issue addressed by this PR. To my knowledge, there is no reason to use blocking queries here. They were returning errors when no updates occurred during the wait period, so the failing request would be retried over and over, forever.

Combined with the backoff-retry having no limit, this meant the services would never actually be returned and compared with the k8s `Service`s, so expected deregistrations would never actually happen once a k8s `Service` was deleted.
Changes proposed in this PR
- `<nil>` for the host and retry forever

How I've tested this PR
- `global.imageK8S`
- `Service` selecting one or more `Pod`s
- `Pod`
- `Service`
- `values.yaml`
How I expect reviewers to test this PR
See above
Checklist