
[NET-10544] bugfix: catalog sync fails repeatedly to deregister a service in some scenarios #4266

Merged (3 commits) on Aug 30, 2024

Conversation

@nathancoleman (Member) commented Aug 27, 2024

Changes proposed in this PR

  • Stop using blocking queries when listing services for potential reaping
    • Blocking queries hurt us here: when no change happens during the wait period, the request errors and is retried forever, even though all we need is the current list of catalog services to diff against the existing list of k8s Services
  • Add a limit to the backoff-retry mechanism so that we don't retry infinitely with a bad client
    • Without a limit, if consul-server-connection-manager isn't able to get a valid IP, for example, the client is configured with <nil> for the host and retries forever (see the sketch after this list)
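
For illustration, here is a minimal Go sketch of the combined shape of these two changes, assuming the cenkalti/backoff and hashicorp/consul/api libraries that catalog sync already uses; the package and function names and the retry count are invented for the example, not taken from this PR.

package catalogsync // hypothetical package name, for illustration only

import (
	"context"

	"github.com/cenkalti/backoff/v4"
	capi "github.com/hashicorp/consul/api"
)

// listNodeServices fetches the current catalog services for a node with a plain
// (non-blocking) query and a bounded retry, so a bad client or an idle cluster
// cannot turn the list call into an infinite retry loop.
func listNodeServices(ctx context.Context, client *capi.Client, nodeName string) (*capi.CatalogNodeServiceList, error) {
	var services *capi.CatalogNodeServiceList

	// No WaitIndex/WaitTime: return the current list immediately, which is all
	// the sync process needs in order to diff against k8s Services.
	opts := (&capi.QueryOptions{}).WithContext(ctx)

	// Cap the exponential backoff at a fixed number of attempts (10 here is
	// arbitrary) instead of retrying forever.
	capped := backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 10)

	err := backoff.Retry(func() error {
		var listErr error
		services, _, listErr = client.Catalog().NodeServiceList(nodeName, opts)
		return listErr
	}, backoff.WithContext(capped, ctx))
	return services, err
}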

How I've tested this PR

  1. Install Consul on your k8s cluster with catalog sync enabled, using this build for global.imageK8S (values.yaml below)
  2. Create a k8s Service selecting one or more Pods
  3. Delete the consul-server Pod
  4. Delete the k8s Service
  5. Observe that the corresponding Consul service is deregistered, where it was not before this change
values.yaml
global:
  name: consul
  datacenter: dc1

  imageK8S: consul-k8s-control-plane:dev

  tls:
    enabled: true
    enableAutoEncrypt: true
  
  acls:
    manageSystemACLs: true

  metrics:
    enabled: true

connectInject:
  enabled: false

syncCatalog:
  enabled: true
  metrics:
    enabled: true

prometheus:
  enabled: true

ui:
  enabled: true
  metrics:
    provider: "prometheus"
    baseURL: http://prometheus-server.prometheus

How I expect reviewers to test this PR

See above

Checklist

@nathancoleman (Member, Author) left a comment

Personal review

Note: I recommend reviewing in the "split diff" view since much of this change is restructuring code to make it more readable and to make the similarities between two codepaths more apparent.

Comment on lines -201 to -243
err = backoff.Retry(func() error {
	services, meta, err = consulClient.Catalog().NodeServiceList(s.ConsulNodeName, opts)
	return err
}, backoff.WithContext(backoff.NewExponentialBackOff(), ctx))
@nathancoleman (Member, Author) commented Aug 27, 2024

This is one of the two major problems addressed in this PR. Because these backoffs had no maximum, they would retry forever until the sync process was terminated.

Combined with blocking queries that returned errors when nothing happened during the wait period, this meant the services were never actually returned in clusters with no config changes, so expected deregistrations never happened once a k8s Service was deleted.
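
As a rough sketch of the direction of the fix (not the exact code merged here), the unbounded backoff above can be capped with the same library's WithMaxRetries wrapper; the helper name and the attempt count below are invented for the example.

package catalogsync // hypothetical package name, for illustration only

import (
	"context"

	"github.com/cenkalti/backoff/v4"
)

// retryBounded runs op with exponential backoff but gives up after a fixed
// number of attempts (10 is arbitrary) instead of retrying forever.
func retryBounded(ctx context.Context, op backoff.Operation) error {
	capped := backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 10)
	return backoff.Retry(op, backoff.WithContext(capped, ctx))
}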

Comment on lines -178 to -218
	WaitIndex: 1,
	WaitTime:  1 * time.Minute,
@nathancoleman (Member, Author) commented

The use of blocking queries here is the second issue addressed by this PR. To my knowledge, there is no reason to use blocking queries here: they were returning errors when no updates occurred during the wait period, so the failing request was retried over and over, forever.

Combined with the unlimited backoff-retry, this meant the services were never actually returned and compared with the k8s Services, so expected deregistrations never happened once a k8s Service was deleted.
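
For context, a small sketch of the difference, assuming the hashicorp/consul/api client; the blocking values mirror the removed lines above, and the package and function names are illustrative only.

package catalogsync // hypothetical package name, for illustration only

import (
	"time"

	capi "github.com/hashicorp/consul/api"
)

// queryOptions contrasts the removed blocking options with a plain read.
func queryOptions() (blocking, plain *capi.QueryOptions) {
	// Blocking: with WaitIndex > 0, Consul holds the request open until the
	// index changes or WaitTime elapses; per the description above, this was
	// surfacing as errors and endless retries when nothing changed.
	blocking = &capi.QueryOptions{
		WaitIndex: 1,
		WaitTime:  1 * time.Minute,
	}

	// Non-blocking: zero-value options return the current service list
	// immediately, which is all the diff against k8s Services needs.
	plain = &capi.QueryOptions{}
	return blocking, plain
}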

@nathancoleman nathancoleman marked this pull request as ready for review August 29, 2024 21:36