[NET-10544] bugfix: catalog sync fails repeatedly to deregister a service in some scenarios #4266
Conversation
Personal review
Note: I recommend reviewing in the "split diff" view since much of this change is restructuring code to make it more readable and to make the similarities between two codepaths more apparent.
```go
err = backoff.Retry(func() error {
	services, meta, err = consulClient.Catalog().NodeServiceList(s.ConsulNodeName, opts)
	return err
}, backoff.WithContext(backoff.NewExponentialBackOff(), ctx))
```
This is one of the two major problems addressed in this PR. Because these backoffs had no maximum, they would retry forever, until the sync process was terminated.

Combined with the use of blocking queries, which returned errors when nothing happened during the wait period, this meant the services were never actually returned in clusters with no config changes, so expected deregistrations would never happen once a k8s `Service` was deleted.
```go
WaitIndex: 1,
WaitTime:  1 * time.Minute,
```
The use of blocking queries here is the second issue addressed by this PR. To my knowledge, there is no reason to use blocking queries here. They were returning errors when no updates occurred during the wait period, so the failing request would be retried over and over, forever.

Combined with the backoff-retry having no limit, this meant the services would never actually be returned and compared with the k8s `Service`s, so expected deregistrations would never actually happen once a k8s `Service` was deleted.
Changes proposed in this PR
- `<nil>` for the host and retry forever

How I've tested this PR
- `global.imageK8S`
- `Service` selecting one or more `Pod`s
- `Pod`
- `Service`
- `values.yaml`
How I expect reviewers to test this PR
See above
Checklist