
fix: return an error when we cannot swap the replicaset hashes fixes #2050 #2187

Conversation

@jandersen-plaid commented Aug 10, 2022

This fixes #2050 by ensuring that, if a rollout needs to delay swapping the replicaset hashes, we do not continue reconciling.

If we were to continue reconciling, we would end up moving the stable destination (which should by this point be the canary replicaset) to 100% and the canary destination (which should be the stable replicaset) to 0%. Importantly, if this happens at the end of a rollout where the stable replicaset has been spun down, it will cause a traffic outage because the stable replicaset will have 0 replicas available.

To illustrate, the failure mode is:

  1. We are at the end of a rollout and start to run reconciliation
  2. We see that the canary replicaset is unhealthy, so we do not swap the selectors
  3. We continue to reconcile and set the canary replicaset to 0 and the stable replicaset to 100 (if pods have been removed during this process, the stable replicaset could be overwhelmed)
  4. On a subsequent reconciliation we see that the canary replicaset is healthy
  5. We swap the canary and stable selectors
  6. Traffic to the correct stable replicaset (the former canary replicaset) finally goes to 100

The troublesome bit here is step 3, which can be catastrophic if the canary replicaset remains unhealthy for an indeterminate period of time. Returning an error instead only extends the period during which we are at the end of the rollout but have not yet completed it.
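
To make the intended change concrete, here is a minimal, self-contained sketch of the behavior (type, field, and function names below are illustrative stand-ins, not the controller's actual ones; the real change touches rollout/service.go and rollout/trafficrouting.go):

package main

import "fmt"

// Illustrative stand-ins for the controller's objects.
type replicaSet struct {
	podTemplateHash   string
	replicas          int32
	availableReplicas int32
}

type service struct {
	selectorHash string // value of the rollouts-pod-template-hash selector
}

// ensureServiceTargets models the idea in this PR: if the replicaset the
// service should point at is not fully available, do NOT silently skip the
// selector swap and keep reconciling -- return an error so the whole
// reconciliation (including the traffic-weight update) stops and is retried.
func ensureServiceTargets(svc *service, rs *replicaSet) error {
	if svc.selectorHash == rs.podTemplateHash {
		return nil // selector already points at the desired replicaset
	}
	if rs.availableReplicas < rs.replicas {
		return fmt.Errorf("delaying service switch from %s to %s: ReplicaSet not fully available",
			svc.selectorHash, rs.podTemplateHash)
	}
	svc.selectorHash = rs.podTemplateHash // safe to swap
	return nil
}

func main() {
	stableSvc := &service{selectorHash: "54cdff4499"}
	newRS := &replicaSet{podTemplateHash: "7cf9c66479", replicas: 5, availableReplicas: 3}

	if err := ensureServiceTargets(stableSvc, newRS); err != nil {
		// The rollout is requeued; traffic weights are left untouched until
		// the selector swap can actually happen.
		fmt.Println("requeue:", err)
	}
}

Previously the delay was only logged and reconciliation carried on, which is what allows step 3 above to happen.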

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

github-actions bot (Contributor) commented Aug 10, 2022

Go Published Test Results

1 801 tests: 1 801 passed ✔️, 0 skipped 💤, 0 failed; 102 suites, 1 file; 2m 31s ⏱️

Results for commit c5ffcf2.

♻️ This comment has been updated with latest results.

codecov bot commented Aug 10, 2022

Codecov Report

Base: 81.58% // Head: 81.60% // Increases project coverage by +0.02% 🎉

Coverage data is based on head (c5ffcf2) compared to base (c3e305d).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2187      +/-   ##
==========================================
+ Coverage   81.58%   81.60%   +0.02%     
==========================================
  Files         124      124              
  Lines       18959    18965       +6     
==========================================
+ Hits        15467    15476       +9     
+ Misses       2702     2700       -2     
+ Partials      790      789       -1     
Impacted Files              Coverage Δ
rollout/context.go          91.35% <ø> (ø)
rollout/service.go          77.94% <100.00%> (+0.21%) ⬆️
rollout/trafficrouting.go   77.01% <100.00%> (+1.52%) ⬆️

☔ View full report at Codecov.

github-actions bot (Contributor) commented Aug 10, 2022

E2E Tests Published Test Results

2 files, 2 suites, 1h 33m 25s ⏱️; 92 tests: 87 passed ✔️, 3 skipped 💤, 2 failed; 190 runs: 178 passed ✔️, 6 skipped 💤, 6 failed

For more details on these failures, see this check.

Results for commit c5ffcf2.

♻️ This comment has been updated with latest results.

@jessesuen (Member):


We continue to reconcile and set the canary replicaset to 0 and the stable replicaset to 100

@jandersen-plaid is the underlying bug actually here? I'm wondering if the bug is better fixed by choosing not to change the replica counts because the selectors are not yet swapped?

@jandersen-plaid (Author) commented Aug 10, 2022

We continue to reconcile and set the canary replicaset to 0 and the stable replicaset to 100

@jandersen-plaid is the underlying bug actually here? I'm wondering if the bug is better fixed by choosing not to change the replica counts because the selectors are not yet swapped?

Yea that works for me! This is just the quickest way I could think of to achieve the desired outcome.

The only issue I can see with choosing not to change the replica counts is that you have to communicate that decision between the reconcileStableAndCanaryService reconciliation function and the reconcileTrafficRouting function -- it breaks the pattern of only returning errors, so I was not sure whether passing a boolean about whether or not to swap would be un-idiomatic.

Sorry, I was reading too fast and misunderstood.

I don't think this is necessarily an issue with replica counts, but with the percentage of traffic routed to a specific replicaset -- one moment, I will post a fuller explanation and example.

@jandersen-plaid (Author) commented Aug 10, 2022

Consider the following rollout logs:

time                 log
2022-08-08 20:26:07  Reconciling TrafficRouting with type 'SMI'
2022-08-08 20:26:07  Found 1 TrafficRouting Reconcilers
2022-08-08 20:26:07  Switched selector for service 'server-stable' from '54cdff4499' to '7cf9c66479'
2022-08-08 20:26:07  Started syncing rollout
...                  ...
2022-08-08 20:24:18  Event(v1.ObjectReference{Kind:"Rollout", Namespace:"team", Name:"server", UID:"78dc26ef-012c-48c8-a0a6-2a4c3dd5d941", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"697247822", FieldPath:""}): type: 'Normal' reason: 'TrafficWeightUpdated' Traffic weight updated from 100 to 0
2022-08-08 20:24:18  No Steps remain in the canary steps
2022-08-08 20:24:18  No StableRS exists to reconcile or matches newRS
2022-08-08 20:24:18  Traffic weight updated from 100 to 0
2022-08-08 20:24:18  New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:server-canary,PodTemplateHash:7cf9c66479,},Stable:WeightDestination{Weight:100,ServiceName:server-stable,PodTemplateHash:7cf9c66479,},Additional:[]WeightDestination{},Verified:nil,}
2022-08-08 20:24:18  Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:100,ServiceName:server-canary,PodTemplateHash:7cf9c66479,},Stable:WeightDestination{Weight:0,ServiceName:server-stable,PodTemplateHash:54cdff4499,},Additional:[]WeightDestination{},Verified:nil,}
2022-08-08 20:24:18  Analysis has been deleted
2022-08-08 20:24:18  Started syncing Analysis at (2022-08-09 00:24:18.052059562 +0000 UTC m=+3509967.309868994)
2022-08-08 20:24:18  Enqueueing parent of team/server-7b5599bddb-255: Rollout team/server
2022-08-08 20:24:18  Reconciling TrafficRouting with type 'SMI'
2022-08-08 20:24:18  Enqueueing parent of team/server-7b5599bddb-255: Rollout team/server
2022-08-08 20:24:18  Found 1 TrafficRouting Reconcilers
2022-08-08 20:24:18  delaying service switch from 54cdff4499 to 7cf9c66479: ReplicaSet not fully available

The issue is at 20:24:18, where we have changed the weights via the traffic routing controller even though the service switch was being delayed. The replica counts have already maxed out with the last step, and the rollout has "finished".

the bug is better fixed by choosing not to change the replica counts because the selectors are not yet swapped?

I don't think this solves the issue because the replica counts are already set to 0 (for the stable) and the maximum (for the canary). That being said, I don't know these logs like you do so feel free to correct me on the timeline.

sonarcloud bot commented Aug 10, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A), 0 Vulnerabilities (rating A), 0 Security Hotspots (rating A), 1 Code Smell (rating A)
No coverage information; 0.0% duplication

@jandersen-plaid (Author)

Considering the number of end-to-end test failures, this is definitely the wrong approach though.

@jessesuen (Member) commented Aug 17, 2022

Some questions:

  1. What version of Rollouts are you running?
  2. Could you paste the Rollout spec so we can understand all the options in play?
  3. Are you able to provide some explicit steps/example rollout to reproduce the issue?
  4. If easily reproducible, could you see if the issue still exists in v1.3.0-rc1?

We had some discussion about this. The bug might be in reconcileTrafficRouting() where we are adjusting weights even though selectors are not yet swapped, and the fix may need to consider current label selectors before making the weight change of 0% canary.
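
For illustration, a rough sketch of what "considering the current label selectors" could look like (simplified, illustrative types; not the actual reconcileTrafficRouting code):

package main

import "fmt"

// Before dropping the canary weight to 0% at the end of the rollout, check
// that the stable Service already selects the replicaset that is about to
// receive 100% of traffic. Types and field names are illustrative.
type stableService struct {
	selectorHash string // current rollouts-pod-template-hash selector value
}

type desiredWeights struct {
	canaryWeight int32
	stableHash   string // hash the weights were computed for
}

func reconcileWeights(svc stableService, want desiredWeights, setWeight func(int32) error) error {
	if want.canaryWeight == 0 && svc.selectorHash != want.stableHash {
		// Selectors were not swapped yet (e.g. the new replicaset is not fully
		// available). Setting canary to 0% now would send all traffic to a
		// service that still selects the old, possibly scaled-down replicaset.
		return fmt.Errorf("skipping weight change: stable service selects %s, expected %s",
			svc.selectorHash, want.stableHash)
	}
	return setWeight(want.canaryWeight)
}

func main() {
	svc := stableService{selectorHash: "54cdff4499"} // swap was delayed
	want := desiredWeights{canaryWeight: 0, stableHash: "7cf9c66479"}

	err := reconcileWeights(svc, want, func(w int32) error {
		fmt.Println("SetWeight", w)
		return nil
	})
	fmt.Println(err) // the weight change is deferred instead of blackholing traffic
}

In the failing logs above, such a guard would have tripped, because the stable Service still selected 54cdff4499 while the new weights were computed for 7cf9c66479.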

@jandersen-plaid (Author)

Hey @jessesuen thanks for following up here!

Some questions:

  1. What version of Rollouts are you running?
    We are running Argo Rollouts v1.2.0:
      containers:
      - image: "245200388354.dkr.ecr.us-east-1.amazonaws.com/quay/argoproj/argo-rollouts:v1.2.0"
        args:
        - --leader-elect
        imagePullPolicy: IfNotPresent
        name: argo-rollouts
  2. Could you paste the Rollout spec so we can understand all the options in play?

Here it is:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  labels:    
    app.kubernetes.io/managed-by: "Helm"
    app.kubernetes.io/name: "server"
    app.kubernetes.io/component: "server"
    k8s.plaid.io/team: "cs"
    app.kubernetes.io/version: "3.20.0"
    helm.sh/chart: "plaid-app-3.20.0"
  name: server
  namespace: cs-team
spec:
  replicas: 450
  strategy:
    canary:
      dynamicStableScale: true
      canaryService: server-canary
      canaryMetadata:
        labels:
          k8s.plaid.io/deployment-role: canary
      stableService: server-stable
      stableMetadata:
        labels:
          k8s.plaid.io/deployment-role: stable

      trafficRouting:
        smi:
          trafficSplitName: server-traffic-split
          rootService: server
      analysis:
        templates:
        - templateName: server-analysis
        args:
        - name: service
          valueFrom:
            fieldRef:
              fieldPath: "metadata.labels['app.kubernetes.io/name']"
        - name: commit
          valueFrom:
            fieldRef:
              fieldPath: "metadata.labels['k8s.plaid.io/commit-sha']"
        - name: stable-hash
          valueFrom:
            podTemplateHashValue: Stable
        - name: latest-hash
          valueFrom:
            podTemplateHashValue: Latest
      steps:
        - setWeight: 1
        - pause:
            duration: 10m0s
        - setWeight: 10
        - pause:
            duration: 20m0s
        - setWeight: 25
        - pause:
            duration: 30m0s
        - setWeight: 50
        - pause:
            duration: 30m0s
        - setWeight: 100
        - pause:
            duration: 30m0s
  revisionHistoryLimit: 2
  selector:
    matchLabels:      
      app.kubernetes.io/managed-by: "Helm"
      app.kubernetes.io/name: "server"
      app.kubernetes.io/component: "server"
      k8s.plaid.io/team: "cs"
  template:    
    metadata:
      labels:    
        app.kubernetes.io/managed-by: "Helm"
        app.kubernetes.io/name: "server"
        app.kubernetes.io/component: "server"
        k8s.plaid.io/team: "cs"
        app.kubernetes.io/version: "3.20.0"
        helm.sh/chart: "plaid-app-3.20.0"    
        k8s.plaid.io/commit-sha: "05354ced8b0e3597227a5d6958aab2222dbe28d3"
      annotations:
        iam.amazonaws.com/role: k8s-server-production
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
        linkerd.io/inject: enabled
        config.linkerd.io/skip-outbound-ports: "25,3306,5432,6379,11211,27017"
        config.linkerd.io/close-wait-timeout: "3600s"
        config.linkerd.io/proxy-cpu-request: "100m"
        config.linkerd.io/proxy-memory-request: "100Mi"
        config.linkerd.io/proxy-cpu-limit: ""
        config.linkerd.io/proxy-memory-limit: "1Gi"
        config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "185"
        config.linkerd.io/proxy-log-format: "json"
        config.linkerd.io/proxy-log-level: warn,linkerd2_proxy=info    
    spec:
      terminationGracePeriodSeconds: 185
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: k8s.plaid.io/spot
                operator: DoesNotExist
                values: []
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: server
                  app.kubernetes.io/name: server
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - name: app
        image: "server:05354ced"
        resources:
          limits:
            cpu: 8000m
            memory: 16Gi
          requests:
            cpu: 4000m
            memory: 12Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8025
          initialDelaySeconds: 60
          periodSeconds: 60
          successThreshold: 1 # liveness must be 1
          failureThreshold: 5
          timeoutSeconds: 60
        readinessProbe:
          httpGet:
            path: /health
            port: 8025
          initialDelaySeconds: 60
          periodSeconds: 60
          successThreshold: 1
          failureThreshold: 5
          timeoutSeconds: 60
        command:
        - "./start_server.sh"
        env:
        - name: PLAID_ENV
          value: "production"
        - name: K8S_METADATA_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: K8S_METADATA_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: K8S_LABEL_APP_KUBERNETES_IO_MANAGED_BY
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app.kubernetes.io/managed-by']
        - name: K8S_LABEL_APP_KUBERNETES_IO_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app.kubernetes.io/name']
        - name: K8S_LABEL_APP_KUBERNETES_IO_TEAM
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['k8s.plaid.io/team']
        - name: K8S_LABEL_APP_KUBERNETES_IO_VERSION
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app.kubernetes.io/version']
        - name: K8S_LABEL_HELM_SH_CHART
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['helm.sh/chart']
        - name: K8S_LABEL_K8S_PLAID_IO_NETWORK_ACCESS_GROUP
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['k8s.plaid.io/network-access-group']
        - name: K8S_HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: K8S_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        ports:
        - name: main
          containerPort: 8024
        - name: metrics
          containerPort: 8025
      dnsConfig:
        options:
        - name: ndots
          value: "1"

The key thing here is that dynamicStableScale is set.

  3. Are you able to provide some explicit steps/example rollout to reproduce the issue?

This is a bit tougher. From what I know, to reproduce the issue:

  1. Create an SMI canary rollout (at the very least)
  2. Set dynamicStableScale to true
  3. Have a pause duration on the final step for some number of minutes after the weight has been set to 100
  4. During this time, reduce the capacity on your cluster so that the replicasets are not all healthy (this must be done while the final step is finishing its pause duration)
  5. This should force Argo to delay swapping the service selectors but still call the SMI traffic routing reconciler to swap the weights on the TrafficSplit.

At this point, I am not sure whether the bug applies just to SMI Rollouts or to any Rollout whose traffic router does not need knowledge of the replicaset hash at any given time.

  4. If easily reproducible, could you see if the issue still exists in v1.3.0-rc1?

I can see if it still exists there, for sure! I don't think this is easily reproducible in a kind or minikube cluster yet, but I think I can take away some instances on one of our testing clusters to simulate this.

We had some discussion about this. The bug might be in reconcileTrafficRouting() where we are adjusting weights even though selectors are not yet swapped, and the fix may need to consider current label selectors before making the weight change of 0% canary.

I agree. I was looking into that when I last had to put this work down: we found a workaround (just disable dynamicStableScale, at the cost of many extra instances while deploying) and I was moved onto some other projects.

From what I could tell (starting from the point where we have delayed swapping the selectors and want to reconcile traffic routing at the final step), the desired weights are computed as:

&TrafficWeights{
    Canary:WeightDestination{
        Weight:0,
        ServiceName:server-canary,
        PodTemplateHash:7cf9c66479,
    },
    Stable:WeightDestination{
        Weight:100,
        ServiceName:server-stable,
        PodTemplateHash:7cf9c66479,
    },
    Additional:[]WeightDestination{},
    Verified:nil,
}
  • Normally this would be okay, but the SMI traffic reconciler doesn't track or care about PodTemplateHash (it builds the TrafficSplit backends from service names and weights only; see backends := []smiv1alpha3.TrafficSplitBackend{{ ... in the SMI reconciler), so we are switching traffic from the canary to the stable, but the stable service points to a replicaset that has been at 0 replicas for the past 30 minutes.
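
To make that concrete, here is a stripped-down sketch of why the hash is lost at this layer (my own struct definitions approximating the WeightDestination and TrafficSplit backend shapes, not the real smiv1alpha3 types):

package main

import "fmt"

// Rough approximations of the shapes involved; the real types are
// v1alpha1.WeightDestination and smiv1alpha3.TrafficSplitBackend.
type weightDestination struct {
	Weight          int32
	ServiceName     string
	PodTemplateHash string
}

type trafficSplitBackend struct {
	Service string
	Weight  int
}

// Only the service name and weight make it into the TrafficSplit backends;
// the PodTemplateHash is dropped, so the split has no way to notice that the
// "stable" Service still selects an old, scaled-down replicaset.
func makeBackends(canary, stable weightDestination) []trafficSplitBackend {
	return []trafficSplitBackend{
		{Service: canary.ServiceName, Weight: int(canary.Weight)},
		{Service: stable.ServiceName, Weight: int(stable.Weight)},
	}
}

func main() {
	canary := weightDestination{Weight: 0, ServiceName: "server-canary", PodTemplateHash: "7cf9c66479"}
	stable := weightDestination{Weight: 100, ServiceName: "server-stable", PodTemplateHash: "7cf9c66479"}
	fmt.Printf("%+v\n", makeBackends(canary, stable)) // the hashes are gone
}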

The only way I could think of to fix this was to make the traffic reconciler aware of the hashes: add a map to the reconciler that is updated with the newest hashes in UpdateHash, and then validate, when calling SetWeight, that the hashes on the WeightDestination are what we expect them to be in the cluster.

That being said, it could probably be simplified by just having UpdateHash check that the weight destination's PodTemplateHashes match what is in the cluster and return an error if they do not.
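
A rough, self-contained sketch of that hash-aware idea (method names follow the discussion above, but the signatures and the lookup helper are illustrative, not the real TrafficRoutingReconciler interface):

package main

import (
	"errors"
	"fmt"
)

// hashAwareReconciler records the hashes it is told about in UpdateHash and
// refuses to change weights in SetWeight if the Services in the cluster do
// not actually select those hashes yet.
type hashAwareReconciler struct {
	canaryHash, stableHash string
	// currentSelectorHash would look up the rollouts-pod-template-hash
	// selector currently set on a Service in the cluster (illustrative).
	currentSelectorHash func(serviceName string) string
	applyWeight         func(weight int32) error
}

func (r *hashAwareReconciler) UpdateHash(canaryHash, stableHash string) error {
	r.canaryHash, r.stableHash = canaryHash, stableHash
	return nil
}

func (r *hashAwareReconciler) SetWeight(weight int32, canarySvc, stableSvc string) error {
	if got := r.currentSelectorHash(stableSvc); got != r.stableHash {
		return errors.New("stable service selector does not match the expected hash; refusing to set weight")
	}
	if got := r.currentSelectorHash(canarySvc); got != r.canaryHash {
		return errors.New("canary service selector does not match the expected hash; refusing to set weight")
	}
	return r.applyWeight(weight)
}

func main() {
	selectors := map[string]string{
		"server-stable": "54cdff4499", // the swap was delayed: still the old hash
		"server-canary": "7cf9c66479",
	}
	r := &hashAwareReconciler{
		currentSelectorHash: func(svc string) string { return selectors[svc] },
		applyWeight:         func(w int32) error { fmt.Println("TrafficSplit weight:", w); return nil },
	}
	_ = r.UpdateHash("7cf9c66479", "7cf9c66479")
	fmt.Println(r.SetWeight(0, "server-canary", "server-stable"))
}

Either variant keeps the TrafficSplit from being updated while the Services still select the wrong replicasets.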

@jandersen-plaid (Author) commented Nov 8, 2022

oh no 😮‍💨 (really bad rebase)

sonarcloud bot commented Nov 26, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A), 0 Vulnerabilities (rating A), 0 Security Hotspots (rating A), 0 Code Smells (rating A)
No coverage information; 0.0% duplication

@zachaller (Collaborator)

going to close in favor of #2441
