
fix: return an error when we cannot swap the replicaset hashes fixes #2050 #2187

Conversation

@jandersen-plaid commented Aug 10, 2022

This fixes #2050 by ensuring that, if a rollout needs to delay swapping the replicaset hashes, we do not continue reconciling.

If we were to continue reconciling, we would end up moving the stable destination (which should by this point be the canary replicaset) to 100% and the canary destination (which should be the stable replicaset) to 0%. Importantly, if this happens at the end of a rollout where the stable replicaset has been spun down, it will cause a traffic outage because the stable replicaset will have 0 replicas available.

To illustrate, the failure mode is:

  1. We are at the end of a rollout and start to run reconciliation
  2. We see that the canary replicaset is unhealthy, so we do not swap the selectors
  3. We continue to reconcile and set the canary replicaset to 0 and the stable replicaset to 100 (if pods have been removed during this process, the stable replicaset could be overwhelmed)
  4. On a subsequent reconciliation we see that the canary replicaset is healthy
  5. We swap the canary and stable selectors
  6. Traffic to the correct stable replicaset (the former canary replicaset) finally goes to 100

The troublesome bit here is step 3, which can be catastrophic if the canary replicaset remains unhealthy for an indeterminate period of time. Returning an error instead only extends the period during which we are at the end of the rollout but have not yet completed it.
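
To make the intended change concrete, here is a minimal, self-contained sketch of the behavior (type, field, and function names below are illustrative stand-ins, not the controller's actual ones; the real change touches rollout/service.go and rollout/trafficrouting.go):

package main

import "fmt"

// Illustrative stand-ins for the controller's objects.
type replicaSet struct {
	podTemplateHash   string
	replicas          int32
	availableReplicas int32
}

type service struct {
	selectorHash string // value of the rollouts-pod-template-hash selector
}

// ensureServiceTargets models the idea in this PR: if the replicaset the
// service should point at is not fully available, do NOT silently skip the
// selector swap and keep reconciling -- return an error so the whole
// reconciliation (including the traffic-weight update) stops and is retried.
func ensureServiceTargets(svc *service, rs *replicaSet) error {
	if svc.selectorHash == rs.podTemplateHash {
		return nil // selector already points at the desired replicaset
	}
	if rs.availableReplicas < rs.replicas {
		return fmt.Errorf("delaying service switch from %s to %s: ReplicaSet not fully available",
			svc.selectorHash, rs.podTemplateHash)
	}
	svc.selectorHash = rs.podTemplateHash // safe to swap
	return nil
}

func main() {
	stableSvc := &service{selectorHash: "54cdff4499"}
	newRS := &replicaSet{podTemplateHash: "7cf9c66479", replicas: 5, availableReplicas: 3}

	if err := ensureServiceTargets(stableSvc, newRS); err != nil {
		// The rollout is requeued; traffic weights are left untouched until
		// the selector swap can actually happen.
		fmt.Println("requeue:", err)
	}
}

Previously the delay was only logged and reconciliation carried on, which is what allows step 3 above to happen.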

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

github-actions bot (Contributor) commented Aug 10, 2022

Go Published Test Results

1 801 tests: 1 801 passed ✔️, 0 skipped 💤, 0 failed; 102 suites, 1 file; 2m 31s ⏱️

Results for commit c5ffcf2.

♻️ This comment has been updated with latest results.

codecov bot commented Aug 10, 2022

Codecov Report

Base: 81.58% // Head: 81.60% // Increases project coverage by +0.02% 🎉

Coverage data is based on head (c5ffcf2) compared to base (c3e305d).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2187      +/-   ##
==========================================
+ Coverage   81.58%   81.60%   +0.02%     
==========================================
  Files         124      124              
  Lines       18959    18965       +6     
==========================================
+ Hits        15467    15476       +9     
+ Misses       2702     2700       -2     
+ Partials      790      789       -1     
Impacted Files              Coverage Δ
rollout/context.go          91.35% <ø> (ø)
rollout/service.go          77.94% <100.00%> (+0.21%) ⬆️
rollout/trafficrouting.go   77.01% <100.00%> (+1.52%) ⬆️

☔ View full report at Codecov.

github-actions bot (Contributor) commented Aug 10, 2022

E2E Tests Published Test Results

2 files, 2 suites, 1h 33m 25s ⏱️; 92 tests: 87 passed ✔️, 3 skipped 💤, 2 failed; 190 runs: 178 passed ✔️, 6 skipped 💤, 6 failed

For more details on these failures, see this check.

Results for commit c5ffcf2.

♻️ This comment has been updated with latest results.

@jessesuen (Member):


We continue to reconcile and set the canary replicaset to 0 and the stable replicaset to 100

@jandersen-plaid is the underlying bug actually here? I'm wondering if the bug is better fixed by choosing not to change the replica counts because the selectors are not yet swapped?

@jandersen-plaid (Author) commented Aug 10, 2022

We continue to reconcile and set the canary replicaset to 0 and the stable replicaset to 100

@jandersen-plaid is the underlying bug actually here? I'm wondering if the bug is better fixed by choosing not to change the replica counts because the selectors are not yet swapped?

Yea that works for me! This is just the quickest way I could think of to achieve the desired outcome.

The only issue I can see with choosing not to change the replica counts is that you have to communicate that decision between the reconcileStableAndCanaryService reconciliation function and the reconcileTrafficRouting function -- it breaks the pattern of only returning errors, so I was not sure whether passing a boolean about whether or not to swap would be un-idiomatic.

Sorry, I was reading too fast and misunderstood.

I don't think this is necessarily an issue with replica counts, but with the percentage of traffic routed to a specific replicaset -- one moment, I will post a fuller explanation and example.

@jandersen-plaid (Author) commented Aug 10, 2022

Consider the following rollout logs:

time                 log
2022-08-08 20:26:07  Reconciling TrafficRouting with type 'SMI'
2022-08-08 20:26:07  Found 1 TrafficRouting Reconcilers
2022-08-08 20:26:07  Switched selector for service 'server-stable' from '54cdff4499' to '7cf9c66479'
2022-08-08 20:26:07  Started syncing rollout
...                  ...
2022-08-08 20:24:18  Event(v1.ObjectReference{Kind:"Rollout", Namespace:"team", Name:"server", UID:"78dc26ef-012c-48c8-a0a6-2a4c3dd5d941", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"697247822", FieldPath:""}): type: 'Normal' reason: 'TrafficWeightUpdated' Traffic weight updated from 100 to 0
2022-08-08 20:24:18  No Steps remain in the canary steps
2022-08-08 20:24:18  No StableRS exists to reconcile or matches newRS
2022-08-08 20:24:18  Traffic weight updated from 100 to 0
2022-08-08 20:24:18  New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:server-canary,PodTemplateHash:7cf9c66479,},Stable:WeightDestination{Weight:100,ServiceName:server-stable,PodTemplateHash:7cf9c66479,},Additional:[]WeightDestination{},Verified:nil,}
2022-08-08 20:24:18  Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:100,ServiceName:server-canary,PodTemplateHash:7cf9c66479,},Stable:WeightDestination{Weight:0,ServiceName:server-stable,PodTemplateHash:54cdff4499,},Additional:[]WeightDestination{},Verified:nil,}
2022-08-08 20:24:18  Analysis has been deleted
2022-08-08 20:24:18  Started syncing Analysis at (2022-08-09 00:24:18.052059562 +0000 UTC m=+3509967.309868994)
2022-08-08 20:24:18  Enqueueing parent of team/server-7b5599bddb-255: Rollout team/server
2022-08-08 20:24:18  Reconciling TrafficRouting with type 'SMI'
2022-08-08 20:24:18  Enqueueing parent of team/server-7b5599bddb-255: Rollout team/server
2022-08-08 20:24:18  Found 1 TrafficRouting Reconcilers
2022-08-08 20:24:18  delaying service switch from 54cdff4499 to 7cf9c66479: ReplicaSet not fully available

The issue is at 20:24:18, where we have changed the weights via the traffic routing controller even though the service switch was being delayed. The replica counts have already maxed out with the last step, and the rollout has "finished".

the bug is better fixed by choosing not to change the replica counts because the selectors are not yet swapped?

I don't think this solves the issue because the replica counts are already set to 0 (for the stable) and the maximum (for the canary). That being said, I don't know these logs like you do so feel free to correct me on the timeline.

sonarcloud bot commented Aug 10, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A), 0 Vulnerabilities (rating A), 0 Security Hotspots (rating A), 1 Code Smell (rating A)
No coverage information; 0.0% duplication

@jandersen-plaid (Author)

Considering the number of end-to-end test failures, this is definitely the wrong approach though.

@jessesuen (Member) commented Aug 17, 2022

Some questions:

  1. What version of Rollouts are you running?
  2. Could you paste the Rollout spec so we can understand all the options in play?
  3. Are you able to provide some explicit steps/example rollout to reproduce the issue?
  4. If easily reproducible, could you see if the issue still exists in v1.3.0-rc1?

We had some discussion about this. The bug might be in reconcileTrafficRouting() where we are adjusting weights even though selectors are not yet swapped, and the fix may need to consider current label selectors before making the weight change of 0% canary.
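
For illustration, a rough sketch of what "considering the current label selectors" could look like (simplified, illustrative types; not the actual reconcileTrafficRouting code):

package main

import "fmt"

// Before dropping the canary weight to 0% at the end of the rollout, check
// that the stable Service already selects the replicaset that is about to
// receive 100% of traffic. Types and field names are illustrative.
type stableService struct {
	selectorHash string // current rollouts-pod-template-hash selector value
}

type desiredWeights struct {
	canaryWeight int32
	stableHash   string // hash the weights were computed for
}

func reconcileWeights(svc stableService, want desiredWeights, setWeight func(int32) error) error {
	if want.canaryWeight == 0 && svc.selectorHash != want.stableHash {
		// Selectors were not swapped yet (e.g. the new replicaset is not fully
		// available). Setting canary to 0% now would send all traffic to a
		// service that still selects the old, possibly scaled-down replicaset.
		return fmt.Errorf("skipping weight change: stable service selects %s, expected %s",
			svc.selectorHash, want.stableHash)
	}
	return setWeight(want.canaryWeight)
}

func main() {
	svc := stableService{selectorHash: "54cdff4499"} // swap was delayed
	want := desiredWeights{canaryWeight: 0, stableHash: "7cf9c66479"}

	err := reconcileWeights(svc, want, func(w int32) error {
		fmt.Println("SetWeight", w)
		return nil
	})
	fmt.Println(err) // the weight change is deferred instead of blackholing traffic
}

In the failing logs above, such a guard would have tripped, because the stable Service still selected 54cdff4499 while the new weights were computed for 7cf9c66479.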

@jandersen-plaid (Author)

Hey @jessesuen thanks for following up here!

Some questions:

  1. What version of Rollouts are you running?
    We are running Argo Rollouts v1.2.0:
      containers:
      - image: "245200388354.dkr.ecr.us-east-1.amazonaws.com/quay/argoproj/argo-rollouts:v1.2.0"
        args:
        - --leader-elect
        imagePullPolicy: IfNotPresent
        name: argo-rollouts
  2. Could you paste the Rollout spec so we can understand all the options in play?

Here it is:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  labels:    
    app.kubernetes.io/managed-by: "Helm"
    app.kubernetes.io/name: "server"
    app.kubernetes.io/component: "server"
    k8s.plaid.io/team: "cs"
    app.kubernetes.io/version: "3.20.0"
    helm.sh/chart: "plaid-app-3.20.0"
  name: server
  namespace: cs-team
spec:
  replicas: 450
  strategy:
    canary:
      dynamicStableScale: true
      canaryService: server-canary
      canaryMetadata:
        labels:
          k8s.plaid.io/deployment-role: canary
      stableService: server-stable
      stableMetadata:
        labels:
          k8s.plaid.io/deployment-role: stable

      trafficRouting:
        smi:
          trafficSplitName: server-traffic-split
          rootService: server
      analysis:
        templates:
        - templateName: server-analysis
        args:
        - name: service
          valueFrom:
            fieldRef:
              fieldPath: "metadata.labels['app.kubernetes.io/name']"
        - name: commit
          valueFrom:
            fieldRef:
              fieldPath: "metadata.labels['k8s.plaid.io/commit-sha']"
        - name: stable-hash
          valueFrom:
            podTemplateHashValue: Stable
        - name: latest-hash
          valueFrom:
            podTemplateHashValue: Latest
      steps:
        - setWeight: 1
        - pause:
            duration: 10m0s
        - setWeight: 10
        - pause:
            duration: 20m0s
        - setWeight: 25
        - pause:
            duration: 30m0s
        - setWeight: 50
        - pause:
            duration: 30m0s
        - setWeight: 100
        - pause:
            duration: 30m0s
  revisionHistoryLimit: 2
  selector:
    matchLabels:      
      app.kubernetes.io/managed-by: "Helm"
      app.kubernetes.io/name: "server"
      app.kubernetes.io/component: "server"
      k8s.plaid.io/team: "cs"
  template:    
    metadata:
      labels:    
        app.kubernetes.io/managed-by: "Helm"
        app.kubernetes.io/name: "server"
        app.kubernetes.io/component: "server"
        k8s.plaid.io/team: "cs"
        app.kubernetes.io/version: "3.20.0"
        helm.sh/chart: "plaid-app-3.20.0"    
        k8s.plaid.io/commit-sha: "05354ced8b0e3597227a5d6958aab2222dbe28d3"
      annotations:
        iam.amazonaws.com/role: k8s-server-production
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
        linkerd.io/inject: enabled
        config.linkerd.io/skip-outbound-ports: "25,3306,5432,6379,11211,27017"
        config.linkerd.io/close-wait-timeout: "3600s"
        config.linkerd.io/proxy-cpu-request: "100m"
        config.linkerd.io/proxy-memory-request: "100Mi"
        config.linkerd.io/proxy-cpu-limit: ""
        config.linkerd.io/proxy-memory-limit: "1Gi"
        config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "185"
        config.linkerd.io/proxy-log-format: "json"
        config.linkerd.io/proxy-log-level: warn,linkerd2_proxy=info    
    spec:
      terminationGracePeriodSeconds: 185
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: k8s.plaid.io/spot
                operator: DoesNotExist
                values: []
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: server
                  app.kubernetes.io/name: server
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - name: app
        image: "server:05354ced"
        resources:
          limits:
            cpu: 8000m
            memory: 16Gi
          requests:
            cpu: 4000m
            memory: 12Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8025
          initialDelaySeconds: 60
          periodSeconds: 60
          successThreshold: 1 # liveness must be 1
          failureThreshold: 5
          timeoutSeconds: 60
        readinessProbe:
          httpGet:
            path: /health
            port: 8025
          initialDelaySeconds: 60
          periodSeconds: 60
          successThreshold: 1
          failureThreshold: 5
          timeoutSeconds: 60
        command:
        - "./start_server.sh"
        env:
        - name: PLAID_ENV
          value: "production"
        - name: K8S_METADATA_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: K8S_METADATA_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: K8S_LABEL_APP_KUBERNETES_IO_MANAGED_BY
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app.kubernetes.io/managed-by']
        - name: K8S_LABEL_APP_KUBERNETES_IO_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app.kubernetes.io/name']
        - name: K8S_LABEL_APP_KUBERNETES_IO_TEAM
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['k8s.plaid.io/team']
        - name: K8S_LABEL_APP_KUBERNETES_IO_VERSION
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app.kubernetes.io/version']
        - name: K8S_LABEL_HELM_SH_CHART
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['helm.sh/chart']
        - name: K8S_LABEL_K8S_PLAID_IO_NETWORK_ACCESS_GROUP
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['k8s.plaid.io/network-access-group']
        - name: K8S_HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: K8S_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        ports:
        - name: main
          containerPort: 8024
        - name: metrics
          containerPort: 8025
      dnsConfig:
        options:
        - name: ndots
          value: "1"

The key thing here is that dynamicStableScale is set.

  3. Are you able to provide some explicit steps/example rollout to reproduce the issue?

This is a bit tougher. From what I know, to reproduce the issue:

  1. Create an SMI canary rollout (at the very least)
  2. Set dynamicStableScale to true
  3. Have a pause duration on the final step for some number of minutes after the weight has been set to 100
  4. During this time, reduce the capacity on your cluster so that the replicasets are not all healthy (this must be done while the final step is finishing its pause duration)
  5. This should force Argo to delay swapping the service selectors but still call the SMI traffic routing reconciler to swap the weights on the TrafficSplit.

At this point, I am not sure whether the bug applies just to SMI Rollouts or to any Rollout whose traffic router does not need knowledge of the replicaset hash at any given time.

  4. If easily reproducible, could you see if the issue still exists in v1.3.0-rc1?

I can see if it still exists there, for sure! I don't think this is easily reproducible in a kind or minikube cluster yet, but I think I can take away some instances on one of our testing clusters to simulate this.

We had some discussion about this. The bug might be in reconcileTrafficRouting() where we are adjusting weights even though selectors are not yet swapped, and the fix may need to consider current label selectors before making the weight change of 0% canary.

I agree. I was looking into that when I last had to put this work down: we found a workaround (just disable dynamicStableScale, at the cost of many extra instances while deploying) and I was moved onto some other projects.

From what I could tell (starting from the point where we have delayed swapping the selectors and want to reconcile traffic routing at the final step), the desired weights are computed as:

&TrafficWeights{
    Canary:WeightDestination{
        Weight:0,
        ServiceName:server-canary,
        PodTemplateHash:7cf9c66479,
    },
    Stable:WeightDestination{
        Weight:100,
        ServiceName:server-stable,
        PodTemplateHash:7cf9c66479,
    },
    Additional:[]WeightDestination{},
    Verified:nil,
}
  • Normally this would be okay, but the SMI traffic reconciler doesn't track or care about PodTemplateHash (it builds the TrafficSplit backends from service names and weights only; see backends := []smiv1alpha3.TrafficSplitBackend{{ ... in the SMI reconciler), so we are switching traffic from the canary to the stable, but the stable service points to a replicaset that has been at 0 replicas for the past 30 minutes.
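
To make that concrete, here is a stripped-down sketch of why the hash is lost at this layer (my own struct definitions approximating the WeightDestination and TrafficSplit backend shapes, not the real smiv1alpha3 types):

package main

import "fmt"

// Rough approximations of the shapes involved; the real types are
// v1alpha1.WeightDestination and smiv1alpha3.TrafficSplitBackend.
type weightDestination struct {
	Weight          int32
	ServiceName     string
	PodTemplateHash string
}

type trafficSplitBackend struct {
	Service string
	Weight  int
}

// Only the service name and weight make it into the TrafficSplit backends;
// the PodTemplateHash is dropped, so the split has no way to notice that the
// "stable" Service still selects an old, scaled-down replicaset.
func makeBackends(canary, stable weightDestination) []trafficSplitBackend {
	return []trafficSplitBackend{
		{Service: canary.ServiceName, Weight: int(canary.Weight)},
		{Service: stable.ServiceName, Weight: int(stable.Weight)},
	}
}

func main() {
	canary := weightDestination{Weight: 0, ServiceName: "server-canary", PodTemplateHash: "7cf9c66479"}
	stable := weightDestination{Weight: 100, ServiceName: "server-stable", PodTemplateHash: "7cf9c66479"}
	fmt.Printf("%+v\n", makeBackends(canary, stable)) // the hashes are gone
}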

The only way I could think of to fix this was to make the traffic reconciler aware of the hashes: add a map to the reconciler that is updated with the newest hashes in UpdateHash, and then validate, when calling SetWeight, that the hashes on the WeightDestination are what we expect them to be in the cluster.

That being said, it could probably be simplified by just having UpdateHash check that the weight destination's PodTemplateHashes match what is in the cluster and return an error if they do not.
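
A rough, self-contained sketch of that hash-aware idea (method names follow the discussion above, but the signatures and the lookup helper are illustrative, not the real TrafficRoutingReconciler interface):

package main

import (
	"errors"
	"fmt"
)

// hashAwareReconciler records the hashes it is told about in UpdateHash and
// refuses to change weights in SetWeight if the Services in the cluster do
// not actually select those hashes yet.
type hashAwareReconciler struct {
	canaryHash, stableHash string
	// currentSelectorHash would look up the rollouts-pod-template-hash
	// selector currently set on a Service in the cluster (illustrative).
	currentSelectorHash func(serviceName string) string
	applyWeight         func(weight int32) error
}

func (r *hashAwareReconciler) UpdateHash(canaryHash, stableHash string) error {
	r.canaryHash, r.stableHash = canaryHash, stableHash
	return nil
}

func (r *hashAwareReconciler) SetWeight(weight int32, canarySvc, stableSvc string) error {
	if got := r.currentSelectorHash(stableSvc); got != r.stableHash {
		return errors.New("stable service selector does not match the expected hash; refusing to set weight")
	}
	if got := r.currentSelectorHash(canarySvc); got != r.canaryHash {
		return errors.New("canary service selector does not match the expected hash; refusing to set weight")
	}
	return r.applyWeight(weight)
}

func main() {
	selectors := map[string]string{
		"server-stable": "54cdff4499", // the swap was delayed: still the old hash
		"server-canary": "7cf9c66479",
	}
	r := &hashAwareReconciler{
		currentSelectorHash: func(svc string) string { return selectors[svc] },
		applyWeight:         func(w int32) error { fmt.Println("TrafficSplit weight:", w); return nil },
	}
	_ = r.UpdateHash("7cf9c66479", "7cf9c66479")
	fmt.Println(r.SetWeight(0, "server-canary", "server-stable"))
}

Either variant keeps the TrafficSplit from being updated while the Services still select the wrong replicasets.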

@jandersen-plaid (Author) commented Nov 8, 2022

oh no 😮‍💨 (really bad rebase)

sonarcloud bot commented Nov 26, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A), 0 Vulnerabilities (rating A), 0 Security Hotspots (rating A), 0 Code Smells (rating A)
No coverage information; 0.0% duplication

@zachaller (Collaborator)

going to close in favor of #2441
