
After abort, rollout does not update Istio DestinationRule #1838

Open
mksha opened this issue Feb 2, 2022 · 42 comments

@mksha
Contributor

mksha commented Feb 2, 2022

Summary

What happened/what you expected to happen?

When we abort a rollout, it should update the Istio DestinationRule so that the subsets in the DestinationRule carry the correct rollouts-pod-template-hash labels.

But currently, aborting the rollout does not update the DestinationRule, causing the canary endpoint to fail.

Diagnostics

What version of Argo Rollouts are you running?
0.3.0

# Paste the logs from the rollout controller
The canary subset is not updated with the correct rollouts-pod-template-hash label value.

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>
The logs do not have any reference to a DestinationRule update after the abort action.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@mksha mksha added the bug Something isn't working label Feb 2, 2022
@huikang
Member

huikang commented Feb 2, 2022

Hi, @mksha, could you confirm whether this happens on v0.3.0? Have you tried v1.1.1 or the latest argo-rollouts image?

@mksha
Contributor Author

mksha commented Feb 2, 2022

yes, we are using 1.1.1

@huikang
Member

huikang commented Feb 2, 2022

@mksha, could you paste here the output of get rollout <rollout name> and get destinationrule <name> -o yaml? Thanks.

@mksha
Contributor Author

mksha commented Feb 2, 2022

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  labels:
    region: us-east-1
    system_risk_class: "1"
  managedFields:
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:rollout.argoproj.io/revision: {}
      f:spec:
        f:replicas: {}
      f:status:
        .: {}
        f:HPAReplicas: {}
        f:availableReplicas: {}
        f:blueGreen: {}
        f:canary:
          .: {}
          f:weights:
            .: {}
            f:canary:
              .: {}
              f:podTemplateHash: {}
              f:weight: {}
            f:stable:
              .: {}
              f:podTemplateHash: {}
              f:weight: {}
        f:conditions: {}
        f:currentPodHash: {}
        f:currentStepHash: {}
        f:currentStepIndex: {}
        f:observedGeneration: {}
        f:phase: {}
        f:readyReplicas: {}
        f:replicas: {}
        f:selector: {}
        f:stableRS: {}
        f:updatedReplicas: {}
    manager: rollouts-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
          f:app_component: {}
          f:app_contacts: {}
          f:app_environment: {}
          f:application: {}
          f:argocd.argoproj.io/instance: {}
          f:business_unit: {}
          f:created_by: {}
          f:function: {}
          f:helm.sh/chart: {}
          f:network_environment: {}
          f:region: {}
          f:system_risk_class: {}
      f:spec:
        .: {}
        f:minReadySeconds: {}
        f:paused: {}
        f:progressDeadlineSeconds: {}
        f:revisionHistoryLimit: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:app.kubernetes.io/component: {}
            f:app.kubernetes.io/instance: {}
            f:app.kubernetes.io/managed-by: {}
            f:app.kubernetes.io/name: {}
            f:app_component: {}
            f:app_environment: {}
            f:application: {}
            f:business_unit: {}
            f:function: {}
            f:network_environment: {}
            f:region: {}
        f:strategy:
          .: {}
          f:canary:
            .: {}
            f:antiAffinity:
              .: {}
              f:preferredDuringSchedulingIgnoredDuringExecution:
                .: {}
                f:weight: {}
            f:canaryMetadata:
              .: {}
              f:annotations:
                .: {}
                f:role: {}
              f:labels:
                .: {}
                f:role: {}
            f:stableMetadata:
              .: {}
              f:annotations:
                .: {}
                f:role: {}
              f:labels:
                .: {}
                f:role: {}
            f:steps: {}
            f:trafficRouting:
              .: {}
              f:istio:
                .: {}
                f:destinationRule:
                  .: {}
                  f:canarySubsetName: {}
                  f:name: {}
                  f:stableSubsetName: {}
                f:virtualServices: {}
        f:template:
          .: {}
          f:metadata:
            .: {}
            f:labels:
              .: {}
              f:app.kubernetes.io/component: {}
              f:app.kubernetes.io/instance: {}
              f:app.kubernetes.io/managed-by: {}
              f:app.kubernetes.io/name: {}
              f:app.kubernetes.io/version: {}
              f:app_component: {}
              f:app_contacts: {}
              f:app_environment: {}
              f:application: {}
              f:business_unit: {}
              f:created_by: {}
              f:function: {}
              f:helm.sh/chart: {}
              f:network_environment: {}
              f:region: {}
              f:system_risk_class: {}
          f:spec:
            .: {}
            f:affinity:
              .: {}
              f:nodeAffinity:
                .: {}
                f:preferredDuringSchedulingIgnoredDuringExecution: {}
              f:podAntiAffinity:
                .: {}
                f:preferredDuringSchedulingIgnoredDuringExecution: {}
            f:containers: {}
            f:imagePullSecrets: {}
            f:serviceAccountName: {}
            f:volumes: {}
    manager: argocd-application-controller
    operation: Update
    time: "2022-02-02T15:10:49Z"
  name: testapp
  namespace: test-ns
spec:
  minReadySeconds: 60
  paused: false
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: bu
      function: appserver
      network_environment: dev
      region: us-east-1
  strategy:
    canary:
      antiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          weight: 1
      canaryMetadata:
        annotations:
          role: canary
        labels:
          role: canary
      stableMetadata:
        annotations:
          role: stable
        labels:
          role: stable
      steps:
      - setWeight: 10
      - pause: {}
      - setWeight: 50
      - pause: {}
      - setWeight: 100
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: testapp
            stableSubsetName: stable
          virtualServices:
          - name: testapp-internal
            routes:
            - NewSessionDoc
            - NewSessionDocInternal
          - name: testapp-external
            routes:
            - NewSessionDoc
  template:
    metadata:
      labels:
        app.kubernetes.io/component: testapp
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: testapp
        app.kubernetes.io/version: 09c502fab0a1043740c21566dec377ce9f17e3cb-9
        app_component: testapp
        app_environment: dev
        application: testapp
        business_unit: bu
        function: appserver
        helm.sh/chart: webapp-5.0.1-alpha.9
        network_environment: dev
        region: us-east-1
        system_risk_class: "1"
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: workload
                operator: In
                values:
                - test
            weight: 50
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/test
                operator: In
                values:
                - ""
            weight: 1
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - testapp
                - key: app.kubernetes.io/version
                  operator: In
                  values:
                  - 09c502fab0a1043740c21566dec377ce9f17e3cb-9
              topologyKey: topology.kubernetes.io/zone
            weight: 51
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - testapp
                - key: app.kubernetes.io/version
                  operator: In
                  values:
                  - 09c502fab0a1043740c21566dec377ce9f17e3cb-9
              topologyKey: kubernetes.io/hostname
            weight: 50
      containers:
      - env:
        - name: JAVA_MS
          value: -Xms1g
        - name: JAVA_MX
          value: -Xmx1g
        - name: PROFILE_ENV
          value: dev
        image: testapp:09c502fab0a1043740c21566dec377ce9f17e3cb-9
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /keepalive.html
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 150
          successThreshold: 1
        name: testapp
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /keepalive.html
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 150
          successThreshold: 1
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 1
            memory: 1.5Gi
        volumeMounts:
        - mountPath: /web/company/docs/
          name: pv-volume
      imagePullSecrets:
      - name: credreg
      serviceAccountName: testapp
      volumes:
      - name: pv-volume
        nfs:
          path: /web/company/docs/
          server: company.com
status:
  HPAReplicas: 1
  availableReplicas: 1
  blueGreen: {}
  canary:
    weights:
      canary:
        podTemplateHash: 649fdf8c7d
        weight: 0
      stable:
        podTemplateHash: 649fdf8c7d
        weight: 100
  conditions:
  - lastTransitionTime: "2022-02-02T15:08:35Z"
    lastUpdateTime: "2022-02-02T15:08:35Z"
    message: Rollout is paused
    reason: RolloutPaused
    status: "False"
    type: Paused
  - lastTransitionTime: "2022-02-02T15:10:44Z"
    lastUpdateTime: "2022-02-02T15:10:44Z"
    message: Rollout has minimum availability
    reason: AvailableReason
    status: "True"
    type: Available
  - lastTransitionTime: "2022-02-02T15:13:19Z"
    lastUpdateTime: "2022-02-02T15:13:19Z"
    message: RolloutCompleted
    reason: RolloutCompleted
    status: "True"
    type: Completed
  - lastTransitionTime: "2022-02-02T15:08:35Z"
    lastUpdateTime: "2022-02-02T15:13:19Z"
    message: ReplicaSet "testapp-649fdf8c7d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  currentPodHash: 649fdf8c7d
  currentStepHash: 5cd664bc9d
  currentStepIndex: 5
  observedGeneration: "115"
  phase: Healthy
  readyReplicas: 1
  replicas: 1
  selector: app.kubernetes.io/component=testapp,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=testapp,app_component=testapp,app_environment=dev,application=testapp,business_unit=bu,function=appserver,network_environment=dev,region=us-east-1
  stableRS: 649fdf8c7d
  updatedReplicas: 1

@mksha
Contributor Author

mksha commented Feb 2, 2022

@huikang

@mksha
Contributor Author

mksha commented Feb 2, 2022

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  annotations:
    argo-rollouts.argoproj.io/managed-by-rollouts: testapp
  creationTimestamp: "2021-11-29T13:48:24Z"
  generation: 147
  labels:
    app.kubernetes.io/component: testapp
  managedFields:
  - apiVersion: networking.istio.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
          f:app_component: {}
          f:app_contacts: {}
          f:app_environment: {}
          f:application: {}
          f:argocd.argoproj.io/instance: {}
          f:business_unit: {}
          f:created_by: {}
          f:function: {}
          f:helm.sh/chart: {}
          f:network_environment: {}
          f:region: {}
          f:system_risk_class: {}
      f:spec:
        .: {}
        f:host: {}
        f:trafficPolicy:
          .: {}
          f:loadBalancer:
            .: {}
            f:consistentHash:
              .: {}
              f:httpCookie:
                .: {}
                f:name: {}
                f:path: {}
                f:ttl: {}
    manager: argocd-application-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  - apiVersion: networking.istio.io/v1alpha3
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:argo-rollouts.argoproj.io/managed-by-rollouts: {}
      f:spec:
        f:subsets: {}
    manager: rollouts-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  name: testapp
  namespace: default
  resourceVersion: "1013555439"
  selfLink: /apis/networking.istio.io/v1beta1/namespaces/default/destinationrules/testapp
  uid: 4d21d1d1-e039-415d-942a-b7ce99dbdd93
spec:
  host: testapp
  subsets:
  - labels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: pdo
      function: appserver
      network_environment: dev
      region: us-east-1
      rollouts-pod-template-hash: 56b5f6b98d
    name: canary
  - labels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: pdo
      function: appserver
      network_environment: dev
      region: us-east-1
      rollouts-pod-template-hash: 649fdf8c7d
    name: stable
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: testapp
          path: /
          ttl: 1m0s

@mksha
Contributor Author

mksha commented Feb 2, 2022

@huikang let me know if you need more details.
thanks

@huikang
Member

huikang commented Feb 2, 2022

Hi, @mksha, thanks for the detailed info. It would be even better if you could provide the output of kubectl argo rollouts get rollout <rollout name>. Thanks.

@mksha
Contributor Author

mksha commented Feb 2, 2022

@jgwest @alexec

@huikang
Member

huikang commented Feb 2, 2022

Also, how about the status of the VirtualService testapp-external?

@huikang
Member

huikang commented Feb 2, 2022

I couldn't reproduce the error with the latest master branch: after aborting an update, the VirtualService is set with 100% weight to the stable subset.

http:
  - name: primary
    route:
    - destination:
        host: istio-rollout
        subset: stable
      weight: 100
    - destination:
        host: istio-rollout
        subset: canary
      weight: 0

The DestinationRule has the correct pod labels:

spec:
  host: istio-rollout
  subsets:
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: 8c48ddb95
    name: canary
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: f4dcdc666
    name: stable
# rollout status
  canary:
    weights:
      canary:
        podTemplateHash: 8c48ddb95
        weight: 0
      stable:
        podTemplateHash: f4dcdc666
        weight: 100

And the stable pod is still running

get replicaset
NAME                      DESIRED   CURRENT   READY   AGE
istio-rollout-8c48ddb95   0         0         0       7m20s
istio-rollout-f4dcdc666   1         1         1       7m36s

@huikang
Member

huikang commented Feb 2, 2022

@mksha , in your case, I noticed the following in the rollout status

  canary:
    weights:
      canary:
        podTemplateHash: 649fdf8c7d
        weight: 0
      stable:
        podTemplateHash: 649fdf8c7d
        weight: 100

The canary pod is supposed to have a different hash value. I remember a bug related to this was fixed after v1.1.1. Could you try using the latest version?

@mksha
Contributor Author

mksha commented Feb 2, 2022

Name:            testapp
Namespace:       default
Status:          ✖ Degraded
Message:         RolloutAborted: Rollout aborted update to revision 11
Strategy:        Canary
  Step:          0/5
  SetWeight:     0
  ActualWeight:  0
Images:          testapp:09c502fab0a1043740c21566dec377ce9f17e3cb-9 (stable)
Replicas:
  Desired:       1
  Current:       1
  Updated:       0
  Ready:         1
  Available:     1

NAME                                   KIND        STATUS        AGE    INFO
⟳ gdocument                            Rollout     ✖ Degraded    65d    
├──# revision:11                                                        
│  └──⧉testapp-75dfb84574           ReplicaSet  • ScaledDown  27m    canary,delay:passed
├──# revision:10                                                        
│  └──⧉testapp-649fdf8c7d           ReplicaSet  ✔ Healthy     4h3m   stable
│     └──□testapp-649fdf8c7d-wf79v  Pod         ✔ Running     107m   ready:2/2
└──# revision:9                                                         
   └──⧉testapp-56b5f6b98d           ReplicaSet  • ScaledDown  4h13m  

@mksha
Contributor Author

mksha commented Feb 2, 2022

@huikang there you go

@mksha
Contributor Author

mksha commented Feb 2, 2022

This is after aborting the rollout.

@huikang
Member

huikang commented Feb 2, 2022

@mksha , could you try the latest image? I think the issue has been fixed there.

@mksha
Contributor Author

mksha commented Feb 3, 2022

Hi @huikang, after the abort both subsets in the DestinationRule have the same rollouts-pod-template-hash: f4dcdc666, because the canary pod is not there to serve the canary traffic. If the DestinationRule does not point to stable for both the stable and canary subsets, canary requests will fail.

@mksha
Contributor Author

mksha commented Feb 3, 2022

@huikang do you mean trying an image built from the master branch?
Because I am already using v1.1.1.

@mksha
Contributor Author

mksha commented Feb 3, 2022

canary:
    weights:
      canary:
        podTemplateHash: 5865cf698b
        weight: 0
      stable:
        podTemplateHash: 649fdf8c7d
        weight: 100

@mksha
Contributor Author

mksha commented Feb 3, 2022

This is from the rollout that was aborted, using image v1.1.1.

@mksha
Contributor Author

mksha commented Feb 3, 2022

But the DestinationRule is still not updated after the canary pod deletion:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  annotations:
    argo-rollouts.argoproj.io/managed-by-rollouts: testapp
  name: testapp
spec:
  host: testapp
  subsets:
    - labels:
        app.kubernetes.io/component: testapp
        rollouts-pod-template-hash: 5865cf698b
      name: canary
    - labels:
        app.kubernetes.io/component: testapp
        rollouts-pod-template-hash: 649fdf8c7d
      name: stable

@mksha
Contributor Author

mksha commented Feb 3, 2022

@huikang after the abort, both subsets should point to the stable hash.

@mksha
Contributor Author

mksha commented Feb 3, 2022

Or the pod-template-hash label should be removed completely after the abort.

@mksha
Contributor Author

mksha commented Feb 3, 2022

I just figured out that before a rollout starts, and after a rollout completes, both the canary and stable subsets should point to the stable pod template hash.

Example:

weights:
      canary:
        podTemplateHash: 74c7887f8
        weight: 0
      stable:
        podTemplateHash: 74c7887f8
        weight: 100

But in our case after the abort, it's:

canary:
    weights:
      canary:
        podTemplateHash: 6d947796cd
        weight: 0
      stable:
        podTemplateHash: 74c7887f8
        weight: 100

That is what is being copied to the DestinationRule, and that is why it's causing issues.

@huikang
Member

huikang commented Feb 3, 2022

Tested with the latest version and got the following after aborting a canary update:

# Rollout status
status:
  canary:
    currentBackgroundAnalysisRunStatus:
      message: Run Terminated
      name: istio-rollout-8c48ddb95-2
      status: Successful
    weights:
      canary:
        podTemplateHash: 8c48ddb95
        weight: 0
      stable:
        podTemplateHash: f4dcdc666
        weight: 100

The DestinationRule matches the status:

spec:
  host: istio-rollout
  subsets:
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: 8c48ddb95
    name: canary
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: f4dcdc666
    name: stable

@mksha
Contributor Author

mksha commented Feb 4, 2022

@huikang which means it's wrong, because in that case the canary subset will point to a canary ReplicaSet that does not have any pods running, meaning canary requests will fail.

Ideally, when we do a canary deployment in production, some amount of traffic goes to the canary version of the app and we perform some analysis on it. If the analysis fails, meaning the rollout is aborted, we will have the canary subset pointing to a canary ReplicaSet that has no pods, meaning all clients using the canary version will see issues.

So we need the same behavior as at the end of a full promotion, where both the canary and stable subsets point to the stable ReplicaSet.

@huikang
Member

huikang commented Feb 4, 2022

@huikang which means its wrong, bec in that case canary subset will point to the canary replicaset that does not have any pod running meaning canary requests will fail.

Although the canary in the rollout status points to the ReplicaSet with 0 pods, the VirtualService has 0 weight for the canary ReplicaSet, so no traffic will be sent to that ReplicaSet.

So we need to have same behavior as we have at the end of full promotion where both canary and stable subset points to stable replicaset.

I agree that after the update is aborted, the status should reset so that both canary and stable point to the stable ReplicaSet.

@mksha
Contributor Author

mksha commented Feb 4, 2022

Even if the VirtualService has weight 0 for the canary, there is a case where session affinity is enabled: traffic will still go to the canary ReplicaSet, because the VirtualService has a rule defined for our canary session-affinity cookie.

Apart from that, let me know if I can help fix this bug. I already found the place where we update the hash, but I am not able to find why it differs from full promotion.

if rolloututil.IsFullyPromoted(c.rollout) {
    // when we are fully promoted. desired canary weight should be 0
} else if c.pauseContext.IsAborted() {
    // when aborted, desired canary weight should immediately be 0 (100% to stable), *unless*
    // we are using dynamic stable scaling. In that case, we are dynamically decreasing the
    // weight to the canary according to the availability of the stable (whatever it can support).
    if c.rollout.Spec.Strategy.Canary.DynamicStableScale {
        desiredWeight = 100 - ((100 * c.stableRS.Status.AvailableReplicas) / *c.rollout.Spec.Replicas)
        if c.rollout.Status.Canary.Weights != nil {
            // This ensures that if we are already at a lower weight, then we will not
            // increase the weight because stable availability is flapping (e.g. pod restarts)
            desiredWeight = minInt(desiredWeight, c.rollout.Status.Canary.Weights.Canary.Weight)
        }
    }
} else if c.newRS == nil || c.newRS.Status.AvailableReplicas == 0 {
    // when newRS is not available or replicas num is 0. never weight to canary
    weightDestinations = append(weightDestinations, c.calculateWeightDestinationsFromExperiment()...)
} else if c.rollout.Status.PromoteFull {
    // on a promote full, desired stable weight should be 0 (100% to canary),
    // But we can only increase canary weight according to available replica counts of the canary.
    // we will need to set the desiredWeight to 0 when the newRS is not available.
    if c.rollout.Spec.Strategy.Canary.DynamicStableScale {
        desiredWeight = (100 * c.newRS.Status.AvailableReplicas) / *c.rollout.Spec.Replicas
    } else if c.rollout.Status.Canary.Weights != nil {
        desiredWeight = c.rollout.Status.Canary.Weights.Canary.Weight
    }
} else if index != nil {
    atDesiredReplicaCount := replicasetutil.AtDesiredReplicaCountsForCanary(c.rollout, c.newRS, c.stableRS, c.otherRSs, nil)
    if !atDesiredReplicaCount && !c.rollout.Status.PromoteFull {
        // Use the previous weight since the new RS is not ready for a new weight
        for i := *index - 1; i >= 0; i-- {
            step := c.rollout.Spec.Strategy.Canary.Steps[i]
            if step.SetWeight != nil {
                desiredWeight = *step.SetWeight
                break
            }
        }
        weightDestinations = append(weightDestinations, c.calculateWeightDestinationsFromExperiment()...)
    } else if *index != int32(len(c.rollout.Spec.Strategy.Canary.Steps)) {
        // If the rollout is progressing through the steps, the desired
        // weight of the traffic routing service should be at the value of the
        // last setWeight step, which is set by GetCurrentSetWeight.
        desiredWeight = replicasetutil.GetCurrentSetWeight(c.rollout)
        weightDestinations = append(weightDestinations, c.calculateWeightDestinationsFromExperiment()...)
    } else {
        desiredWeight = 100
    }
}
err = reconciler.UpdateHash(canaryHash, stableHash, weightDestinations...)

If you take a look at the code above, it calls UpdateHash in all cases, but we need to find the place where it sets canaryHash to stableHash in the case of full promotion.

@mksha
Contributor Author

mksha commented Feb 4, 2022

@huikang

@huikang
Member

huikang commented Feb 4, 2022

I think the root cause is

if c.newRS != nil {
    canaryHash = c.newRS.Labels[v1alpha1.DefaultRolloutUniqueLabelKey]
}

Even after the update is aborted, c.newRS still points to the 0-replica RS, which leaves the newRS hash in status.canary.weights.canary.
Maybe we can set newRS to the stable RS if newRS is being deleted, at

} else {
    c.log.Infof("RS '%s' has reached the scaleDownTime", c.newRS.Name)
    newReplicasCount = int32(0)
}

I might be missing some cases. @jessesuen, could you provide any guidance?

@jessesuen
Member

jessesuen commented Feb 5, 2022

Maybe we can set newRS to stable RS if newRS is being deleted

I don't think this would work because c.newRS should always point to the latest version and a lot of code relies on that being true.

even if virtual service is having weight 0 for the canary, there is a case when you have session affinity enabled then in that case your traffic will still go to canary replica set.becuase in virtual service you have that rule defined for our canary session affinity cookie.

Fundamentally, this is the issue. We currently assume that if we set a weight of 0, then no traffic will hit the RS. But you are explaining that this isn't always the case when used in conjunction with session affinity. Can you provide the example VirtualService/DestinationRule with session affinity that causes this so I can better understand?

When this combination of session affinity and canary weight happens, how does Istio decide which takes priority?

@mksha
Contributor Author

mksha commented Feb 6, 2022

Basically, in the VirtualService we have a rule: if the host is canary.example.com, or the request contains a session cookie requesting the canary version of the app, then the traffic is routed to the canary subset of the DestinationRule, no matter what the weight in the VirtualService is, in order to provide session affinity.

In this case, if the DestinationRule's canary subset points to a pod that is no longer there after a rollout abort, all canary requests will fail.
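A minimal sketch of the kind of VirtualService rule being described (the route names, cookie value, and regex here are illustrative assumptions, not taken from the actual manifests):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: testapp-internal
spec:
  hosts:
  - testapp
  http:
  - name: canary-affinity                    # matched first: cookie pins the session to canary
    match:
    - headers:
        cookie:
          regex: ".*canary-version=true.*"   # hypothetical affinity cookie
    route:
    - destination:
        host: testapp
        subset: canary                       # bypasses the weights in the route below
  - name: primary                            # weighted route for everything else
    route:
    - destination:
        host: testapp
        subset: stable
      weight: 100
    - destination:
        host: testapp
        subset: canary
      weight: 0
```

Because the cookie match is evaluated before the weighted route, a pinned session keeps hitting the canary subset even at weight 0; if that subset's rollouts-pod-template-hash selects a scaled-down ReplicaSet, those requests fail.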

@mksha
Contributor Author

mksha commented Feb 7, 2022

So as a solution, what we need is: we can keep the RS status whatever it is, but update the DestinationRule to remove the rollouts-pod-template-hash labels completely from both the canary and stable subsets when the rollout is aborted. When we start the rollout again, or retry, set the DestinationRule's pod-template-hash labels again.
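Under that proposal, the DestinationRule after an abort would look roughly like this (a sketch of the suggested behavior, not current controller output): both subsets keep their app labels but drop the hash label, so both select the same pods as the stable Service.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: testapp
spec:
  host: testapp
  subsets:
  - labels:
      app.kubernetes.io/component: testapp
      # rollouts-pod-template-hash removed on abort (proposed behavior)
    name: canary
  - labels:
      app.kubernetes.io/component: testapp
      # rollouts-pod-template-hash removed on abort (proposed behavior)
    name: stable
```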

@mksha
Contributor Author

mksha commented Feb 7, 2022

@jessesuen @huikang any feedback on above?

@mksha
Contributor Author

mksha commented Feb 7, 2022

I think we need to have that logic outside of the if c.shouldDelayScaleDownOnAbort() {} block, because in the case of a rollout abort it will not go inside that if block at all.

@mksha
Contributor Author

mksha commented Feb 15, 2022

@huikang @jessesuen any thoughts?

@mksha
Contributor Author

mksha commented Mar 1, 2022

@jgwest any thoughts on this?

@harikrongali harikrongali added this to the v1.4 milestone Oct 20, 2022
@zachaller zachaller added enhancement New feature or request and removed bug Something isn't working labels Dec 2, 2022
@zachaller zachaller removed this from the v1.4 milestone Dec 21, 2022
@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity.

@mehmanoglu

Any news on this issue?

@zachaller
Collaborator

Not yet. We would be open to PRs; I might have time to get to it for 1.6, but no guarantee.

@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity.

@Eslam-mohammed-anwar

Eslam-mohammed-anwar commented Nov 28, 2023

We have the same issue, with a more serious situation: when the rollout aborts, the stable svc and canary svc don't have the same stable hash. This means that when scaling is happening (HPA) and dynamicStableScale is true, we get events that VirtualService <service_name> is set to desiredWeight '-500', which is a negative value, and we start getting 503s on the service until we force promote.

Let us know if you need more debugging information.
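The negative weight is consistent with the dynamicStableScale abort branch quoted earlier in the thread: desiredWeight = 100 - ((100 * stableRS.Status.AvailableReplicas) / *c.rollout.Spec.Replicas). A minimal standalone sketch (illustrative numbers, not the controller code) shows how HPA scaling the stable ReplicaSet beyond spec.replicas drives the result negative:

```go
package main

import "fmt"

// desiredCanaryWeightOnAbort mirrors the dynamicStableScale abort formula
// from the controller snippet quoted above (illustrative helper only).
func desiredCanaryWeightOnAbort(stableAvailable, specReplicas int32) int32 {
	return 100 - ((100 * stableAvailable) / specReplicas)
}

func main() {
	// Stable fully available at the declared replica count: canary weight is 0.
	fmt.Println(desiredCanaryWeightOnAbort(1, 1))
	// HPA has scaled stable to 6 available replicas while spec.replicas is still 1:
	// 100 - (600 / 1) = -500, matching the event reported in this comment.
	fmt.Println(desiredCanaryWeightOnAbort(6, 1))
}
```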
