
After abort, rollout does not update Istio DestinationRule #1838

Open
mksha opened this issue Feb 2, 2022 · 42 comments

@mksha
Contributor

mksha commented Feb 2, 2022

Summary

What happened/what you expected to happen?

When we abort a rollout, it should update the Istio DestinationRule so that the subsets in the DestinationRule carry the correct rollouts-pod-template-hash labels.

But currently, aborting the rollout does not update the DestinationRule, causing the canary endpoint to fail.

Diagnostics

What version of Argo Rollouts are you running?
0.3.0

# Paste the logs from the rollout controller
The canary subset is not updated with the correct rollouts-pod-template-hash label value.

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>
The logs do not have any reference to a DestinationRule update after the abort action.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@mksha mksha added the bug Something isn't working label Feb 2, 2022
@huikang
Member

huikang commented Feb 2, 2022

Hi, @mksha, could you confirm whether this happens on v0.3.0? Have you tried v1.1.1 or the latest argo-rollouts image?

@mksha
Contributor Author

mksha commented Feb 2, 2022

yes, we are using 1.1.1

@huikang
Member

huikang commented Feb 2, 2022

@mksha, could you paste here the output of get rollout <rollout name> and get destinationrule <name> -o yaml? Thanks.

@mksha
Contributor Author

mksha commented Feb 2, 2022

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  labels:
    region: us-east-1
    system_risk_class: "1"
  managedFields:
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:rollout.argoproj.io/revision: {}
      f:spec:
        f:replicas: {}
      f:status:
        .: {}
        f:HPAReplicas: {}
        f:availableReplicas: {}
        f:blueGreen: {}
        f:canary:
          .: {}
          f:weights:
            .: {}
            f:canary:
              .: {}
              f:podTemplateHash: {}
              f:weight: {}
            f:stable:
              .: {}
              f:podTemplateHash: {}
              f:weight: {}
        f:conditions: {}
        f:currentPodHash: {}
        f:currentStepHash: {}
        f:currentStepIndex: {}
        f:observedGeneration: {}
        f:phase: {}
        f:readyReplicas: {}
        f:replicas: {}
        f:selector: {}
        f:stableRS: {}
        f:updatedReplicas: {}
    manager: rollouts-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
          f:app_component: {}
          f:app_contacts: {}
          f:app_environment: {}
          f:application: {}
          f:argocd.argoproj.io/instance: {}
          f:business_unit: {}
          f:created_by: {}
          f:function: {}
          f:helm.sh/chart: {}
          f:network_environment: {}
          f:region: {}
          f:system_risk_class: {}
      f:spec:
        .: {}
        f:minReadySeconds: {}
        f:paused: {}
        f:progressDeadlineSeconds: {}
        f:revisionHistoryLimit: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:app.kubernetes.io/component: {}
            f:app.kubernetes.io/instance: {}
            f:app.kubernetes.io/managed-by: {}
            f:app.kubernetes.io/name: {}
            f:app_component: {}
            f:app_environment: {}
            f:application: {}
            f:business_unit: {}
            f:function: {}
            f:network_environment: {}
            f:region: {}
        f:strategy:
          .: {}
          f:canary:
            .: {}
            f:antiAffinity:
              .: {}
              f:preferredDuringSchedulingIgnoredDuringExecution:
                .: {}
                f:weight: {}
            f:canaryMetadata:
              .: {}
              f:annotations:
                .: {}
                f:role: {}
              f:labels:
                .: {}
                f:role: {}
            f:stableMetadata:
              .: {}
              f:annotations:
                .: {}
                f:role: {}
              f:labels:
                .: {}
                f:role: {}
            f:steps: {}
            f:trafficRouting:
              .: {}
              f:istio:
                .: {}
                f:destinationRule:
                  .: {}
                  f:canarySubsetName: {}
                  f:name: {}
                  f:stableSubsetName: {}
                f:virtualServices: {}
        f:template:
          .: {}
          f:metadata:
            .: {}
            f:labels:
              .: {}
              f:app.kubernetes.io/component: {}
              f:app.kubernetes.io/instance: {}
              f:app.kubernetes.io/managed-by: {}
              f:app.kubernetes.io/name: {}
              f:app.kubernetes.io/version: {}
              f:app_component: {}
              f:app_contacts: {}
              f:app_environment: {}
              f:application: {}
              f:business_unit: {}
              f:created_by: {}
              f:function: {}
              f:helm.sh/chart: {}
              f:network_environment: {}
              f:region: {}
              f:system_risk_class: {}
          f:spec:
            .: {}
            f:affinity:
              .: {}
              f:nodeAffinity:
                .: {}
                f:preferredDuringSchedulingIgnoredDuringExecution: {}
              f:podAntiAffinity:
                .: {}
                f:preferredDuringSchedulingIgnoredDuringExecution: {}
            f:containers: {}
            f:imagePullSecrets: {}
            f:serviceAccountName: {}
            f:volumes: {}
    manager: argocd-application-controller
    operation: Update
    time: "2022-02-02T15:10:49Z"
  name: testapp
  namespace: test-ns
spec:
  minReadySeconds: 60
  paused: false
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: bu
      function: appserver
      network_environment: dev
      region: us-east-1
  strategy:
    canary:
      antiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          weight: 1
      canaryMetadata:
        annotations:
          role: canary
        labels:
          role: canary
      stableMetadata:
        annotations:
          role: stable
        labels:
          role: stable
      steps:
      - setWeight: 10
      - pause: {}
      - setWeight: 50
      - pause: {}
      - setWeight: 100
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: testapp
            stableSubsetName: stable
          virtualServices:
          - name: testapp-internal
            routes:
            - NewSessionDoc
            - NewSessionDocInternal
          - name: testapp-external
            routes:
            - NewSessionDoc
  template:
    metadata:
      labels:
        app.kubernetes.io/component: testapp
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: testapp
        app.kubernetes.io/version: 09c502fab0a1043740c21566dec377ce9f17e3cb-9
        app_component: testapp
        app_environment: dev
        application: testapp
        business_unit: bu
        function: appserver
        helm.sh/chart: webapp-5.0.1-alpha.9
        network_environment: dev
        region: us-east-1
        system_risk_class: "1"
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: workload
                operator: In
                values:
                - test
            weight: 50
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/test
                operator: In
                values:
                - ""
            weight: 1
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - testapp
                - key: app.kubernetes.io/version
                  operator: In
                  values:
                  - 09c502fab0a1043740c21566dec377ce9f17e3cb-9
              topologyKey: topology.kubernetes.io/zone
            weight: 51
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - testapp
                - key: app.kubernetes.io/version
                  operator: In
                  values:
                  - 09c502fab0a1043740c21566dec377ce9f17e3cb-9
              topologyKey: kubernetes.io/hostname
            weight: 50
      containers:
      - env:
        - name: JAVA_MS
          value: -Xms1g
        - name: JAVA_MX
          value: -Xmx1g
        - name: PROFILE_ENV
          value: dev
        image: testapp:09c502fab0a1043740c21566dec377ce9f17e3cb-9
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /keepalive.html
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 150
          successThreshold: 1
        name: testapp
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /keepalive.html
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 150
          successThreshold: 1
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 1
            memory: 1.5Gi
        volumeMounts:
        - mountPath: /web/company/docs/
          name: pv-volume
      imagePullSecrets:
      - name: credreg
      serviceAccountName: testapp
      volumes:
      - name: pv-volume
        nfs:
          path: /web/company/docs/
          server: company.com
status:
  HPAReplicas: 1
  availableReplicas: 1
  blueGreen: {}
  canary:
    weights:
      canary:
        podTemplateHash: 649fdf8c7d
        weight: 0
      stable:
        podTemplateHash: 649fdf8c7d
        weight: 100
  conditions:
  - lastTransitionTime: "2022-02-02T15:08:35Z"
    lastUpdateTime: "2022-02-02T15:08:35Z"
    message: Rollout is paused
    reason: RolloutPaused
    status: "False"
    type: Paused
  - lastTransitionTime: "2022-02-02T15:10:44Z"
    lastUpdateTime: "2022-02-02T15:10:44Z"
    message: Rollout has minimum availability
    reason: AvailableReason
    status: "True"
    type: Available
  - lastTransitionTime: "2022-02-02T15:13:19Z"
    lastUpdateTime: "2022-02-02T15:13:19Z"
    message: RolloutCompleted
    reason: RolloutCompleted
    status: "True"
    type: Completed
  - lastTransitionTime: "2022-02-02T15:08:35Z"
    lastUpdateTime: "2022-02-02T15:13:19Z"
    message: ReplicaSet "testapp-649fdf8c7d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  currentPodHash: 649fdf8c7d
  currentStepHash: 5cd664bc9d
  currentStepIndex: 5
  observedGeneration: "115"
  phase: Healthy
  readyReplicas: 1
  replicas: 1
  selector: app.kubernetes.io/component=testapp,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=testapp,app_component=testapp,app_environment=dev,application=testapp,business_unit=bu,function=appserver,network_environment=dev,region=us-east-1
  stableRS: 649fdf8c7d
  updatedReplicas: 1

@mksha
Contributor Author

mksha commented Feb 2, 2022

@huikang

@mksha
Contributor Author

mksha commented Feb 2, 2022

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  annotations:
    argo-rollouts.argoproj.io/managed-by-rollouts: testapp
  creationTimestamp: "2021-11-29T13:48:24Z"
  generation: 147
  labels:
    app.kubernetes.io/component: testapp
  managedFields:
  - apiVersion: networking.istio.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
          f:app_component: {}
          f:app_contacts: {}
          f:app_environment: {}
          f:application: {}
          f:argocd.argoproj.io/instance: {}
          f:business_unit: {}
          f:created_by: {}
          f:function: {}
          f:helm.sh/chart: {}
          f:network_environment: {}
          f:region: {}
          f:system_risk_class: {}
      f:spec:
        .: {}
        f:host: {}
        f:trafficPolicy:
          .: {}
          f:loadBalancer:
            .: {}
            f:consistentHash:
              .: {}
              f:httpCookie:
                .: {}
                f:name: {}
                f:path: {}
                f:ttl: {}
    manager: argocd-application-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  - apiVersion: networking.istio.io/v1alpha3
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:argo-rollouts.argoproj.io/managed-by-rollouts: {}
      f:spec:
        f:subsets: {}
    manager: rollouts-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  name: testapp
  namespace: default
  resourceVersion: "1013555439"
  selfLink: /apis/networking.istio.io/v1beta1/namespaces/default/destinationrules/testapp
  uid: 4d21d1d1-e039-415d-942a-b7ce99dbdd93
spec:
  host: testapp
  subsets:
  - labels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: pdo
      function: appserver
      network_environment: dev
      region: us-east-1
      rollouts-pod-template-hash: 56b5f6b98d
    name: canary
  - labels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: pdo
      function: appserver
      network_environment: dev
      region: us-east-1
      rollouts-pod-template-hash: 649fdf8c7d
    name: stable
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: testapp
          path: /
          ttl: 1m0s

@mksha
Contributor Author

mksha commented Feb 2, 2022

@huikang let me know if you need more details.
thanks

@huikang
Member

huikang commented Feb 2, 2022

Hi, @mksha, thanks for the detailed info. It would be even better if you could provide the output of kubectl argo rollouts get rollout <rollout name>. Thanks.

@mksha
Contributor Author

mksha commented Feb 2, 2022

@jgwest @alexec

@huikang
Member

huikang commented Feb 2, 2022

Also, how about the status of the VirtualService testapp-external?

@huikang
Member

huikang commented Feb 2, 2022

I couldn't reproduce the error with the latest master branch: after aborting an update, the VirtualService is set with 100% weight to the stable subset.

http:
  - name: primary
    route:
    - destination:
        host: istio-rollout
        subset: stable
      weight: 100
    - destination:
        host: istio-rollout
        subset: canary
      weight: 0

The DestinationRule has the correct pod labels:

spec:
  host: istio-rollout
  subsets:
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: 8c48ddb95
    name: canary
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: f4dcdc666
    name: stable
# rollout status
  canary:
    weights:
      canary:
        podTemplateHash: 8c48ddb95
        weight: 0
      stable:
        podTemplateHash: f4dcdc666
        weight: 100

And the stable pod is still running

get replicaset
NAME                      DESIRED   CURRENT   READY   AGE
istio-rollout-8c48ddb95   0         0         0       7m20s
istio-rollout-f4dcdc666   1         1         1       7m36s

@huikang
Member

huikang commented Feb 2, 2022

@mksha , in your case, I noticed the following in the rollout status

  canary:
    weights:
      canary:
        podTemplateHash: 649fdf8c7d
        weight: 0
      stable:
        podTemplateHash: 649fdf8c7d
        weight: 100

The canary pod is supposed to have a different hash value. I remember a bug related to this was fixed after v1.1.1. Could you try using the latest version?

@mksha
Contributor Author

mksha commented Feb 2, 2022

Name:            testapp
Namespace:       default
Status:          ✖ Degraded
Message:         RolloutAborted: Rollout aborted update to revision 11
Strategy:        Canary
  Step:          0/5
  SetWeight:     0
  ActualWeight:  0
Images:          testapp:09c502fab0a1043740c21566dec377ce9f17e3cb-9 (stable)
Replicas:
  Desired:       1
  Current:       1
  Updated:       0
  Ready:         1
  Available:     1

NAME                                   KIND        STATUS        AGE    INFO
⟳ gdocument                            Rollout     ✖ Degraded    65d    
├──# revision:11                                                        
│  └──⧉testapp-75dfb84574           ReplicaSet  • ScaledDown  27m    canary,delay:passed
├──# revision:10                                                        
│  └──⧉testapp-649fdf8c7d           ReplicaSet  ✔ Healthy     4h3m   stable
│     └──□testapp-649fdf8c7d-wf79v  Pod         ✔ Running     107m   ready:2/2
└──# revision:9                                                         
   └──⧉testapp-56b5f6b98d           ReplicaSet  • ScaledDown  4h13m  

@mksha
Contributor Author

mksha commented Feb 2, 2022

@huikang there you go

@mksha
Contributor Author

mksha commented Feb 2, 2022

This is after aborting the rollout.

@huikang
Member

huikang commented Feb 2, 2022

@mksha , could you try the latest image? I think the issue has been fixed there.

@mksha
Contributor Author

mksha commented Feb 3, 2022

Hi @huikang, after the abort both subsets in the DestinationRule have the same rollouts-pod-template-hash: f4dcdc666, because the canary pod is not there to serve the canary traffic. If the DestinationRule does not point to stable for both the stable and canary subsets, canary requests will fail.

@mksha
Contributor Author

mksha commented Feb 3, 2022

@huikang do you mean trying an image built from the master branch?
Because I am already using v1.1.1.

@mksha
Contributor Author

mksha commented Feb 3, 2022

canary:
    weights:
      canary:
        podTemplateHash: 5865cf698b
        weight: 0
      stable:
        podTemplateHash: 649fdf8c7d
        weight: 100

@mksha
Contributor Author

mksha commented Feb 3, 2022

This is from the rollout that was aborted, using image v1.1.1.

@mksha
Contributor Author

mksha commented Feb 3, 2022

But the DestinationRule is still not updated after the canary pod deletion:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  annotations:
    argo-rollouts.argoproj.io/managed-by-rollouts: testapp
  name: testapp
spec:
  host: testapp
  subsets:
    - labels:
        app.kubernetes.io/component: testapp
        rollouts-pod-template-hash: 5865cf698b
      name: canary
    - labels:
        app.kubernetes.io/component: testapp
        rollouts-pod-template-hash: 649fdf8c7d
      name: stable

@mksha
Contributor Author

mksha commented Feb 3, 2022

@huikang after the abort, both subsets should point to the stable hash.

@mksha
Contributor Author

mksha commented Feb 3, 2022

Or the pod-template-hash label should be removed completely after the abort.

@mksha
Contributor Author

mksha commented Feb 3, 2022

I just figured out that before a rollout starts, and after a rollout completes, both the canary and stable subsets should point to the stable pod template hash.

Example:

weights:
      canary:
        podTemplateHash: 74c7887f8
        weight: 0
      stable:
        podTemplateHash: 74c7887f8
        weight: 100

But in our case after the abort, it's:

canary:
    weights:
      canary:
        podTemplateHash: 6d947796cd
        weight: 0
      stable:
        podTemplateHash: 74c7887f8
        weight: 100

That is what is being copied to the DestinationRule, and that is why it's causing issues.

@huikang
Member

huikang commented Feb 3, 2022

Tested with the latest version and got the following after aborting a canary update:

# Rollout status
status:
  canary:
    currentBackgroundAnalysisRunStatus:
      message: Run Terminated
      name: istio-rollout-8c48ddb95-2
      status: Successful
    weights:
      canary:
        podTemplateHash: 8c48ddb95
        weight: 0
      stable:
        podTemplateHash: f4dcdc666
        weight: 100

The DestinationRule matches the status:

spec:
  host: istio-rollout
  subsets:
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: 8c48ddb95
    name: canary
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: f4dcdc666
    name: stable

@mksha
Contributor Author

mksha commented Feb 4, 2022

@huikang which means it's wrong, because in that case the canary subset will point to a canary ReplicaSet that does not have any pods running, meaning canary requests will fail.

Ideally, when we do a canary deployment in production, some amount of traffic goes to the canary version of the app and we perform some analysis on it. If the analysis fails, meaning the rollout is aborted, we will have the canary subset pointing to a canary ReplicaSet that has no pods, meaning all clients using the canary version will see issues.

So we need the same behavior as at the end of a full promotion, where both the canary and stable subsets point to the stable ReplicaSet.

@huikang
Member

huikang commented Feb 4, 2022

@huikang which means its wrong, bec in that case canary subset will point to the canary replicaset that does not have any pod running meaning canary requests will fail.

Although the canary in the rollout status points to the ReplicaSet with 0 pods, the VirtualService has 0 weight for the canary ReplicaSet, so no traffic will be sent to that ReplicaSet.

So we need to have same behavior as we have at the end of full promotion where both canary and stable subset points to stable replicaset.

I agree that after the update is aborted, the status should reset so that both canary and stable point to the stable ReplicaSet.

@mksha
Contributor Author

mksha commented Feb 4, 2022

Even if the VirtualService has weight 0 for the canary, there is a case where session affinity is enabled: traffic will still go to the canary ReplicaSet, because the VirtualService has a rule defined for our canary session-affinity cookie.

Apart from that, let me know if I can help fix this bug. I already found the place where we update the hash, but I am not able to find why it differs from full promotion.

if rolloututil.IsFullyPromoted(c.rollout) {
    // when we are fully promoted. desired canary weight should be 0
} else if c.pauseContext.IsAborted() {
    // when aborted, desired canary weight should immediately be 0 (100% to stable), *unless*
    // we are using dynamic stable scaling. In that case, we are dynamically decreasing the
    // weight to the canary according to the availability of the stable (whatever it can support).
    if c.rollout.Spec.Strategy.Canary.DynamicStableScale {
        desiredWeight = 100 - ((100 * c.stableRS.Status.AvailableReplicas) / *c.rollout.Spec.Replicas)
        if c.rollout.Status.Canary.Weights != nil {
            // This ensures that if we are already at a lower weight, then we will not
            // increase the weight because stable availability is flapping (e.g. pod restarts)
            desiredWeight = minInt(desiredWeight, c.rollout.Status.Canary.Weights.Canary.Weight)
        }
    }
} else if c.newRS == nil || c.newRS.Status.AvailableReplicas == 0 {
    // when newRS is not available or replicas num is 0. never weight to canary
    weightDestinations = append(weightDestinations, c.calculateWeightDestinationsFromExperiment()...)
} else if c.rollout.Status.PromoteFull {
    // on a promote full, desired stable weight should be 0 (100% to canary),
    // But we can only increase canary weight according to available replica counts of the canary.
    // we will need to set the desiredWeight to 0 when the newRS is not available.
    if c.rollout.Spec.Strategy.Canary.DynamicStableScale {
        desiredWeight = (100 * c.newRS.Status.AvailableReplicas) / *c.rollout.Spec.Replicas
    } else if c.rollout.Status.Canary.Weights != nil {
        desiredWeight = c.rollout.Status.Canary.Weights.Canary.Weight
    }
} else if index != nil {
    atDesiredReplicaCount := replicasetutil.AtDesiredReplicaCountsForCanary(c.rollout, c.newRS, c.stableRS, c.otherRSs, nil)
    if !atDesiredReplicaCount && !c.rollout.Status.PromoteFull {
        // Use the previous weight since the new RS is not ready for a new weight
        for i := *index - 1; i >= 0; i-- {
            step := c.rollout.Spec.Strategy.Canary.Steps[i]
            if step.SetWeight != nil {
                desiredWeight = *step.SetWeight
                break
            }
        }
        weightDestinations = append(weightDestinations, c.calculateWeightDestinationsFromExperiment()...)
    } else if *index != int32(len(c.rollout.Spec.Strategy.Canary.Steps)) {
        // If the rollout is progressing through the steps, the desired
        // weight of the traffic routing service should be at the value of the
        // last setWeight step, which is set by GetCurrentSetWeight.
        desiredWeight = replicasetutil.GetCurrentSetWeight(c.rollout)
        weightDestinations = append(weightDestinations, c.calculateWeightDestinationsFromExperiment()...)
    } else {
        desiredWeight = 100
    }
}
err = reconciler.UpdateHash(canaryHash, stableHash, weightDestinations...)

If you take a look at the code above, it calls UpdateHash in all cases, but we need to find the place where it sets canaryHash to stableHash in the case of full promotion.

@mksha
Contributor Author

mksha commented Feb 4, 2022

@huikang

@huikang
Member

huikang commented Feb 4, 2022

I think the root cause is

if c.newRS != nil {
    canaryHash = c.newRS.Labels[v1alpha1.DefaultRolloutUniqueLabelKey]
}

Even after the update is aborted, c.newRS still points to the 0-replica RS, which leaves the newRS hash in status.canary.weights.canary.
Maybe we can set newRS to the stable RS if newRS is being deleted, at

} else {
    c.log.Infof("RS '%s' has reached the scaleDownTime", c.newRS.Name)
    newReplicasCount = int32(0)
}

I might be missing some cases. @jessesuen, could you provide any guidance?

@jessesuen
Member

jessesuen commented Feb 5, 2022

Maybe we can set newRS to stable RS if newRS is being deleted

I don't think this would work because c.newRS should always point to the latest version and a lot of code relies on that being true.

even if virtual service is having weight 0 for the canary, there is a case when you have session affinity enabled then in that case your traffic will still go to canary replica set.becuase in virtual service you have that rule defined for our canary session affinity cookie.

Fundamentally, this is the issue. We currently assume that if we set a weight of 0, then no traffic will hit the RS. But you are explaining that this isn't always the case when used in conjunction with session affinity. Can you provide the example VirtualService/DestinationRule with session affinity that causes this so I can better understand?

When this combination of session affinity and canary weight happens, how does Istio decide which takes priority?

@mksha
Contributor Author

mksha commented Feb 6, 2022

Basically, in the VirtualService we have a rule: if the host is canary.example.com, or the request contains a session cookie requesting the canary version of the app, then the traffic is routed to the canary subset of the DestinationRule, no matter what the weight in the VirtualService is, in order to provide session affinity.

In this case, if the DestinationRule's canary subset points to a pod that is no longer there after a rollout abort, all canary requests will fail.
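A minimal sketch of the kind of VirtualService rule being described (the route names, cookie value, and regex here are illustrative assumptions, not taken from the actual manifests):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: testapp-internal
spec:
  hosts:
  - testapp
  http:
  - name: canary-affinity                    # matched first: cookie pins the session to canary
    match:
    - headers:
        cookie:
          regex: ".*canary-version=true.*"   # hypothetical affinity cookie
    route:
    - destination:
        host: testapp
        subset: canary                       # bypasses the weights in the route below
  - name: primary                            # weighted route for everything else
    route:
    - destination:
        host: testapp
        subset: stable
      weight: 100
    - destination:
        host: testapp
        subset: canary
      weight: 0
```

Because the cookie match is evaluated before the weighted route, a pinned session keeps hitting the canary subset even at weight 0; if that subset's rollouts-pod-template-hash selects a scaled-down ReplicaSet, those requests fail.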

@mksha
Contributor Author

mksha commented Feb 7, 2022

So as a solution, what we need is: we can keep the RS status whatever it is, but update the DestinationRule to remove the rollouts-pod-template-hash labels completely from both the canary and stable subsets when the rollout is aborted. When we start the rollout again, or retry, set the DestinationRule's pod-template-hash labels again.
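Under that proposal, the DestinationRule after an abort would look roughly like this (a sketch of the suggested behavior, not current controller output): both subsets keep their app labels but drop the hash label, so both select the same pods as the stable Service.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: testapp
spec:
  host: testapp
  subsets:
  - labels:
      app.kubernetes.io/component: testapp
      # rollouts-pod-template-hash removed on abort (proposed behavior)
    name: canary
  - labels:
      app.kubernetes.io/component: testapp
      # rollouts-pod-template-hash removed on abort (proposed behavior)
    name: stable
```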

@mksha
Contributor Author

mksha commented Feb 7, 2022

@jessesuen @huikang any feedback on above?

@mksha
Contributor Author

mksha commented Feb 7, 2022

I think we need to have that logic outside of the if c.shouldDelayScaleDownOnAbort() {} block, because in the case of a rollout abort it will not go inside that if block at all.

@mksha
Contributor Author

mksha commented Feb 15, 2022

@huikang @jessesuen any thoughts?

@mksha
Contributor Author

mksha commented Mar 1, 2022

@jgwest any thoughts on this?

@harikrongali harikrongali added this to the v1.4 milestone Oct 20, 2022
@zachaller zachaller added enhancement New feature or request and removed bug Something isn't working labels Dec 2, 2022
@zachaller zachaller removed this from the v1.4 milestone Dec 21, 2022
@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity.

@mehmanoglu

Any news on this issue?

@zachaller
Collaborator

Not yet. We would be open to PRs; I might have time to get to it for 1.6, but no guarantee.

@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity.

@Eslam-mohammed-anwar

Eslam-mohammed-anwar commented Nov 28, 2023

We have the same issue, with a more serious situation: when the rollout aborts, the stable svc and canary svc don't have the same stable hash. This means that when scaling is happening (HPA) and dynamicStableScale is true, we get events that VirtualService <service_name> is set to desiredWeight '-500', which is a negative value, and we start getting 503s on the service until we force promote.

Let us know if you need more debugging information.
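The negative weight is consistent with the dynamicStableScale abort branch quoted earlier in the thread: desiredWeight = 100 - ((100 * stableRS.Status.AvailableReplicas) / *c.rollout.Spec.Replicas). A minimal standalone sketch (illustrative numbers, not the controller code) shows how HPA scaling the stable ReplicaSet beyond spec.replicas drives the result negative:

```go
package main

import "fmt"

// desiredCanaryWeightOnAbort mirrors the dynamicStableScale abort formula
// from the controller snippet quoted above (illustrative helper only).
func desiredCanaryWeightOnAbort(stableAvailable, specReplicas int32) int32 {
	return 100 - ((100 * stableAvailable) / specReplicas)
}

func main() {
	// Stable fully available at the declared replica count: canary weight is 0.
	fmt.Println(desiredCanaryWeightOnAbort(1, 1))
	// HPA has scaled stable to 6 available replicas while spec.replicas is still 1:
	// 100 - (600 / 1) = -500, matching the event reported in this comment.
	fmt.Println(desiredCanaryWeightOnAbort(6, 1))
}
```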
