
experiments: recovered from panic, on v1.4 #2608

Open
Lykathia opened this issue Feb 19, 2023 · 10 comments
Labels
bug Something isn't working no-issue-activity

@Lykathia

When running an experiment, a number of nil pointer exceptions are thrown - before the experiment eventually passes and moseys on its merry way.

We use istio in our environment.

It seems to resolve itself after a dozen+ failures.

To Reproduce

spec:
  analysis:
    successfulRunHistoryLimit: 1
    unsuccessfulRunHistoryLimit: 1
  progressDeadlineAbort: true
  replicas: 2
  revisionHistoryLimit: 5
  strategy:
    canary:
      steps:
        - experiment:
            analyses:
              - args:
                  - name: api-root-url
                    value: >-
                      http://example-preview.default.svc.cluster.local:8080
                  - name: request-timeout-milliseconds
                    value: '1000'
                name: example-smoke-test-analysis
                requiredForCompletion: true
                templateName: example-smoke-test-analysis
            duration: 5m
            templates:
              - metadata:
                  labels:
                    app.kubernetes.io/name: example-preview
                name: preview
                replicas: 1
                selector:
                  matchLabels:
                    app.kubernetes.io/name: example-preview
                specRef: canary
        - setWeight: 5
        - pause:
            duration: 10m
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: example-mesh-destination
            stableSubsetName: stable
          virtualService:
            name: example-mesh-virtualservice
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-svc-deployment

Expected behavior

The logs should not contain nil pointer exception errors.

Version

1.4

Logs

Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 325 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x21c1260, 0x3c0ff10})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).calculateWeightDestinationsFromExperiment(0xc00198ac00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:333 +0x1f7
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileTrafficRouting(0xc00198ac00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:176 +0x7c5
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutCanary(0xc00198ac00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/canary.go:56 +0x166
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc00198ac00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/context.go:86 +0xe7
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc0000e48c0, {0x29d81b0, 0xc000806c00}, {0xc0023f39b0, 0x27})
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:415 +0x4d3
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x29e4220?, 0xc000522520}, {0x25881fc, 0x7}, 0xc0009a7e70, {0x29d81b0, 0xc000806c00}, 0xc0005812c0?, {0x2093520, 0xc001bdde70})
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x29d81b0, 0xc000806c00}, {0x29e4220, 0xc000522520}, {0x25881fc, 0x7}, 0x0?, 0x0?)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:336 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x29b80e0, 0xc00150ccf0}, 0x1, 0xc00017d0e0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:335 +0xa7
@zachaller
Collaborator

@Lykathia Can you try v1.4.1?

@Lykathia
Author

@zachaller yep, PR in - will get it reviewed and tested on Monday and report back. Thanks!

@Lykathia
Author

1.4.1 does not appear to have addressed the issue.

Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 337 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x21c24a0, 0x3c11f30})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).calculateWeightDestinationsFromExperiment(0xc0019b8c00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:333 +0x1f7
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileTrafficRouting(0xc0019b8c00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:176 +0x7c5
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutCanary(0xc0019b8c00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/canary.go:56 +0x166
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc0019b8c00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/context.go:86 +0xe7
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc000210700, {0x29d99f0, 0xc0002ea100}, {0xc002159dd0, 0x27})
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:415 +0x4d3
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x29e5a60?, 0xc000512c20}, {0x25899bc, 0x7}, 0xc000e4de70, {0x29d99f0, 0xc0002ea100}, 0xc000317080?, {0x2094760, 0xc002dc11d0})
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x29d99f0, 0xc0002ea100}, {0x29e5a60, 0xc000512c20}, {0x25899bc, 0x7}, 0x2094760?, 0x29b0250?)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:336 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x3332303622?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x3a22736c6562616c?, {0x29b9900, 0xc000d63590}, 0x1, 0xc00055d740)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x706d65742d646f70?, 0x3b9aca00, 0x0, 0x37?, 0x61746f6e6e61222c?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x6a6f72706f677261?, 0x636e79732f6f692e?, 0x223a22657661772d?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:335 +0xa7

@zachaller
Collaborator

@Lykathia I would love to see your rollout object with the status field intact; if you could share that, it would be awesome. I'm also wondering whether you are trying to use a weight on your experiment, which I should be able to gather from your rollout object.

@Lykathia
Author

Sure thing! Some nouns edited and some annotations/labels removed.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: '10'
    rollout.argoproj.io/revision: '46'
    rollout.argoproj.io/workload-generation: '52'
  creationTimestamp: '2022-06-03T17:50:51Z'
  generation: 39
  labels:
    app.kubernetes.io/instance: production-eks-prod-examplens-example
  name: example-rollout-c89ac2e8
  namespace: examplens
  resourceVersion: '614921626'
  uid: 86e21dfc-7adf-40f5-a9de-b5d368b0a843
spec:
  analysis:
    successfulRunHistoryLimit: 1
    unsuccessfulRunHistoryLimit: 1
  progressDeadlineAbort: true
  replicas: 2
  restartAt: '2022-11-09T19:40:00Z'
  revisionHistoryLimit: 5
  strategy:
    canary:
      steps:
        - experiment:
            analyses:
              - args:
                  - name: api-root-url
                    value: >-
                      http://example-rollout-preview.examplens.svc.cluster.local:8080
                  - name: request-timeout-milliseconds
                    value: '1000'
                name: example-smoke-test-analysis
                requiredForCompletion: true
                templateName: example-smoke-test-analysis
            duration: 5m
            templates:
              - metadata:
                  labels:
                    app.kubernetes.io/name: example-rollout-preview
                name: preview
                replicas: 1
                selector:
                  matchLabels:
                    app.kubernetes.io/name: example-rollout-preview
                specRef: canary
        - setWeight: 5
        - pause:
            duration: 10m
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: example-mesh-destination-c833f030
            stableSubsetName: stable
          virtualService:
            name: example-mesh-virtualservice-c8e2aa97
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-svc-deployment-c8e30830
status:
  HPAReplicas: 2
  availableReplicas: 2
  blueGreen: {}
  canary:
    weights:
      canary:
        podTemplateHash: 5ccdfd574c
        weight: 0
      stable:
        podTemplateHash: 5ccdfd574c
        weight: 100
  conditions:
    - lastTransitionTime: '2023-02-27T14:45:05Z'
      lastUpdateTime: '2023-02-27T14:45:05Z'
      message: Rollout has minimum availability
      reason: AvailableReason
      status: 'True'
      type: Available
    - lastTransitionTime: '2023-02-27T15:12:21Z'
      lastUpdateTime: '2023-02-27T15:12:21Z'
      message: Rollout is paused
      reason: RolloutPaused
      status: 'False'
      type: Paused
    - lastTransitionTime: '2023-02-27T15:12:32Z'
      lastUpdateTime: '2023-02-27T15:12:32Z'
      message: RolloutCompleted
      reason: RolloutCompleted
      status: 'True'
      type: Completed
    - lastTransitionTime: '2023-02-27T15:13:02Z'
      lastUpdateTime: '2023-02-27T15:13:02Z'
      message: Rollout is healthy
      reason: RolloutHealthy
      status: 'True'
      type: Healthy
    - lastTransitionTime: '2023-02-27T15:12:21Z'
      lastUpdateTime: '2023-02-27T15:13:02Z'
      message: >-
        ReplicaSet "example-rollout-c89ac2e8-5ccdfd574c" has successfully
        progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
  currentPodHash: 5ccdfd574c
  currentStepHash: 75dcbd8f68
  currentStepIndex: 3
  observedGeneration: '39'
  phase: Healthy
  readyReplicas: 2
  replicas: 2
  restartedAt: '2022-11-09T19:40:00Z'
  selector: app.kubernetes.io/name=example-svc-c8333369
  stableRS: 5ccdfd574c
  updatedReplicas: 2
  workloadObservedGeneration: '52'

Everything seems to complete fine and as expected; just the logs are exploding with the NPEs.

@github-actions
Contributor

github-actions bot commented Mar 5, 2023

This issue is stale because it has awaiting-response label for 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions
Contributor

github-actions bot commented May 5, 2023

This issue is stale because it has been open 60 days with no activity.

@zachaller
Collaborator

Maybe related #2734

@github-actions
Contributor

github-actions bot commented Jul 5, 2023

This issue is stale because it has been open 60 days with no activity.

@zimmertr

I saw similar exceptions occurring when calculateWeightDestinationsFromExperiment was called. My guess was that Argo was calling this function to determine the weight the experiment should have and it was returning nil or an unexpected object. Argo would then try to adjust the VirtualService weight to that value, Istio's admission controller would reject it for not being a zero-or-positive integer, and Argo would fail to handle the error gracefully.

I solved it by adding a weight: # to my experiment.
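
For reference, a minimal sketch of where that weight would sit in the Rollout spec shared above. This is an assumption based on the experiment step template fields rather than an exact manifest from this issue, and the value 5 is just a placeholder:

spec:
  strategy:
    canary:
      steps:
        - experiment:
            duration: 5m
            templates:
              - name: preview
                specRef: canary
                replicas: 1
                # assumed placement of the explicit weight; the value is a placeholder
                weight: 5

The idea being that the controller then has a concrete weight to hand to the Istio traffic routing instead of a nil one.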
