
[bug] Runtime panic in v1alpha3 when deleting extra trials #1222

Closed
gaocegege opened this issue Jun 18, 2020 · 15 comments · Fixed by #1223

Comments

@gaocegege
Member

/kind bug

What steps did you take and what happened:

{"level":"info","ts":1592453260.8852422,"logger":"experiment-controller","msg":"DeleteTrials","Experiment":"common1/experiment-20200618015209-kbs4gtfq","deleteCount":1}
E0618 04:07:40.885467       1 runtime.go:69] Observed a panic: "index out of range" (runtime error: index out of range)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:522
/usr/local/go/src/runtime/panic.go:44
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:359
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:283
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:264
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:222
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1337
panic: runtime error: index out of range [recovered]
	panic: runtime error: index out of range

goroutine 292 [running]:
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x1547f40, 0x27efa20)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).deleteTrials(0xc0008689c0, 0xc000e62000, 0x2827da8, 0x0, 0x0, 0x1, 0xc000d61200, 0x11)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:359 +0x374
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).ReconcileTrials(0xc0008689c0, 0xc000e62000, 0x2827da8, 0x0, 0x0, 0xc000401260, 0x0)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:283 +0x29b
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).ReconcileExperiment(0xc0008689c0, 0xc000e62000, 0x2827da8, 0x0)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:264 +0x37f
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).Reconcile(0xc0008689c0, 0xc000d5b009, 0x7, 0xc000d51680, 0x22, 0x2809e00, 0x0, 0x0, 0x0)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:222 +0x50a
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0007312c0, 0x0)
	/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215 +0x1cc
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
	/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158 +0x36
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000ab75f0)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000ab75f0, 0x3b9aca00, 0x0, 0x1, 0xc0007254a0)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xf8
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc000ab75f0, 0x3b9aca00, 0xc0007254a0)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157 +0x311

What did you expect to happen:

No panic. We should check the length of the slice.
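A minimal sketch of the missing guard (function and variable names here are illustrative, not Katib's actual code): clamp the delete count to the slice length before indexing, so a stale count in the experiment status can never index past the end of the trials slice.

```go
package main

import "fmt"

// safeDeleteCount returns how many trials can actually be deleted: the
// requested count, clamped to the slice length. Without this guard,
// indexing trials[i] for i >= len(trials) panics with "index out of
// range", matching the trace above. Names are illustrative only.
func safeDeleteCount(trials []string, deleteCount int) int {
	if deleteCount > len(trials) {
		return len(trials)
	}
	return deleteCount
}

func main() {
	trials := []string{} // no trials left in the cluster
	deleteCount := 1     // but the experiment status says one should go

	n := safeDeleteCount(trials, deleteCount)
	for i := 0; i < n; i++ {
		fmt.Println("deleting", trials[i]) // never out of range
	}
	fmt.Println("deleted", n, "trials") // prints "deleted 0 trials"
}
```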

Anything else you would like to add:

Environment:

  • Kubeflow version:
  • Minikube version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

@andreyvelich
Member

@gaocegege How did you get this error?
From my understanding, deleteCount can't be greater than len(trials): https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/experiment/experiment_controller.go#L290.

We can check that deleteCount <= len(trials) before running the loop: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/experiment/experiment_controller.go#L369.
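The check-before-the-loop idea could look roughly like this (a hedged sketch with illustrative names, not the actual Katib code): validate the count against the slice length once, and bail out instead of entering the deletion loop.

```go
package main

import (
	"errors"
	"fmt"
)

// checkDeleteCount sketches a pre-loop guard: refuse the operation when
// deleteCount exceeds the number of trials, instead of letting the loop
// index past the end of the slice. Names are illustrative only.
func checkDeleteCount(trials []string, deleteCount int) error {
	if deleteCount > len(trials) {
		return errors.New("deleteCount exceeds number of trials")
	}
	return nil
}

func main() {
	trials := []string{"trial-a"}
	if err := checkDeleteCount(trials, 2); err != nil {
		// Log and skip rather than panic the whole controller.
		fmt.Println("skipping delete:", err)
		return
	}
	// ...deletion loop would run here when the count is valid...
}
```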

@gaocegege
Member Author

Yeah, I am working on it. I do not yet understand how we can hit this case.

/assign

@gaocegege
Member Author

gaocegege commented Jun 19, 2020

Full log is here, not sure if it is related to the webhook failure. https://paste.ubuntu.com/p/csZ3FQktmb/

@andreyvelich
Member

@gaocegege I can't see the latest log lines in your logs.
For example, this line should be printed after you submit the experiment: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/experiment/suggestion/suggestion.go#L43.
Which version are you using?

@gaocegege
Member Author

The experiment has been recreated about 100 times, so that log line is no longer present.

@andreyvelich
Member

@gaocegege Can you share the experiment(s) that you submitted?

@gaocegege
Member Author

spec:
  algorithm:
    algorithmName: random
    algorithmSettings: []
  maxFailedTrialCount: 0
  maxTrialCount: 4
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    goal: 0
    objectiveMetricName: entropy
    type: minimize
  parallelTrialCount: 2
  parameters:
  - feasibleSpace:
      list:
      - "32"
    name: --batch_size
    parameterType: categorical
  - feasibleSpace:
      list:
      - "0.01"
    name: --learning_rate
    parameterType: categorical
  trialTemplate:
    goTemplate:
      rawTemplate: '{"kind":"TFJob","apiVersion":"kubeflow.org/v1","metadata":{"name":"{{.Trial}}","namespace":"{{.NameSpace}}","creationTimestamp":null,"labels":{"clever-mlneuron":"mln-20200618015209-rs6qd8","clever-mltask":"mltask-20200619021235-c7gfd6","experiment":"experiment-20200619021235-tzn7lk9v"},"annotations":{"caicloud.io/extended-resource-scheduler":"true","clever-training-type":"Distributed","clever-work-dir":"","clever.caicloud.io/tenant":"lyh","clever.caicloud.io/user":"admin","helm.sh/namespace":"common1","helm.sh/path":"experiment-20200619021235-tzn7lk9v","helm.sh/release":"experiment-20200619021235-tzn7lk9v"}},"spec":{"tfReplicaSpecs":{"PS":{"replicas":1,"template":{"metadata":{"creationTimestamp":null,"labels":{"experiment":"experiment-20200619021235-tzn7lk9v"},"annotations":{"caicloud.io/extended-resource-scheduler":"true","clever-training-type":"Distributed","clever-work-dir":"","clever.caicloud.io/tenant":"lyh","clever.caicloud.io/user":"admin","helm.sh/namespace":"common1","helm.sh/path":"experiment-20200619021235-tzn7lk9v","helm.sh/release":"experiment-20200619021235-tzn7lk9v"}},"spec":{"volumes":[{"name":"project-pvc-20200619021235-vj8446ql","persistentVolumeClaim":{"claimName":"common1"}}],"containers":[{"name":"tensorflow","image":"cargo.dev.caicloud.xyz/release/tf-dist-mnist-test:1.2","command":["sh","-c","
        python                /var/tf_dist_mnist/dist_mnist.py                --data_dir                /var/tf_dist_mnist/mnist-data                --train_steps                500{{-
        with .HyperParameters}} {{- range .}} {{.Name}}={{.Value}} {{- end}} {{- end}}"],"env":[{"name":"TENSORBOARD_LOG_PATH","value":"/tensorboard_logs"},{"name":"PYTHONUNBUFFERED","value":"1"},{"name":"NVIDIA_VISIBLE_DEVICES"},{"name":"TRIAL_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels[''job-name'']"}}}],"resources":{"limits":{"cpu":"3","memory":"3Gi"},"requests":{"cpu":"2","memory":"2Gi"}},"volumeMounts":[{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/clever","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/Runtime"},{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/tensorboard_logs","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/TensorBoard"}],"lifecycle":{"postStart":{"exec":{"command":["bash","-c","mkdir
        -p /clever/output/models"]}}},"securityContext":{"allowPrivilegeEscalation":false,"procMount":"Default"}}],"schedulerName":"extended-resource-scheduler"}},"restartPolicy":"Never"},"Worker":{"replicas":1,"template":{"metadata":{"creationTimestamp":null,"labels":{"experiment":"experiment-20200619021235-tzn7lk9v"},"annotations":{"caicloud.io/extended-resource-scheduler":"true","clever-training-type":"Distributed","clever-work-dir":"","clever.caicloud.io/tenant":"lyh","clever.caicloud.io/user":"admin","helm.sh/namespace":"common1","helm.sh/path":"experiment-20200619021235-tzn7lk9v","helm.sh/release":"experiment-20200619021235-tzn7lk9v"}},"spec":{"volumes":[{"name":"project-pvc-20200619021235-vj8446ql","persistentVolumeClaim":{"claimName":"common1"}}],"containers":[{"name":"tensorflow","image":"cargo.dev.caicloud.xyz/release/tf-dist-mnist-test:1.2","command":["
        python                /var/tf_dist_mnist/dist_mnist.py                --data_dir                /var/tf_dist_mnist/mnist-data                --train_steps                500{{-
        with .HyperParameters}} {{- range .}} {{.Name}}={{.Value}} {{- end}} {{- end}}"],"env":[{"name":"TENSORBOARD_LOG_PATH","value":"/tensorboard_logs"},{"name":"PYTHONUNBUFFERED","value":"1"},{"name":"NVIDIA_VISIBLE_DEVICES"},{"name":"TRIAL_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels[''job-name'']"}}}],"resources":{"limits":{"cpu":"3","memory":"3Gi"},"requests":{"cpu":"2","memory":"2Gi"}},"volumeMounts":[{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/clever","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/Runtime"},{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/tensorboard_logs","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/TensorBoard"}],"lifecycle":{"postStart":{"exec":{"command":["bash","-c","mkdir
        -p /clever/output/models"]}}},"securityContext":{"allowPrivilegeEscalation":false,"procMount":"Default"}}],"schedulerName":"extended-resource-scheduler"}},"restartPolicy":"Never"}}},"status":{"conditions":null,"replicaStatuses":null}}'

I think it is not caused by misconfiguration.

@johnugeorge
Member

We have to figure out the condition that caused it. It looks like a corner-case issue.

@idahoakl

I'm seeing the same issue pop up every once in a while. The logic that ReconcileTrials uses to decide what to delete is based on the status of the Experiment instance, but that status is only updated if trials are actually found within the cluster (https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/experiment/experiment_controller.go#L266). From my reading, this causes the computed status of the Experiment to drift from the underlying cluster state. That disconnect could explain why the controller believes it needs to delete a trial when there is no trial in the cluster to delete.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

  • area/katib (probability 0.98)

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

@andreyvelich
Member

@idahoakl Thank you for the comment.
Yes, that makes sense to me now.

Maybe if len(trialSlice) < expected, we should at least delete len(trialSlice) trials: https://github.com/kubeflow/katib/pull/1223/files#diff-1921218c9156947874b3589005049e2aR370.
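That suggestion amounts to deleting min(expected, len(trialSlice)) trials. A minimal sketch, with illustrative names (not the code in #1223 itself):

```go
package main

import "fmt"

// deletableTrials sketches the proposed behaviour: when the experiment
// status expects more deletions than trials actually exist in the
// cluster, delete only the trials that are there, i.e.
// min(expected, len(trialSlice)). Names are illustrative only.
func deletableTrials(trialSlice []string, expected int) []string {
	if expected > len(trialSlice) {
		expected = len(trialSlice)
	}
	return trialSlice[:expected]
}

func main() {
	// Status says 3 trials should go, but only 2 exist in the cluster:
	// we delete both instead of panicking on a third that isn't there.
	fmt.Println(deletableTrials([]string{"t1", "t2"}, 3)) // [t1 t2]
}
```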
What do you think @gaocegege @johnugeorge ?

@gaocegege
Member Author

SGTM. BTW, should we fix it in v1alpha3?

@idahoakl

Fixing it in v1alpha3 would be nice, since once the bug is hit Katib is pretty much unusable until someone intervenes manually. A lot of my users have code that creates v1alpha3 objects, and getting them to switch over will take some time.

@gaocegege
Member Author

@idahoakl I think so. #1223 PTAL
