[bug] Runtime panic in v1alpha3 when deleting extra trials #1222
Comments
@gaocegege How did you get this error? We can check it.
Yeah, I am working on it. I do not really understand why we hit it. /assign
The full log is here; I am not sure if it is related to the webhook failure: https://paste.ubuntu.com/p/csZ3FQktmb/
@gaocegege I can't see all of the latest log lines in your logs.
The trial has been recreated about 100 times, so that log line is no longer there.
@gaocegege Can you show the experiment(s) that you have submitted?
spec:
  algorithm:
    algorithmName: random
    algorithmSettings: []
  maxFailedTrialCount: 0
  maxTrialCount: 4
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    goal: 0
    objectiveMetricName: entropy
    type: minimize
  parallelTrialCount: 2
  parameters:
  - feasibleSpace:
      list:
      - "32"
    name: --batch_size
    parameterType: categorical
  - feasibleSpace:
      list:
      - "0.01"
    name: --learning_rate
    parameterType: categorical
  trialTemplate:
    goTemplate:
rawTemplate: '{"kind":"TFJob","apiVersion":"kubeflow.org/v1","metadata":{"name":"{{.Trial}}","namespace":"{{.NameSpace}}","creationTimestamp":null,"labels":{"clever-mlneuron":"mln-20200618015209-rs6qd8","clever-mltask":"mltask-20200619021235-c7gfd6","experiment":"experiment-20200619021235-tzn7lk9v"},"annotations":{"caicloud.io/extended-resource-scheduler":"true","clever-training-type":"Distributed","clever-work-dir":"","clever.caicloud.io/tenant":"lyh","clever.caicloud.io/user":"admin","helm.sh/namespace":"common1","helm.sh/path":"experiment-20200619021235-tzn7lk9v","helm.sh/release":"experiment-20200619021235-tzn7lk9v"}},"spec":{"tfReplicaSpecs":{"PS":{"replicas":1,"template":{"metadata":{"creationTimestamp":null,"labels":{"experiment":"experiment-20200619021235-tzn7lk9v"},"annotations":{"caicloud.io/extended-resource-scheduler":"true","clever-training-type":"Distributed","clever-work-dir":"","clever.caicloud.io/tenant":"lyh","clever.caicloud.io/user":"admin","helm.sh/namespace":"common1","helm.sh/path":"experiment-20200619021235-tzn7lk9v","helm.sh/release":"experiment-20200619021235-tzn7lk9v"}},"spec":{"volumes":[{"name":"project-pvc-20200619021235-vj8446ql","persistentVolumeClaim":{"claimName":"common1"}}],"containers":[{"name":"tensorflow","image":"cargo.dev.caicloud.xyz/release/tf-dist-mnist-test:1.2","command":["sh","-c","
python /var/tf_dist_mnist/dist_mnist.py --data_dir /var/tf_dist_mnist/mnist-data --train_steps 500{{-
with .HyperParameters}} {{- range .}} {{.Name}}={{.Value}} {{- end}} {{- end}}"],"env":[{"name":"TENSORBOARD_LOG_PATH","value":"/tensorboard_logs"},{"name":"PYTHONUNBUFFERED","value":"1"},{"name":"NVIDIA_VISIBLE_DEVICES"},{"name":"TRIAL_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels[''job-name'']"}}}],"resources":{"limits":{"cpu":"3","memory":"3Gi"},"requests":{"cpu":"2","memory":"2Gi"}},"volumeMounts":[{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/clever","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/Runtime"},{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/tensorboard_logs","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/TensorBoard"}],"lifecycle":{"postStart":{"exec":{"command":["bash","-c","mkdir
-p /clever/output/models"]}}},"securityContext":{"allowPrivilegeEscalation":false,"procMount":"Default"}}],"schedulerName":"extended-resource-scheduler"}},"restartPolicy":"Never"},"Worker":{"replicas":1,"template":{"metadata":{"creationTimestamp":null,"labels":{"experiment":"experiment-20200619021235-tzn7lk9v"},"annotations":{"caicloud.io/extended-resource-scheduler":"true","clever-training-type":"Distributed","clever-work-dir":"","clever.caicloud.io/tenant":"lyh","clever.caicloud.io/user":"admin","helm.sh/namespace":"common1","helm.sh/path":"experiment-20200619021235-tzn7lk9v","helm.sh/release":"experiment-20200619021235-tzn7lk9v"}},"spec":{"volumes":[{"name":"project-pvc-20200619021235-vj8446ql","persistentVolumeClaim":{"claimName":"common1"}}],"containers":[{"name":"tensorflow","image":"cargo.dev.caicloud.xyz/release/tf-dist-mnist-test:1.2","command":["
python /var/tf_dist_mnist/dist_mnist.py --data_dir /var/tf_dist_mnist/mnist-data --train_steps 500{{-
with .HyperParameters}} {{- range .}} {{.Name}}={{.Value}} {{- end}} {{- end}}"],"env":[{"name":"TENSORBOARD_LOG_PATH","value":"/tensorboard_logs"},{"name":"PYTHONUNBUFFERED","value":"1"},{"name":"NVIDIA_VISIBLE_DEVICES"},{"name":"TRIAL_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels[''job-name'']"}}}],"resources":{"limits":{"cpu":"3","memory":"3Gi"},"requests":{"cpu":"2","memory":"2Gi"}},"volumeMounts":[{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/clever","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/Runtime"},{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/tensorboard_logs","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/TensorBoard"}],"lifecycle":{"postStart":{"exec":{"command":["bash","-c","mkdir
-p /clever/output/models"]}}},"securityContext":{"allowPrivilegeEscalation":false,"procMount":"Default"}}],"schedulerName":"extended-resource-scheduler"}},"restartPolicy":"Never"}}},"status":{"conditions":null,"replicaStatuses":null}}'
I do not think it is caused by a misconfiguration.
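For context, here is a minimal Go sketch of how the {{- with .HyperParameters}} fragment in the rawTemplate above expands into the training command. The HyperParameter struct is illustrative only and not the actual Katib type:

package main

import (
	"os"
	"text/template"
)

// HyperParameter mirrors the {{.Name}}/{{.Value}} pairs consumed by rawTemplate.
type HyperParameter struct {
	Name  string
	Value string
}

func main() {
	// Same command fragment as in the experiment's rawTemplate.
	const cmd = `python /var/tf_dist_mnist/dist_mnist.py --data_dir /var/tf_dist_mnist/mnist-data --train_steps 500{{- with .HyperParameters}} {{- range .}} {{.Name}}={{.Value}} {{- end}} {{- end}}`

	t := template.Must(template.New("cmd").Parse(cmd))
	data := struct{ HyperParameters []HyperParameter }{
		HyperParameters: []HyperParameter{
			{Name: "--batch_size", Value: "32"},
			{Name: "--learning_rate", Value: "0.01"},
		},
	}
	// Prints: python ... --train_steps 500 --batch_size=32 --learning_rate=0.01
	_ = t.Execute(os.Stdout, data)
}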
We have to figure out the condition that causes it. It looks like a corner-case issue.
I'm seeing the same issue pop up every once in a while. The logic that
@idahoakl Thank you for the comment. Maybe if
SGTM. BTW, should we fix it in v1alpha3?
Fixing it in v1alpha3 would be nice, since once the bug is hit Katib is pretty much useless until someone intervenes manually. A lot of my users have code that creates v1alpha3 objects, and getting them to change over will take some time.
/kind bug
What steps did you take and what happened:
[A clear and concise description of what the bug is.]
What did you expect to happen:
No panic. We should check the length of the slice.
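For illustration, a minimal Go sketch of the kind of length check being suggested; the names here (deleteExtraTrials, trials, expected) are hypothetical and are not taken from the actual Katib controller:

package main

import "fmt"

// deleteExtraTrials drops any trials beyond the expected count.
// Guarding on len(trials) avoids an index-out-of-range panic when
// fewer trials exist than the count we were asked to keep.
func deleteExtraTrials(trials []string, expected int) []string {
	if expected < 0 || expected >= len(trials) {
		// Nothing to delete: the slice is already at or below the expected size.
		return trials
	}
	for _, t := range trials[expected:] {
		fmt.Printf("deleting trial %s\n", t)
	}
	return trials[:expected]
}

func main() {
	trials := []string{"trial-a", "trial-b"}
	// Asking to keep 4 trials while only 2 exist must not panic.
	fmt.Println(deleteExtraTrials(trials, 4))
}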
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Environment:
Kubernetes version (kubectl version):
OS (/etc/os-release):