[bug] Runtime panic in v1alpha3 when deleting extra trials #1222
Comments
@gaocegege How did you get this error? We can check it.
Yeah, I am working on it. I do not really understand why we hit it. /assign
The full log is here; I am not sure if it is related to the webhook failure: https://paste.ubuntu.com/p/csZ3FQktmb/
@gaocegege I can't see all of the latest log lines in your logs.
The trial has been recreated about 100 times, so that log line is no longer there.
@gaocegege Can you show the experiment(s) that you have submitted?
spec:
  algorithm:
    algorithmName: random
    algorithmSettings: []
  maxFailedTrialCount: 0
  maxTrialCount: 4
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    goal: 0
    objectiveMetricName: entropy
    type: minimize
  parallelTrialCount: 2
  parameters:
  - feasibleSpace:
      list:
      - "32"
    name: --batch_size
    parameterType: categorical
  - feasibleSpace:
      list:
      - "0.01"
    name: --learning_rate
    parameterType: categorical
  trialTemplate:
    goTemplate:
rawTemplate: '{"kind":"TFJob","apiVersion":"kubeflow.org/v1","metadata":{"name":"{{.Trial}}","namespace":"{{.NameSpace}}","creationTimestamp":null,"labels":{"clever-mlneuron":"mln-20200618015209-rs6qd8","clever-mltask":"mltask-20200619021235-c7gfd6","experiment":"experiment-20200619021235-tzn7lk9v"},"annotations":{"caicloud.io/extended-resource-scheduler":"true","clever-training-type":"Distributed","clever-work-dir":"","clever.caicloud.io/tenant":"lyh","clever.caicloud.io/user":"admin","helm.sh/namespace":"common1","helm.sh/path":"experiment-20200619021235-tzn7lk9v","helm.sh/release":"experiment-20200619021235-tzn7lk9v"}},"spec":{"tfReplicaSpecs":{"PS":{"replicas":1,"template":{"metadata":{"creationTimestamp":null,"labels":{"experiment":"experiment-20200619021235-tzn7lk9v"},"annotations":{"caicloud.io/extended-resource-scheduler":"true","clever-training-type":"Distributed","clever-work-dir":"","clever.caicloud.io/tenant":"lyh","clever.caicloud.io/user":"admin","helm.sh/namespace":"common1","helm.sh/path":"experiment-20200619021235-tzn7lk9v","helm.sh/release":"experiment-20200619021235-tzn7lk9v"}},"spec":{"volumes":[{"name":"project-pvc-20200619021235-vj8446ql","persistentVolumeClaim":{"claimName":"common1"}}],"containers":[{"name":"tensorflow","image":"cargo.dev.caicloud.xyz/release/tf-dist-mnist-test:1.2","command":["sh","-c","
python /var/tf_dist_mnist/dist_mnist.py --data_dir /var/tf_dist_mnist/mnist-data --train_steps 500{{-
with .HyperParameters}} {{- range .}} {{.Name}}={{.Value}} {{- end}} {{- end}}"],"env":[{"name":"TENSORBOARD_LOG_PATH","value":"/tensorboard_logs"},{"name":"PYTHONUNBUFFERED","value":"1"},{"name":"NVIDIA_VISIBLE_DEVICES"},{"name":"TRIAL_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels[''job-name'']"}}}],"resources":{"limits":{"cpu":"3","memory":"3Gi"},"requests":{"cpu":"2","memory":"2Gi"}},"volumeMounts":[{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/clever","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/Runtime"},{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/tensorboard_logs","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/TensorBoard"}],"lifecycle":{"postStart":{"exec":{"command":["bash","-c","mkdir
-p /clever/output/models"]}}},"securityContext":{"allowPrivilegeEscalation":false,"procMount":"Default"}}],"schedulerName":"extended-resource-scheduler"}},"restartPolicy":"Never"},"Worker":{"replicas":1,"template":{"metadata":{"creationTimestamp":null,"labels":{"experiment":"experiment-20200619021235-tzn7lk9v"},"annotations":{"caicloud.io/extended-resource-scheduler":"true","clever-training-type":"Distributed","clever-work-dir":"","clever.caicloud.io/tenant":"lyh","clever.caicloud.io/user":"admin","helm.sh/namespace":"common1","helm.sh/path":"experiment-20200619021235-tzn7lk9v","helm.sh/release":"experiment-20200619021235-tzn7lk9v"}},"spec":{"volumes":[{"name":"project-pvc-20200619021235-vj8446ql","persistentVolumeClaim":{"claimName":"common1"}}],"containers":[{"name":"tensorflow","image":"cargo.dev.caicloud.xyz/release/tf-dist-mnist-test:1.2","command":["
python /var/tf_dist_mnist/dist_mnist.py --data_dir /var/tf_dist_mnist/mnist-data --train_steps 500{{-
with .HyperParameters}} {{- range .}} {{.Name}}={{.Value}} {{- end}} {{- end}}"],"env":[{"name":"TENSORBOARD_LOG_PATH","value":"/tensorboard_logs"},{"name":"PYTHONUNBUFFERED","value":"1"},{"name":"NVIDIA_VISIBLE_DEVICES"},{"name":"TRIAL_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels[''job-name'']"}}}],"resources":{"limits":{"cpu":"3","memory":"3Gi"},"requests":{"cpu":"2","memory":"2Gi"}},"volumeMounts":[{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/clever","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/Runtime"},{"name":"project-pvc-20200619021235-vj8446ql","mountPath":"/tensorboard_logs","subPath":"MLNeurons/mln-20200618015209-rs6qd8/mltask-20200619021235-c7gfd6/TensorBoard"}],"lifecycle":{"postStart":{"exec":{"command":["bash","-c","mkdir
-p /clever/output/models"]}}},"securityContext":{"allowPrivilegeEscalation":false,"procMount":"Default"}}],"schedulerName":"extended-resource-scheduler"}},"restartPolicy":"Never"}}},"status":{"conditions":null,"replicaStatuses":null}}'
I do not think it is caused by a misconfiguration.
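For context, here is a minimal Go sketch of how the {{- with .HyperParameters}} fragment in the rawTemplate above expands into the training command. The HyperParameter struct is illustrative only and not the actual Katib type:

package main

import (
	"os"
	"text/template"
)

// HyperParameter mirrors the {{.Name}}/{{.Value}} pairs consumed by rawTemplate.
type HyperParameter struct {
	Name  string
	Value string
}

func main() {
	// Same command fragment as in the experiment's rawTemplate.
	const cmd = `python /var/tf_dist_mnist/dist_mnist.py --data_dir /var/tf_dist_mnist/mnist-data --train_steps 500{{- with .HyperParameters}} {{- range .}} {{.Name}}={{.Value}} {{- end}} {{- end}}`

	t := template.Must(template.New("cmd").Parse(cmd))
	data := struct{ HyperParameters []HyperParameter }{
		HyperParameters: []HyperParameter{
			{Name: "--batch_size", Value: "32"},
			{Name: "--learning_rate", Value: "0.01"},
		},
	}
	// Prints: python ... --train_steps 500 --batch_size=32 --learning_rate=0.01
	_ = t.Execute(os.Stdout, data)
}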
We have to figure out the condition that causes it. It looks like a corner-case issue.
I'm seeing the same issue pop up every once in a while. The logic that
@idahoakl Thank you for the comment. Maybe if
SGTM. BTW, should we fix it in v1alpha3?
Fixing it in v1alpha3 would be nice, since once the bug is hit Katib is pretty much useless until someone intervenes manually. A lot of my users have code that creates v1alpha3 objects, and getting them to change over will take some time.
/kind bug
What steps did you take and what happened:
[A clear and concise description of what the bug is.]
What did you expect to happen:
No panic. We should check the length of the slice.
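For illustration, a minimal Go sketch of the kind of length check being suggested; the names here (deleteExtraTrials, trials, expected) are hypothetical and are not taken from the actual Katib controller:

package main

import "fmt"

// deleteExtraTrials drops any trials beyond the expected count.
// Guarding on len(trials) avoids an index-out-of-range panic when
// fewer trials exist than the count we were asked to keep.
func deleteExtraTrials(trials []string, expected int) []string {
	if expected < 0 || expected >= len(trials) {
		// Nothing to delete: the slice is already at or below the expected size.
		return trials
	}
	for _, t := range trials[expected:] {
		fmt.Printf("deleting trial %s\n", t)
	}
	return trials[:expected]
}

func main() {
	trials := []string{"trial-a", "trial-b"}
	// Asking to keep 4 trials while only 2 exist must not panic.
	fmt.Println(deleteExtraTrials(trials, 4))
}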
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Environment:
Kubernetes version (kubectl version):
OS (/etc/os-release):