AutoML jobs time out in pipeline until katib-controller is restarted #459
Comments
Hmm, interesting. I've hit the timeout 100% of the time so far.
Yup, kfserving failure is expected.
I can't think of anything off the top of my head. I just used the default values, and it shouldn't affect the pipeline execution since it just creates a pipeline rather than running it, if I'm not mistaken. Just to confirm, have you used microk8s?
Yes, I used microk8s 1.21/stable.
Hmm, so the key to working around the issue might be restarting/recreating the katib-controller pod. It's puzzling why it's reproducible on my testbed but not in the other environment, though.
The controller was restarted around 15:31.
I've reproduced it successfully in a clean environment. The steps are almost identical to the ones in the description. Hope it helps you reproduce it on your end, @natalian98
Then open
Thanks for providing the detailed steps @nobuto-m. I deployed kubeflow once more using your instructions and the trials still succeed, provided that the notebook's CPU and memory values are increased. When using the default CPU=0.5 and memory=1Gi, some garbage collection errors can be observed, which may be the reason why the trials get stuck. Katib-controller wasn't restarted. After creating a notebook with 2 CPUs and 4Gi memory, the trials succeed.
Thanks for testing. I bumped CPU and memory to 4 CPUs and 8Gi of memory, but it still doesn't work for me. Also, I'm confused because I thought the notebook instance was irrelevant to the pipeline run. The pipeline is defined from the notebook, but the notebook instance can be deleted before re-running the pipeline, if I'm not mistaken. So I'm wondering how the spec of the notebook instance affects the pipeline. Am I missing something?
Which log did you see this in? It might not be the notebook instance, but if there was any error from any component, we can dig in.
I am experiencing a similar failure. katib-controller would deploy successfully, but the unit would then go into CrashLoopBackOff after submitting an AutoML experiment with all the default configurations (pod logs provided below). Logs from katib-controller:
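For anyone trying to collect the same controller logs, a minimal sketch, assuming a microk8s/Juju deployment where katib-controller runs in the kubeflow model namespace (the pod name below is a placeholder to fill in from the first command):

# Find the katib-controller pod (assumption: the charm deploys it into the kubeflow namespace under a name containing "katib-controller").
microk8s kubectl -n kubeflow get pods | grep katib-controller
# Current logs, plus logs from the previous (crashed) container instance.
microk8s kubectl -n kubeflow logs <katib-controller-pod> --tail=200
microk8s kubectl -n kubeflow logs <katib-controller-pod> --previous --tail=200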
How to reproduce (based on the quickstart doc):
juju bootstrap microk8s
juju add-model kubeflow
juju deploy --trust kubeflow
juju config dex-auth public-url=http://10.64.140.43.nip.io
juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io
juju config dex-auth static-username=admin
juju config dex-auth static-password=admin
Create a notebook with access-ml-pipeline: "true" (the PodDefault for "Allow access to Kubeflow Pipelines" in the Kubeflow UI), then run the example notebook:
https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb
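While the pipeline runs, the objects it creates can be watched from the CLI. A minimal sketch, assuming the experiment lands in the admin profile namespace (the static-username configured above); adjust the namespace to wherever the pipeline pods actually run:

# Katib experiment, suggestion, and trial objects created by the pipeline.
microk8s kubectl -n admin get experiments,suggestions,trials
# Argo workflows backing the run, and the trial pods themselves.
microk8s kubectl -n admin get workflows
microk8s kubectl -n admin get pods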
Expected:
All trials are complete.
Solution/workaround:
Once Workflow.v1alpha1.argoproj.io is added to the trial-resources of katib-controller, all trials complete:
https://github.com/kubeflow/katib/blob/master/examples/v1beta1/argo/README.md#katib-controller
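A minimal sketch of that workaround, assuming katib-controller runs as a Deployment of that name in the kubeflow namespace, with the controller container first in the pod spec and an existing args list; since the charm manages this workload, an upgrade or re-deploy may revert the change:

# Allow the Katib controller to manage Argo Workflows as trial resources
# (see the argo README linked above).
microk8s kubectl -n kubeflow patch deployment katib-controller --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/args/-",
   "value": "--trial-resources=Workflow.v1alpha1.argoproj.io"}
]'
# Rolling out the patch also recreates the katib-controller pod, which is
# relevant to the "red herring" point below.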
Since both katib-controller and argo are managed by Juju charms, it would be good to see some improvements in this user scenario. For the record, Katib Metrics Collector sidecar injection is enabled out of the box.
OR
Simply restart katib-controller without changing anything. Adding Workflow.v1alpha1.argoproj.io might be a red herring, since applying it actually recreates the pod.
[out of the box]
[patched]
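A minimal sketch of the restart-only workaround, under the same assumption that the controller runs as a Deployment named katib-controller in the kubeflow namespace (adjust the resource type/name if the charm deploys it differently):

# Recreate the katib-controller pod without changing any configuration.
microk8s kubectl -n kubeflow rollout restart deployment/katib-controller
# Or simply delete the pod and let its controller bring it back:
# microk8s kubectl -n kubeflow delete pod <katib-controller-pod>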