AutoML jobs times out in pipeline until katib-controller is restarted #459

Open
nobuto-m opened this issue Jun 5, 2022 · 9 comments
Labels
bug Something isn't working

Comments

@nobuto-m

nobuto-m commented Jun 5, 2022

How to reproduce (based on the quickstart doc):

  1. sudo snap install microk8s --classic --channel 1.21
  2. microk8s enable dns storage ingress metallb:10.64.140.43-10.64.140.49
  3. deploy kubeflow:
    juju bootstrap microk8s
    juju add-model kubeflow
    juju deploy --trust kubeflow
    juju config dex-auth public-url=http://10.64.140.43.nip.io
    juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io
    juju config dex-auth static-username=admin
    juju config dex-auth static-password=admin
  4. juju refresh kfp-profile-controller --channel edge (to apply access-ml-pipeline: "true" for PodDefault)
  5. Create a Jupyter notebook with "Allow access to Kubeflow Pipelines" enabled in the Kubeflow UI
  6. Import an example kubeflow pipeline notebook and run it from the top:
    https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb
  7. An experiment (AutoML) job is created as part of the pipeline, but no trial completes and the run eventually hits the timeout (60 min)

Expected:
All trials are complete.

Solution/workaround:

Once Workflow.v1alpha1.argoproj.io is added to the trial-resources of katib-controller, all trials complete.
https://github.com/kubeflow/katib/blob/master/examples/v1beta1/argo/README.md#katib-controller
Since both katib-controller and argo are managed by Juju charms, it would be good to see some improvements in this user scenario. For the record, Katib Metrics Collector sidecar injection is enabled out of the box.
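
For reference, the change described above amounts to appending one argument to the katib-controller container. A minimal sketch of building that JSON Patch payload follows; the function name `trial_resource_patch` and the default container index of 0 are assumptions for illustration, not part of Katib or the charm.

```python
import json


def trial_resource_patch(resource: str, container_index: int = 0) -> str:
    """Return a JSON Patch that appends one --trial-resources flag to the container args."""
    patch = [
        {
            "op": "add",
            # In JSON Patch (RFC 6902) a trailing "-" means "append to the array".
            "path": f"/spec/template/spec/containers/{container_index}/args/-",
            "value": f"--trial-resources={resource}",
        }
    ]
    return json.dumps(patch)


if __name__ == "__main__":
    # The printed string is the kind of payload `kubectl patch --type=json -p=...` takes.
    print(trial_resource_patch("Workflow.v1alpha1.argoproj.io"))
```

Generating the payload avoids hand-escaping the JSON on the command line; it does not make the change persistent if the deployment is regenerated later.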

OR

Simply restart katib-controller without changing anything. Adding Workflow.v1alpha1.argoproj.io might be a red herring, since patching the deployment also recreates the pod.

kubectl -n kubeflow rollout restart deployment/katib-controller
$ microk8s kubectl get namespace admin -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    owner: admin
  creationTimestamp: "2022-06-05T14:29:50Z"
  labels:
    app.kubernetes.io/part-of: kubeflow-profile
    istio-injection: enabled
    katib-metricscollector-injection: enabled
    katib.kubeflow.org/metrics-collector-injection: enabled
    kubernetes.io/metadata.name: admin
    pipelines.kubeflow.org/enabled: "true"
    serving.kubeflow.org/inferenceservice: enabled
...
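
A small sketch of checking the labels shown above programmatically, for anyone comparing namespaces between environments. The label keys are copied from the dump; the helper name and the rule that only the value "enabled" counts are assumptions of this sketch.

```python
# Label keys copied from the namespace dump above. Treating "enabled" as the
# only accepted value is an assumption of this sketch.
INJECTION_LABELS = (
    "katib.kubeflow.org/metrics-collector-injection",
    "katib-metricscollector-injection",
)


def metrics_collector_injection_enabled(labels: dict) -> bool:
    """Return True if any known Katib injection label is set to 'enabled'."""
    return any(labels.get(key) == "enabled" for key in INJECTION_LABELS)


if __name__ == "__main__":
    # Subset of the labels on the "admin" namespace.
    admin_labels = {
        "istio-injection": "enabled",
        "katib-metricscollector-injection": "enabled",
        "katib.kubeflow.org/metrics-collector-injection": "enabled",
    }
    print(metrics_collector_injection_enabled(admin_labels))
```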

[out of the box]

$ pgrep -af katib-controller
54059 /bin/sh -c export JUJU_DATA_DIR=/var/lib/juju export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools  mkdir -p $JUJU_TOOLS_DIR cp /opt/jujud $JUJU_TOOLS_DIR/jujud  $JUJU_TOOLS_DIR/jujud caasoperator --application-name=katib-controller --debug 
54095 /var/lib/juju/tools/jujud caasoperator --application-name=katib-controller --debug
80781 ./katib-controller --webhook-port=443 --trial-resources=Job.v1.batch --trial-resources=TFJob.v1.kubeflow.org --trial-resources=PyTorchJob.v1.kubeflow.org --trial-resources=MPIJob.v1.kubeflow.org --trial-resources=PipelineRun.v1beta1.tekton.dev

[patched]

$ kubectl patch Deployment katib-controller -n kubeflow --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--trial-resources=Workflow.v1alpha1.argoproj.io"}]'
$ pgrep -af katib-controller
54059 /bin/sh -c export JUJU_DATA_DIR=/var/lib/juju export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools  mkdir -p $JUJU_TOOLS_DIR cp /opt/jujud $JUJU_TOOLS_DIR/jujud  $JUJU_TOOLS_DIR/jujud caasoperator --application-name=katib-controller --debug 
54095 /var/lib/juju/tools/jujud caasoperator --application-name=katib-controller --debug
808728 ./katib-controller --webhook-port=443 --trial-resources=Job.v1.batch --trial-resources=TFJob.v1.kubeflow.org --trial-resources=PyTorchJob.v1.kubeflow.org --trial-resources=MPIJob.v1.kubeflow.org --trial-resources=PipelineRun.v1beta1.tekton.dev --trial-resources=Workflow.v1alpha1.argoproj.io
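
The two process listings differ only in their --trial-resources flags, which is easy to miss by eye. A rough sketch of extracting and comparing those flags is below; the helper name is hypothetical and the sample command lines are shortened versions of the ones above.

```python
import shlex


def trial_resources(command_line: str) -> list:
    """Collect the values of every --trial-resources=... flag, in order."""
    prefix = "--trial-resources="
    return [
        token[len(prefix):]
        for token in shlex.split(command_line)
        if token.startswith(prefix)
    ]


# Shortened sample command lines (assumption: the real ones carry more flags).
before = (
    "./katib-controller --webhook-port=443 "
    "--trial-resources=Job.v1.batch --trial-resources=TFJob.v1.kubeflow.org"
)
after = before + " --trial-resources=Workflow.v1alpha1.argoproj.io"

# Resources present after the patch but not before it.
added = [r for r in trial_resources(after) if r not in trial_resources(before)]
print(added)  # ['Workflow.v1alpha1.argoproj.io']
```
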
@natalian98
Contributor

I tried to reproduce your issue but in my case the trials get completed without patching the katib deployment:
image

The run fails due to kfserving being unavailable, but there seems to be no issue with katib.
image

Could you share your notebook configuration (e.g. image, cpu, ram)?

@nobuto-m
Author

nobuto-m commented Jun 7, 2022

> I tried to reproduce your issue but in my case the trials get completed without patching the katib deployment:

Hmm, interesting. I had the timeout 100% so far.

> The run fails due to kfserving being unavailable, but there seems to be no issue with katib.

Yup, kfserving failure is expected.

> Could you share your notebook configuration (e.g. image, cpu, ram)?

Nothing comes to mind. I just used the default values, and they shouldn't affect the pipeline execution, since the notebook only creates the pipeline rather than running it, if I'm not mistaken.

Just to confirm, have you used microk8s?

@natalian98
Contributor

natalian98 commented Jun 7, 2022

> Just to confirm, have you used microk8s?

Yes, I used microk8s 1.21/stable.

@nobuto-m
Author

nobuto-m commented Jun 7, 2022

Hmm, trial-resources=Workflow.v1alpha1.argoproj.io might be a red herring. The issue is still reproducible in my environment, but after adding trial-resources=Workflow.v1alpha1.argoproj.io and removing it again, the trial jobs complete.

So the key to working around the issue might be restarting/recreating the katib-controller pod. It's puzzling why it's reproducible on my testbed but not in the other environment, though.

@nobuto-m
Author

nobuto-m commented Jun 7, 2022

> So the key to working around the issue might be restarting/recreating the katib-controller pod. It's puzzling why it's reproducible on my testbed but not in the other environment, though.

microk8s kubectl -n kubeflow rollout restart deployment/katib-controller somehow does the trick and unsticks the trials that were not completing.

The controller was restarted around 15:31.

2022-06-07T15:31:48.597347405Z stderr F 2022-06-07 15:31:48 WARNING juju.worker.caasoperator caasoperator.go:554 stopping uniter for dead unit "katib-controller/0": worker "katib-controller/0" not found

containers.log

@nobuto-m changed the title from "katib-controller doesn't have trial-resources=Workflow.v1alpha1.argoproj.io out of the box" to "AutoML jobs times out in pipeline until katib-controller is restarted" on Jun 9, 2022
@nobuto-m
Author

nobuto-m commented Jun 9, 2022

I've successfully reproduced it in a clean environment. The steps are almost identical to the ones in the description. I hope this helps you reproduce it on your end, @natalian98.

  1. Launch an AWS instance with the following config

    • Instance type: t3.2xlarge
    • AMI ID: ami-0a3eb6ca097b78895
    • AMI name: ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20220419
    • storage: 64GB
  2. Create a group in advance

    sudo addgroup --system microk8s
    sudo adduser $USER microk8s
  3. logout and login again

  4. Run a script

    git clone https://github.com/nobuto-m/quick-kubeflow.git
    cd quick-kubeflow
    git checkout 376c501
    time ./redeploy-microk8s-kubeflow.sh
    ## -> 32 min
  5. Connect from local laptop/desktop

    sshuttle -r ubuntu@PUBLIC_IP_OF_AWS_INSTANCE 10.64.140.43

    Then open
    http://10.64.140.43.nip.io/

  6. Create a Jupyter notebook instance

    • name: first-notebook
    • image: j1r0q0g6/notebooks/notebook-servers/jupyter-scipy:v1.4
    • configuration: Allow access to Kubeflow Pipelines - ✓
  7. Import and run the notebook
    https://raw.githubusercontent.com/kubeflow/katib/fe2ae99d5b8c58a0f56221bb9a58afc131bfafc4/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb

@natalian98
Contributor

Thanks for providing the detailed steps @nobuto-m.

I deployed kubeflow once more using your instructions and the trials still succeed, provided that the notebook's CPU and memory values are increased.

When using the default CPU==0.5 and memory==1Gi, some garbage collection errors can be observed, which may be the reason why the trials get stuck. Katib-controller wasn't restarted.

After creating a notebook with 2 CPUs and 4Gi memory, the trials succeed.

@nobuto-m
Author

> I deployed kubeflow once more using your instructions and the trials still succeed, provided that the notebook's CPU and memory values are increased.
>
> When using the default CPU==0.5 and memory==1Gi, some garbage collection errors can be observed, which may be the reason why the trials get stuck. Katib-controller wasn't restarted.
>
> After creating a notebook with 2 CPUs and 4Gi memory, the trials succeed.

Thanks for testing. I bumped CPU and memory to 4 CPUs and 8Gi memory, but still it doesn't work for me.

Also, I'm confused because I thought a notebook instance was irrelevant to the pipeline run. The pipeline is defined from the notebook, but the notebook instance can be deleted before re-running the pipeline if I'm not mistaken. So I'm wondering how the spec of the notebook instance affects the pipeline. Am I missing something?

> some garbage collection errors can be observed

Which log did you see this in? It might not be the notebook instance, but if there was any error from any component, we can dig in.

@agathanatasha
Contributor

agathanatasha commented Jul 29, 2022

I am experiencing a similar failure. katib-controller deploys successfully, but the unit goes into CrashLoopBackOff after I submit an AutoML experiment with all the default configurations (pod logs provided below).
Patching the katib-controller deployment with trial resources seems to resolve that error. Restarting the deployment doesn't work for me.

logs from katib-controller
ubuntu@ip-172-31-26-151:~$ uk -n kubeflow logs katib-controller-f5f96cfdd-z6pbd 
{"level":"info","ts":1659111913.112697,"logger":"entrypoint","msg":"Config:","experiment-suggestion-name":"default","webhook-port":443,"metrics-addr":":8080","inject-security-context":false,"enable-grpc-probe-in-suggestion":true,"trial-resources":[{"Group":"batch","Version":"v1","Kind":"Job"},{"Group":"kubeflow.org","Version":"v1","Kind":"TFJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"PyTorchJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"MPIJob"},{"Group":"tekton.dev","Version":"v1beta1","Kind":"PipelineRun"}]}
I0729 16:25:14.165000       1 request.go:655] Throttling request took 1.040866117s, request: GET:https://10.152.183.1:443/apis/admissionregistration.k8s.io/v1?timeout=32s
{"level":"info","ts":1659111914.5194237,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1659111914.5197005,"logger":"entrypoint","msg":"Registering Components."}
{"level":"info","ts":1659111914.5198681,"logger":"entrypoint","msg":"Setting up controller."}
{"level":"info","ts":1659111914.5198922,"logger":"experiment-controller","msg":"Using the default suggestion implementation"}
{"level":"info","ts":1659111914.5199733,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5200028,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5200174,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5200253,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5200374,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.520045,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5200517,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5200558,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5200624,"logger":"experiment-controller","msg":"Experiment controller created"}
{"level":"info","ts":1659111914.5200937,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.520104,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5201106,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5201147,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5201201,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5201275,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5201344,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5201378,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5201442,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5201483,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5201523,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5201564,"logger":"suggestion-controller","msg":"Suggestion controller created"}
{"level":"info","ts":1659111914.5202193,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5202289,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.520279,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.520288,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5203009,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5203068,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"batch","CRD Version":"v1","CRD Kind":"Job"}
{"level":"info","ts":1659111914.5203474,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5203562,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5203605,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5203655,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"TFJob"}
{"level":"info","ts":1659111914.5203984,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5204074,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.5204117,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111914.520415,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"PyTorchJob"}
{"level":"info","ts":1659111917.416891,"logger":"trial-controller","msg":"Job watch error. CRD might be missing. Please install CRD and restart katib-controller","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"MPIJob"}
{"level":"info","ts":1659111920.3188462,"logger":"trial-controller","msg":"Job watch error. CRD might be missing. Please install CRD and restart katib-controller","CRD Group":"tekton.dev","CRD Version":"v1beta1","CRD Kind":"PipelineRun"}
{"level":"info","ts":1659111920.318882,"logger":"trial-controller","msg":"Trial controller created"}
{"level":"info","ts":1659111920.3188865,"logger":"entrypoint","msg":"Setting up webhooks."}
{"level":"info","ts":1659111920.3188999,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3190033,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-experiment"}
{"level":"info","ts":1659111920.31902,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3190272,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3190315,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3190677,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3190932,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-experiment"}
{"level":"info","ts":1659111920.3190963,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3191006,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.319103,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3191202,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3191442,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-pod"}
{"level":"info","ts":1659111920.319147,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.319153,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3191555,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3191712,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
{"level":"info","ts":1659111920.3191738,"logger":"entrypoint","msg":"Starting the Cmd."}
{"level":"info","ts":1659111920.3196013,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1659111920.3195987,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":1659111920.3198805,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1659111920.3201926,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
{"level":"info","ts":1659111920.3203773,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":443}
{"level":"info","ts":1659111920.3206003,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1659111920.320677,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1659111920.3208,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1659111920.421475,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: batch/v1, Kind=Job"}
{"level":"info","ts":1659111920.4215267,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1659111920.4215186,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1659111920.4215727,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1659111920.421748,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting Controller"}
{"level":"info","ts":1659111920.5224676,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1659111920.5224283,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=TFJob"}
{"level":"info","ts":1659111920.6236901,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1659111920.6236699,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=PyTorchJob"}
{"level":"info","ts":1659111920.724951,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1659111920.7249997,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting Controller"}
{"level":"info","ts":1659111920.7250733,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting Controller"}
{"level":"info","ts":1659111920.725114,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1659111920.7251475,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1659111920.7252057,"logger":"experiment-controller","msg":"Statistics","Experiment":"admin/random-experiment","requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":1659111920.7252252,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"admin/random-experiment","addCount":3}
{"level":"info","ts":1659111920.7252333,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"admin/random-experiment","name":"random-experiment","Suggestion Requests":3}
E0729 16:25:20.725373       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 755 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x166dbc0, 0x2730890})
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000548ac0})
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x75
panic({0x166dbc0, 0x2730890})
        /usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest.(*DefaultGenerator).GetTrialTemplate(0xc0005a9750, 0xc000ac93d0)
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest/generator.go:199 +0xad
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest.(*DefaultGenerator).applyParameters(0x0, 0xc000884dc0, {0xc000290ec0, 0x1a}, {0xc0008d3729, 0x5}, {0xc00037b440, 0x3, 0x0})
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest/generator.go:100 +0x66
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest.(*DefaultGenerator).GetRunSpecWithHyperParameters(0x1b39630, 0xc000884dc0, {0xc000290ec0, 0x1a}, {0xc0008d3729, 0x5}, {0xc00037b440, 0x14, 0x300000000})
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest/generator.go:81 +0x45
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).createTrialInstance(0xc000280480, 0xc000884dc0, 0xc000ac98a0)
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller_util.go:61 +0x2d0
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).createTrials(0xc0005cf8f0, 0xc000884dc0, {0x27879b8, 0x0, 0x8}, 0x8)
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:358 +0x39a
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).ReconcileTrials(0xc0005a0910, 0xc000884dc0, {0x27879b8, 0x0, 0x0})
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:335 +0x62c
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).ReconcileExperiment(0xc000280480, 0xc000884dc0)
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:281 +0x2cf
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).Reconcile(0xc000280480, {0x1aee138, 0xc0004b74a0}, {{{0xc0008d3729, 0x5}, {0xc00062b7e8, 0x11}}})
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:239 +0x61c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0007075e0, {0x1aee090, 0xc00092bd00}, {0x16cf5c0, 0xc000548ac0})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:297 +0x303
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0007075e0, {0x1aee090, 0xc00092bd00})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:252 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2({0x1aee090, 0xc00092bd00})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:215 +0x46
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f1895b22b98)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0, {0x1ac26a0, 0xc0004b7440}, 0x1, 0xc000560540)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000098ae0, 0x3b9aca00, 0x0, 0x20, 0x0)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x1aee090, 0xc00092bd00}, 0xc0005cf790, 0x0, 0x0, 0x20)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x99
k8s.io/apimachinery/pkg/util/wait.UntilWithContext({0x1aee090, 0xc00092bd00}, 0xc0009adce0, 0xc0009eb728)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99 +0x2b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:212 +0x356
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x12a456d]

goroutine 755 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000548ac0})
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x166dbc0, 0x2730890})
        /usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest.(*DefaultGenerator).GetTrialTemplate(0xc0005a9750, 0xc000ac93d0)
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest/generator.go:199 +0xad
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest.(*DefaultGenerator).applyParameters(0x0, 0xc000884dc0, {0xc000290ec0, 0x1a}, {0xc0008d3729, 0x5}, {0xc00037b440, 0x3, 0x0})
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest/generator.go:100 +0x66
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest.(*DefaultGenerator).GetRunSpecWithHyperParameters(0x1b39630, 0xc000884dc0, {0xc000290ec0, 0x1a}, {0xc0008d3729, 0x5}, {0xc00037b440, 0x14, 0x300000000})
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/manifest/generator.go:81 +0x45
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).createTrialInstance(0xc000280480, 0xc000884dc0, 0xc000ac98a0)
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller_util.go:61 +0x2d0
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).createTrials(0xc0005cf8f0, 0xc000884dc0, {0x27879b8, 0x0, 0x8}, 0x8)
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:358 +0x39a
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).ReconcileTrials(0xc0005a0910, 0xc000884dc0, {0x27879b8, 0x0, 0x0})
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:335 +0x62c
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).ReconcileExperiment(0xc000280480, 0xc000884dc0)
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:281 +0x2cf
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).Reconcile(0xc000280480, {0x1aee138, 0xc0004b74a0}, {{{0xc0008d3729, 0x5}, {0xc00062b7e8, 0x11}}})
        /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:239 +0x61c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0007075e0, {0x1aee090, 0xc00092bd00}, {0x16cf5c0, 0xc000548ac0})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:297 +0x303
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0007075e0, {0x1aee090, 0xc00092bd00})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:252 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2({0x1aee090, 0xc00092bd00})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:215 +0x46
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f1895b22b98)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0, {0x1ac26a0, 0xc0004b7440}, 0x1, 0xc000560540)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000098ae0, 0x3b9aca00, 0x0, 0x20, 0x0)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x1aee090, 0xc00092bd00}, 0xc0005cf790, 0x0, 0x0, 0x20)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x99
k8s.io/apimachinery/pkg/util/wait.UntilWithContext({0x1aee090, 0xc00092bd00}, 0xc0009adce0, 0xc0009eb728)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99 +0x2b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:212 +0x356
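
A log like the one above is long enough that the few relevant entries are easy to overlook. The sketch below pulls out three things: the trial resources from the "Config:" entry, the kinds whose watch failed, and whether a panic line is present. The function name `triage` and the shape of its return value are assumptions for illustration, and the sample lines are shortened copies of entries from the log. It only summarises the log; it does not show that the failed watches cause the panic.

```python
import json


def triage(log_text: str) -> dict:
    """Summarise configured trial resources, failed watches and panics in a log dump."""
    configured, unwatched, panicked = [], [], False
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("panic:"):
            panicked = True
        if not line.startswith("{"):
            continue  # klog lines and Go stack frames are not JSON
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        msg = entry.get("msg", "")
        if msg == "Config:":
            configured = [r["Kind"] for r in entry.get("trial-resources", [])]
        elif msg.startswith("Job watch error"):
            unwatched.append(entry.get("CRD Kind"))
    return {"configured": configured, "unwatched": unwatched, "panicked": panicked}


# Shortened sample built from entries in the log above.
sample = "\n".join([
    '{"level":"info","msg":"Config:","trial-resources":'
    '[{"Group":"batch","Version":"v1","Kind":"Job"},'
    '{"Group":"kubeflow.org","Version":"v1","Kind":"MPIJob"}]}',
    '{"level":"info","logger":"trial-controller","msg":"Job watch error. '
    'CRD might be missing. Please install CRD and restart katib-controller",'
    '"CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"MPIJob"}',
    "panic: runtime error: invalid memory address or nil pointer dereference [recovered]",
])
print(triage(sample))
```

Run against the full log, the same function would list MPIJob and PipelineRun as unwatched, matching the two "Job watch error" entries.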

@misohu misohu added the bug Something isn't working label Aug 18, 2022