AutoML jobs time out in pipeline until katib-controller is restarted #459
Comments
Hmm, interesting. I've hit the timeout 100% of the time so far.
Yup, kfserving failure is expected.
I can't think of anything off the top of my head. I just used the default values, and it shouldn't affect the pipeline execution since it just creates a pipeline rather than running it, if I'm not mistaken. Just to confirm, have you used microk8s?
Yes, I used microk8s 1.21/stable.
Hmm, so the key to working around the issue might be restarting/recreating the katib-controller pod. It's puzzling why it's reproducible on my testbed but not in the other environment, though.
The controller was restarted around 15:31.
I've reproduced it successfully in a clean environment. The steps are almost identical to the ones in the description. Hope it helps you reproduce it on your end, @natalian98
Then open
Thanks for providing the detailed steps @nobuto-m. I deployed kubeflow once more using your instructions and the trials still succeed, provided that the notebook's CPU and memory values are increased. When using the default CPU=0.5 and memory=1Gi, some garbage collection errors can be observed, which may be the reason why the trials get stuck. Katib-controller wasn't restarted. After creating a notebook with 2 CPUs and 4Gi memory, the trials succeed.
Thanks for testing. I bumped CPU and memory to 4 CPUs and 8Gi of memory, but it still doesn't work for me. Also, I'm confused because I thought the notebook instance was irrelevant to the pipeline run. The pipeline is defined from the notebook, but the notebook instance can be deleted before re-running the pipeline, if I'm not mistaken. So I'm wondering how the spec of the notebook instance affects the pipeline. Am I missing something?
Which log did you see this in? It might not be the notebook instance, but if there was any error from any component, we can dig in.
I am experiencing a similar failure. katib-controller would deploy successfully, but the unit would then go into CrashLoopBackOff after submitting an AutoML experiment with all the default configurations (pod logs provided below). Logs from katib-controller:
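For anyone trying to collect the same controller logs, a minimal sketch, assuming a microk8s/Juju deployment where katib-controller runs in the kubeflow model namespace (the pod name below is a placeholder to fill in from the first command):

# Find the katib-controller pod (assumption: the charm deploys it into the kubeflow namespace under a name containing "katib-controller").
microk8s kubectl -n kubeflow get pods | grep katib-controller
# Current logs, plus logs from the previous (crashed) container instance.
microk8s kubectl -n kubeflow logs <katib-controller-pod> --tail=200
microk8s kubectl -n kubeflow logs <katib-controller-pod> --previous --tail=200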
How to reproduce (based on the quickstart doc):
juju bootstrap microk8s
juju add-model kubeflow
juju deploy --trust kubeflow
juju config dex-auth public-url=http://10.64.140.43.nip.io
juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io
juju config dex-auth static-username=admin
juju config dex-auth static-password=admin
Create a notebook with access-ml-pipeline: "true" (the PodDefault for "Allow access to Kubeflow Pipelines" in the Kubeflow UI), then run the example notebook:
https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb
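While the pipeline runs, the objects it creates can be watched from the CLI. A minimal sketch, assuming the experiment lands in the admin profile namespace (the static-username configured above); adjust the namespace to wherever the pipeline pods actually run:

# Katib experiment, suggestion, and trial objects created by the pipeline.
microk8s kubectl -n admin get experiments,suggestions,trials
# Argo workflows backing the run, and the trial pods themselves.
microk8s kubectl -n admin get workflows
microk8s kubectl -n admin get pods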
Expected:
All trials are complete.
Solution/workaround:
Once Workflow.v1alpha1.argoproj.io is added to the trial-resources of katib-controller, all trials complete:
https://github.com/kubeflow/katib/blob/master/examples/v1beta1/argo/README.md#katib-controller
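A minimal sketch of that workaround, assuming katib-controller runs as a Deployment of that name in the kubeflow namespace, with the controller container first in the pod spec and an existing args list; since the charm manages this workload, an upgrade or re-deploy may revert the change:

# Allow the Katib controller to manage Argo Workflows as trial resources
# (see the argo README linked above).
microk8s kubectl -n kubeflow patch deployment katib-controller --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/args/-",
   "value": "--trial-resources=Workflow.v1alpha1.argoproj.io"}
]'
# Rolling out the patch also recreates the katib-controller pod, which is
# relevant to the "red herring" point below.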
Since both katib-controller and argo are managed by Juju charms, it would be good to see some improvements in this user scenario. For the record, Katib Metrics Collector sidecar injection is enabled out of the box.
OR
Simply restart katib-controller without changing anything. Adding Workflow.v1alpha1.argoproj.io might be a red herring, since applying it actually recreates the pod.
[out of the box]
[patched]
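A minimal sketch of the restart-only workaround, under the same assumption that the controller runs as a Deployment named katib-controller in the kubeflow namespace (adjust the resource type/name if the charm deploys it differently):

# Recreate the katib-controller pod without changing any configuration.
microk8s kubectl -n kubeflow rollout restart deployment/katib-controller
# Or simply delete the pod and let its controller bring it back:
# microk8s kubectl -n kubeflow delete pod <katib-controller-pod>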