AutoML WG and Kubeflow 1.5 release #2106

Closed
DnPlas opened this issue Jan 19, 2022 · 13 comments

@DnPlas
Contributor

DnPlas commented Jan 19, 2022

@kubeflow/wg-automl-leads let's use this tracking issue to coordinate the integration of AutoML with the Kubeflow 1.5 release.

First off, a heads up that the feature freeze phase will start on Tuesday (25th January). Before then, I'd like to update this repo with the manifests from the kubeflow/katib repo so that we can cut the first RC tag in this repo.

So what I'd like to ask as a first step before the feature freeze is:

  1. What version of Katib would you like to include for the 1.5 release?
  2. Could you provide me with a branch/tag for this version? It doesn't have to be final. The branch/tag provided can keep getting fixes throughout the release process, but not new features.
  3. Are there any open issues/work in progress that you will be working on for your version while the KF release process progresses?
  4. What will the K8s supported versions be for kubeflow/katib?
@kimwnasptd
Member

From the versioning issue we had, we know we are targeting 0.13 (#2098 (comment)). @kubeflow/wg-automl-leads let's use this issue for further updates, new tags, in-progress issues, etc.

@DomFleischmann
Contributor

Hi @kubeflow/wg-automl-leads, before the manifest testing on Wednesday, Feb 9th, the release team is planning to cut another RC to use for the testing.

Based on a previous communication, the release team will be using AutoML version 0.13rc0. If the AutoML WG have identified any issues since the feature freeze and would like to update the AutoML version before the manifest testing, let us know before Feb. 9th. Thank you!

@andreyvelich

@kimwnasptd
Member

After syncing in today's AutoML meeting, we will keep using the 0.13-rc0 tag for RC1 of the manifests. A newer RC might be cut for the kubeflow/katib repo later on, in case more issues arise.

One more note: the @kubeflow/wg-automl-leads will update the kubeflow/katib e2e tests to use the v1.5-branch branch of the manifests. This means that the e2e tests will run against the latest training operators, so we'll be keeping an eye on any issues that arise.

@yhwang
Member

yhwang commented Feb 9, 2022

I deployed Kubeflow from v1.5-branch and ran this example: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb
I ran into this issue: kubeflow/katib#1795

I found the metric collector is not injected into the trial pod:

mnist-e2e-jxnc28x2-chief-0                                        0/1     Completed 
mnist-e2e-jxnc28x2-worker-0                                       0/1     Completed

Does anyone have the same issue? I'm not sure if this is the right place to discuss/report this.

BTW, the early-stopping sample works well, and I do see that the metric collector container was injected:

median-stop-new2-nxh6jbn7-h7h48                                   0/2     Completed 
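
For anyone checking their own cluster, here is a minimal sketch of the comparison above: the second number in the READY column is the total container count, so `0/1` means no sidecar was added, while `0/2` means the collector is present. The helper name `pods_missing_sidecar` and the `expected_containers` default are hypothetical, and the input is assumed to be pasted `kubectl get pods` output.

```python
from typing import List

# Pod listings copied from this comment (name, READY, STATUS columns).
SAMPLE = """\
mnist-e2e-jxnc28x2-chief-0        0/1     Completed
mnist-e2e-jxnc28x2-worker-0       0/1     Completed
median-stop-new2-nxh6jbn7-h7h48   0/2     Completed
"""


def pods_missing_sidecar(kubectl_output: str, expected_containers: int = 2) -> List[str]:
    """Return pod names whose READY column shows fewer containers than expected.

    Hypothetical helper: it only parses text, it does not talk to the cluster.
    """
    missing = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/READY/STATUS header row.
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        name, ready = fields[0], fields[1]
        total_containers = int(ready.split("/")[1])
        if total_containers < expected_containers:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(pods_missing_sidecar(SAMPLE))
```

As far as I understand, only the trial's primary pod is expected to carry the collector, so a worker pod showing a single container is normal; the chief entry is the one that points to the problem.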

@kimwnasptd
Member

Thanks for raising this, @yhwang! I also bumped into this when writing the e2e tests.

The fix for this should be to use training.kubeflow.org/job-role: master as the PrimaryPodLabel. Here's how I did it in the codified version of the above notebook:
https://github.com/kubeflow/manifests/pull/2128/files#diff-ba317d8735e3ac6c584fe8dc196fddb304ad5e548b94599c35eeb59bcfa8e89eR159

We also discussed this in this week's AutoML meeting, and we'll expose the full list of annotations/changes users need to keep in mind for the new 1.4 version of the Training Operators.
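
As a rough illustration of the fix described above, the sketch below builds the trial template section of an Experiment as a plain Python dict. The `primaryPodLabels` and `primaryContainerName` field names are my understanding of the Katib v1beta1 trial template; the helper name `build_trial_template`, the `tensorflow` container name, and the example arguments are assumptions for illustration, so check them against the linked manifest before relying on them.

```python
from typing import Any, Dict, List

# Label from this thread that identifies the primary pod of a training job.
PRIMARY_POD_LABEL = {"training.kubeflow.org/job-role": "master"}


def build_trial_template(
    trial_spec: Dict[str, Any],
    trial_parameters: List[Dict[str, str]],
    primary_container_name: str = "tensorflow",  # assumption: TFJob container name
) -> Dict[str, Any]:
    """Wrap a job spec in a trial template that marks the primary pod.

    Hypothetical helper: it only assembles a dict and does not call any API.
    """
    return {
        "primaryContainerName": primary_container_name,
        # Without a matching label, the metrics collector is not injected.
        "primaryPodLabels": dict(PRIMARY_POD_LABEL),
        "trialParameters": trial_parameters,
        "trialSpec": trial_spec,
    }


if __name__ == "__main__":
    # Placeholder job spec; the real one comes from your Experiment.
    template = build_trial_template(
        trial_spec={"apiVersion": "kubeflow.org/v1", "kind": "TFJob"},
        trial_parameters=[],
    )
    print(template["primaryPodLabels"])
```

The resulting dict can be serialized to YAML or passed along to whatever client you use to create the Experiment.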

@yhwang
Member

yhwang commented Feb 10, 2022

Thanks @kimwnasptd. I tried training.kubeflow.org/job-role: master and the metric collector is now injected. However, only the first trial finished, and no subsequent trials were scheduled. The experiment is still in the Running state but is making no further progress. Do you have the same issue?

@kimwnasptd
Member

I haven't bumped into this. In my case, with a KinD 1.20 cluster, all the trials reached the Succeeded state after running the test https://github.com/kubeflow/manifests/blob/master/tests/e2e/runner.sh.

Can you open a separate issue in the kubeflow/katib repo so that we can dig deeper into it?

I'll also start using Prow for the e2e tests with AWS clusters in the manifests repo. I'll give a heads up if I bump into this.

@yhwang
Member

yhwang commented Feb 15, 2022

I forgot to update you on my latest Katib status. The problem seems to have been a TFJob from a previous run that got stuck in a weird state. After I removed that job, Katib works well. Thanks for the script and the hint.

@kimwnasptd
Member

@andreyvelich @johnugeorge @gaocegege I'm working on finalizing the manifests for the release, as we are getting closer to the release date of March 9th.

Regarding the kubeflow/katib repo, when are you planning to cut the final v0.13 tag? Could you do it within this week so that we can get the manifests closer to their final state?

@johnugeorge
Member

johnugeorge commented Mar 1, 2022

@kimwnasptd, we will do it this week.

@kimwnasptd
Member

Just saw it's ready. Congrats on the release 🎉

@shannonbradshaw

shannonbradshaw commented Mar 7, 2022

Hey folks, any docs changes required as a result of this work? Please create an issue and mention it on this tracking issue.
kubeflow/website#3130

@DnPlas
Contributor Author

DnPlas commented Apr 25, 2023

This effort has been finalised.

DnPlas closed this as completed Apr 25, 2023