Refactor Examples folder structure #1691

andreyvelich · 2021-10-02T00:32:07Z

As we discussed at an earlier AutoML community meeting, we should refactor our examples structure to make it more user-friendly.

This PR proposes the following folder structure for our examples:

argo - Katib with Argo integration.
early-stopping - Early Stopping examples.
fpga - Katib with FPGA integration.
hp-tuning - HP Tuning examples. What do you think about the naming (<algorithm-name>.yaml)?
kind-cluster - Get Started example to run Katib on a local laptop. This was a popular request from users.
kubeflow-pipelines - Katib with KFP.
kubeflow-training-operator - Katib with Training Operators.
metrics-collector - Various metrics collector examples.
nas - NAS examples.
resume-experiment - Examples of the resume Experiment feature.
sdk - Katib Python SDK examples.
tekton - Katib with Tekton integration.

We should discuss the following directories, since their names might not be clear to users:

trial-settings - Examples with various Trial template specifications.
trial-training-containers - Training container examples.

I updated the docs/links to point to the new examples.
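
Because every example moves in this change, relative links in the examples README are easy to break. As a rough sketch only (it is not part of this PR), a stdlib-only Python check along these lines could list relative Markdown links whose targets no longer exist. The function name `broken_relative_links` is a hypothetical helper introduced here for illustration.

```python
import re
from pathlib import Path
from typing import List

# Matches inline Markdown links such as [text](./some/path.yaml).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")

# Matches targets that start with a URL scheme (https:, mailto:, ...).
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(readme: Path) -> List[str]:
    """Return relative link targets in `readme` that do not exist on disk."""
    missing = []
    text = readme.read_text(encoding="utf-8")
    for target in LINK_RE.findall(text):
        # Only local paths are checked; URLs and in-page anchors are skipped.
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue
        # Drop a trailing "#section" fragment before testing the path.
        path = target.split("#", 1)[0]
        if not (readme.parent / path).exists():
            missing.append(target)
    return missing
```

Calling `broken_relative_links(Path("examples/v1beta1/README.md"))` from the repository root would return an empty list when every relative link resolves.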

Please give your feedback on this PR: it is a very significant change, and the result has to be clear for our users.

/cc @kubeflow/wg-training-leads @kimwnasptd @jbottum @shannonbradshaw @anencore94 @tenzen-y @c-bata @g-votte @eliaskoromilas @jstamel @knkski

review-notebook-app · 2021-10-02T00:32:11Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

google-oss-robot · 2021-10-02T00:32:14Z

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: g-votte.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-robot · 2021-10-02T00:32:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tenzen-y

Great work!

andreyvelich · 2021-10-04T13:25:41Z

examples/v1beta1/README.md

+Katib has out of the box support for the [Kubeflow Training Operators](https://github.com/kubeflow/tf-operator) to
+perform Trial's [Worker job](https://www.kubeflow.org/docs/components/katib/overview/#trial).
+Check the following examples for the various distributed operators:


I think we also supported the previous versions for some of them.
For example, TFJob and PyTorchJob were integrated into Katib in 2018.

> I think we also supported the previous versions for some of them.

It makes sense.

I think we need to state the training-operator version, since I guess the following examples only work with training-operator >= 1.3. WDYT @andreyvelich?

katib/examples/v1beta1/kubeflow-training-operator/mpijob-horovod.yaml
Lines 41 to 43 in 9f15ec5

trialSpec:
  apiVersion: kubeflow.org/v1
  kind: MPIJob

katib/examples/v1beta1/kubeflow-training-operator/pytorchjob-mnist.yaml
Lines 36 to 38 in 9f15ec5

trialSpec:
  apiVersion: kubeflow.org/v1
  kind: PyTorchJob

katib/examples/v1beta1/kubeflow-training-operator/tfjob-mnist-with-summaries.yaml
Lines 43 to 45 in 9f15ec5

trialSpec:
  apiVersion: kubeflow.org/v1
  kind: TFJob

katib/examples/v1beta1/kubeflow-training-operator/xgboostjob-lightgbm.yaml
Lines 46 to 48 in 9f15ec5

trialSpec:
  apiVersion: kubeflow.org/v1
  kind: XGBoostJob

For TFJob and PyTorchJob it should also work with the 1.1 and 1.2 releases, but we can keep support only for versions >= 1.3.
@kubeflow/wg-training-leads Do we want to keep support for previous versions of the Training Operators in Katib?

Sorry, @andreyvelich.
I was able to verify that the above examples work with pytorch-operator, so I withdraw the following comment.

#1691 (comment)

> I think we need to state the training-operator version, since I guess the following examples only work with training-operator >= 1.3. WDYT @andreyvelich?
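
To see at a glance which job kinds and API versions the examples depend on, a quick scan of the manifests can help. The following is a minimal sketch, assuming `apiVersion` and `kind` sit on the two lines directly after `trialSpec:`, as in the snippets quoted above. It uses a regular expression rather than a YAML parser, so it is a rough check and not a validator, and `trial_spec_kinds` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Assumes "apiVersion" and "kind" immediately follow "trialSpec:",
# which matches the snippets quoted in this thread.
TRIAL_SPEC_RE = re.compile(
    r"^[ \t]*trialSpec:[ \t]*\n"
    r"[ \t]+apiVersion:[ \t]*(?P<api>\S+)[ \t]*\n"
    r"[ \t]+kind:[ \t]*(?P<kind>\S+)",
    re.MULTILINE,
)


def trial_spec_kinds(manifest: str) -> List[Tuple[str, str]]:
    """Return (apiVersion, kind) pairs for each trialSpec found in the text."""
    return [(m.group("api"), m.group("kind")) for m in TRIAL_SPEC_RE.finditer(manifest)]
```

Running this over each file under kubeflow-training-operator would show, for instance, whether any example still targets something other than kubeflow.org/v1.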

google-cla · 2021-10-04T13:37:55Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment `@googlebot I fixed it.` If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

tenzen-y · 2021-10-04T13:40:44Z

@googlebot I fixed it.

tenzen-y · 2021-10-04T13:50:37Z

@googlebot I fixed it.

tenzen-y · 2021-10-04T17:57:50Z

lgtm

anencore94 · 2021-10-05T06:55:20Z

examples/v1beta1/README.md

+
+- [Metrics Collection Strategy](./metrics-collector/metrics-collection-strategy.yaml)
+
+## TODO (andreyvelich) Discuss about this name. Trial Settings


What about just Trial Template, or Setup Trial Template?
I think including the word template would be better.

In that case, does Trial Training Containers sound good for the folder where we add the training code?

WDYT @johnugeorge about the name trial-template instead of trial-settings?

That sounds better.

anencore94 · 2021-10-05T07:05:01Z

examples/v1beta1/README.md

+## FPGA Support in Katib Experiments
+
+You can run Katib Experiments on [FPGA](https://en.wikipedia.org/wiki/Field-programmable_gate_array)
+based instances. For more information check [these examples](./fpga).


I think this part should sit one level deeper in the hierarchy, for example FPGA Support under Use cases in Katib Experiments.
If more use cases arrive (an ASIC placement example, a computer vision example, an NLP example, ...), the current structure leaves all of them flat in this README.md.

That's a good idea. What do you think, @johnugeorge @gaocegege @eliaskoromilas?

I would preserve the directory structure, keeping all the use cases directly under examples/v1beta1. Grouping them in this README sounds good.

I think we should decide what counts as a use case. Do the Argo and Tekton Experiments also relate to use cases?
Maybe we should decide that in a follow-up issue and keep the structure as it is for now?

WDYT @johnugeorge @eliaskoromilas @anencore94?

Personally, I think the term use-case should be reserved for domain-specific examples:
something like a Katib Experiment example for FPGA, a Katib Experiment for BERT, a Katib Experiment for Deepfake, and so on.

Argo and Tekton Experiments instead represent the variety of trialSpecs.

anencore94 · 2021-10-05T07:08:31Z

examples/v1beta1/metrics-collector/custom-metrics-collector.yaml

 # TODO (andreyvelich) This metrics collector image (kubeflowkatib/custom-metrics-collector) doesn't work in v1beta1.
 # It is currently using api.v1.alpha3.Manager instead of api.v1.beta1.Manager to report metrics.


Has this issue been resolved?

Not yet, this is the tracking issue: #1263.

Thanks :) I got it.

eliaskoromilas · 2021-10-06T19:13:39Z

@andreyvelich I just noticed a typo in fpga/README.md (~~Serice~~ Service).

https://github.com/andreyvelich/katib/tree/refactor-example-folder/examples/v1beta1/fpga#simplifying-fpga-management-in-eks-elastic-kubernetes-serice

Feel free to fix this.

andreyvelich · 2021-10-06T19:23:11Z

I think I fixed this here: #1688.

anencore94 · 2021-10-07T04:15:19Z

> Based on the feedback and discussion, I went with the following:
>
> trial-template: Examples with Trial template modifications.
> trial-images: Trial images which are located in the Katib repository (I also moved the NAS Trials there).

It looks better, thanks!

tenzen-y · 2021-10-07T05:24:03Z

@googlebot I consent.

google-cla · 2021-10-07T05:24:24Z

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only `@googlebot I consent.` in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the `cla` label to `yes` (if enabled on your project).

ℹ️ Googlers: Go here for more info.

tenzen-y · 2021-10-07T05:25:12Z

> Based on the feedback and discussion, I went with the following:
>
> trial-template: Examples with Trial template modifications.
> trial-images: Trial images which are located in the Katib repository (I also moved the NAS Trials there).
>
> @johnugeorge @tenzen-y @eliaskoromilas @anencore94 Please let me know what you think about it. If it sounds good, we can merge this initial examples change.
>
> /hold for the review

Sounds good.
/retest

eliaskoromilas · 2021-10-07T05:31:28Z

> @johnugeorge @tenzen-y @eliaskoromilas @anencore94 Please let me know what you think about it. If it sounds good, we can merge this initial examples change.

/lgtm

tenzen-y

I think the Paths for mxnet and pytorch are wrong.

pytorch

INFO[0003] Unpacking rootfs as cmd ADD examples/v1beta1/trial-training-containers/pytorch-mnist /opt/pytorch-mnist requires it.
error building image: error building stage: failed to get files used from context: failed to get fileinfo for /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8da88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/pytorch-mnist: lstat /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8da88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/pytorch-mnist: no such file or directory

mxnet

INFO[0003] Unpacking rootfs as cmd ADD examples/v1beta1/trial-training-containers/mxnet-mnist /opt/mxnet-mnist requires it.
error building image: error building stage: failed to get files used from context: failed to get fileinfo for /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8da88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/mxnet-mnist: lstat /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8da88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/mxnet-mnist: no such file or directory

examples/v1beta1/trial-images/mxnet-mnist/Dockerfile

examples/v1beta1/trial-images/pytorch-mnist/Dockerfile
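
The build errors above come from ADD instructions that still point at the old trial-training-containers directory after the move to trial-images. A check like the following could catch that kind of stale path before the presubmit runs. This is only a sketch under stated assumptions: it handles simple shell-form `ADD <src> <dest>` and `COPY <src> <dest>` lines, ignores flags, multiple sources, JSON-form instructions, and wildcards, and `missing_sources` is a hypothetical helper name.

```python
import re
from pathlib import Path
from typing import List

# Matches simple shell-form "ADD <src> <dest>" and "COPY <src> <dest>" lines.
ADD_COPY_RE = re.compile(
    r"^[ \t]*(?:ADD|COPY)[ \t]+(\S+)[ \t]+\S+[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def missing_sources(dockerfile_text: str, context: Path) -> List[str]:
    """Return ADD/COPY sources that do not exist under the build context."""
    missing = []
    for src in ADD_COPY_RE.findall(dockerfile_text):
        # Remote URLs are fetched by the builder, so they are not checked here.
        if "://" in src:
            continue
        # Sources are resolved relative to the build context root.
        if not (context / src.lstrip("/")).exists():
            missing.append(src)
    return missing
```

With the repository root as `context`, the pytorch-mnist and mxnet-mnist Dockerfiles from the failing run would each report their trial-training-containers source as missing.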

andreyvelich · 2021-10-07T11:31:11Z

Nice catch @tenzen-y!

tenzen-y · 2021-10-07T11:31:55Z

@googlebot I consent.

andreyvelich · 2021-10-07T11:37:00Z

@googlebot I consent.

andreyvelich · 2021-10-07T11:50:45Z

@eliaskoromilas Could you please comment `@googlebot I consent.`?

tenzen-y · 2021-10-07T12:24:57Z

/lgtm

eliaskoromilas · 2021-10-07T12:30:37Z

@googlebot I consent.

andreyvelich · 2021-10-07T12:37:16Z

Thanks, everyone, for your help on this PR!
/hold cancel

google-oss-robot added the do-not-merge/work-in-progress label Oct 2, 2021

google-oss-robot requested review from eliaskoromilas, kimwnasptd, shannonbradshaw and c-bata October 2, 2021 00:32

google-oss-robot requested review from tenzen-y, jstamel, knkski and jbottum October 2, 2021 00:32

google-oss-robot requested a review from anencore94 October 2, 2021 00:32

google-oss-robot added the size/XXL and approved labels Oct 2, 2021

tenzen-y reviewed Oct 3, 2021

eliaskoromilas reviewed Oct 3, 2021

examples/v1beta1/README.md (outdated, resolved)

eliaskoromilas reviewed Oct 3, 2021

docs/images-location.md (outdated, resolved)

tenzen-y reviewed Oct 4, 2021

examples/v1beta1/kind-cluster/README.md (outdated, resolved)

anencore94 reviewed Oct 5, 2021

andreyvelich changed the title ~~[WIP] Refactor Examples folder structure~~ Refactor Examples folder structure Oct 6, 2021

google-oss-robot assigned eliaskoromilas Oct 7, 2021

google-oss-robot added the lgtm label Oct 7, 2021

tenzen-y reviewed Oct 7, 2021

examples/v1beta1/trial-images/mxnet-mnist/Dockerfile (outdated, resolved)

examples/v1beta1/trial-images/pytorch-mnist/Dockerfile (outdated, resolved)

Update examples/v1beta1/trial-images/mxnet-mnist/Dockerfile

8881113

Co-authored-by: Yuki Iwai

google-oss-robot removed the lgtm label Oct 7, 2021

Update examples/v1beta1/trial-images/pytorch-mnist/Dockerfile

2dd70ff

Co-authored-by: Yuki Iwai

google-oss-robot assigned tenzen-y Oct 7, 2021

google-oss-robot added the lgtm label Oct 7, 2021

google-oss-robot removed the do-not-merge/hold label Oct 7, 2021

google-oss-robot merged commit 983a867 into kubeflow:master Oct 7, 2021

andreyvelich deleted the refactor-example-folder branch October 7, 2021 12:43

This was referenced Oct 7, 2021

Katib: Fix links for all examples kubeflow/website#3018

Merged

Implement some unit tests for the katibconfig package #1690

Merged

