Refactor Examples folder structure #1691
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: g-votte. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich
Great work!
examples/v1beta1/README.md
Outdated
Katib has out of the box support for the [Kubeflow Training Operators](https://github.com/kubeflow/tf-operator) to
perform Trial's [Worker job](https://www.kubeflow.org/docs/components/katib/overview/#trial).
Check the following examples for the various distributed operators:
I think we also supported the previous versions for some of them.
For example, TFJob and PyTorchJob were integrated into Katib in 2018.
> I think we also supported the previous versions for some of them.

That makes sense.
I think it needs to describe the training-operator version, since I guess the following examples only work with training-operator >= 1.3. WDYT @andreyvelich?
katib/examples/v1beta1/kubeflow-training-operator/mpijob-horovod.yaml
Lines 41 to 43 in 9f15ec5
trialSpec:
  apiVersion: kubeflow.org/v1
  kind: MPIJob

trialSpec:
  apiVersion: kubeflow.org/v1
  kind: PyTorchJob

katib/examples/v1beta1/kubeflow-training-operator/tfjob-mnist-with-summaries.yaml
Lines 43 to 45 in 9f15ec5
trialSpec:
  apiVersion: kubeflow.org/v1
  kind: TFJob

katib/examples/v1beta1/kubeflow-training-operator/xgboostjob-lightgbm.yaml
Lines 46 to 48 in 9f15ec5
trialSpec:
  apiVersion: kubeflow.org/v1
  kind: XGBoostJob
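
The snippets above all declare the worker job through `trialSpec.apiVersion` and `trialSpec.kind`. As a rough sketch only, the following Python shows one way to read that pair out of an already-parsed Experiment. The helper name `trial_job_type` and the trimmed `example_experiment` dictionary are hypothetical and are not part of Katib or its SDK; the sketch assumes the `trialSpec` sits under `spec.trialTemplate`, as in the v1beta1 examples.

```python
from typing import Any, Mapping, Tuple


def trial_job_type(experiment: Mapping[str, Any]) -> Tuple[str, str]:
    """Return (apiVersion, kind) of the Trial's worker job in a parsed Experiment.

    Hypothetical helper for illustration; it is not a Katib API.
    """
    trial_spec = experiment["spec"]["trialTemplate"]["trialSpec"]
    return trial_spec["apiVersion"], trial_spec["kind"]


# Hypothetical, heavily trimmed Experiment. Real manifests carry many more
# fields (objective, algorithm, parameters, and the full job spec).
example_experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "spec": {
        "trialTemplate": {
            "trialSpec": {"apiVersion": "kubeflow.org/v1", "kind": "MPIJob"},
        },
    },
}

print(trial_job_type(example_experiment))
```

A check like this could be used to list which operator kinds a set of examples depends on before deciding which operator versions to document.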
For TFJob and PyTorchJob, it should also work with the 1.1 and 1.2 releases, but we can keep support only for versions >= 1.3.
@kubeflow/wg-training-leads Do we want to keep support for previous versions of the Training Operators in Katib?
Sorry, @andreyvelich.
I was able to verify that the above examples work with pytorch-operator. I withdraw the following comment.

> I think it needs to describe the training-operator version, since I guess the following examples only work with training-operator >= 1.3. WDYT @andreyvelich?
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
@googlebot I fixed it.
lgtm
examples/v1beta1/README.md
Outdated
- [Metrics Collection Strategy](./metrics-collector/metrics-collection-strategy.yaml)

## TODO (andreyvelich) Discuss about this name. Trial Settings
What about just `Trial Template`, or `Setup Trial Template`?
I think including the word `template` would be better.
In that case, does the `Trial Training Containers` folder, where we add training code, sound good?
WDYT @johnugeorge about the name `trial-template` instead of `trial-settings`?
that sounds better
## FPGA Support in Katib Experiments

You can run Katib Experiments on [FPGA](https://en.wikipedia.org/wiki/Field-programmable_gate_array)
based instances. For more information check [these examples](./fpga).
I think this part should sit one level deeper in the hierarchy, such as `FPGA Support` under `Use cases in Katib Experiments`.
If more use cases arrive, like an ASIC placement example, a computer vision example, an NLP example, ..., the current structure would leave all of those examples at the top level of this README.md.
That's a good idea. What do you think, @johnugeorge @gaocegege @eliaskoromilas?
Good idea!
I would preserve the directory structure, keeping all the use-cases directly under `examples/v1beta1`. Grouping them in this `README` sounds good.
I think we should decide what counts as `use-cases`. Do the Argo and Tekton Experiments also belong to the use-cases?
Maybe we should decide about it in a follow-up issue and keep the structure as it is for now?
Personally, I think the word use-case should refer to domain-specific examples,
something like a Katib Experiment example for FPGA, a Katib Experiment for BERT, a Katib Experiment for Deepfake, and so on.
Argo and Tekton Experiments instead represent the variety of trialSpecs.
# TODO (andreyvelich) This metrics collector image (kubeflowkatib/custom-metrics-collector) doesn't work in v1beta1.
# It is currently using api.v1.alpha3.Manager instead of api.v1.beta1.Manager to report metrics.
Has this issue been resolved?
Not yet, this is the tracking issue: #1263.
Thanks :) I got it.
@andreyvelich I just noticed a typo in Feel free to fix this.
I think I fixed this here: #1688.
It looks better, thanks!
@googlebot I consent.
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the
Sounds good.
/lgtm
I think the paths for mxnet and pytorch are wrong.
INFO[0003] Unpacking rootfs as cmd ADD examples/v1beta1/trial-training-containers/pytorch-mnist /opt/pytorch-mnist requires it.
error building image: error building stage: failed to get files used from context: failed to get fileinfo for /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8da88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/pytorch-mnist: lstat /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8da88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/pytorch-mnist: no such file or directory
INFO[0003] Unpacking rootfs as cmd ADD examples/v1beta1/trial-training-containers/mxnet-mnist /opt/mxnet-mnist requires it.
error building image: error building stage: failed to get files used from context: failed to get fileinfo for /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8da88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/mxnet-mnist: lstat /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8da88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/mxnet-mnist: no such file or directory
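
The log above reports `ADD` sources that no longer exist in the build context after the folders were moved. A minimal sketch of how such stale paths could be caught before the image build runs is shown below. Everything here is an assumption for illustration: the function `missing_sources`, the `demo` helper, the `python:3.9` base image, and the `hypothetical-dir` folder are not taken from the Katib repository or its CI. The sketch only handles the plain `ADD src dest` / `COPY src dest` form (no JSON array syntax, wildcards, or line continuations).

```python
import os
import tempfile
from typing import List


def missing_sources(dockerfile_text: str, context_dir: str) -> List[str]:
    """Return ADD/COPY source paths that do not exist under context_dir."""
    missing = []
    for raw_line in dockerfile_text.splitlines():
        parts = raw_line.strip().split()
        # Only instructions of the form "ADD|COPY [flags] src... dest" are checked.
        if len(parts) < 3 or parts[0].upper() not in ("ADD", "COPY"):
            continue
        args = parts[1:]
        # Sources copied from another build stage are not in the local context.
        if any(arg.startswith("--from=") for arg in args):
            continue
        # The last argument is the destination; flags such as --chown are skipped.
        sources = [arg for arg in args[:-1] if not arg.startswith("--")]
        for src in sources:
            if "://" in src:  # remote URLs are not checked
                continue
            if not os.path.exists(os.path.join(context_dir, src)):
                missing.append(src)
    return missing


def demo() -> List[str]:
    """Build a throwaway context where the folder lives under a different name."""
    dockerfile = (
        "FROM python:3.9\n"  # hypothetical base image
        "ADD examples/v1beta1/trial-training-containers/pytorch-mnist /opt/pytorch-mnist\n"
    )
    with tempfile.TemporaryDirectory() as context:
        # The folder exists, but not at the path the Dockerfile refers to.
        os.makedirs(
            os.path.join(context, "examples", "v1beta1", "hypothetical-dir", "pytorch-mnist")
        )
        return missing_sources(dockerfile, context)


print(demo())
```

Running a check of this kind over every Dockerfile touched by a folder rename would surface the mismatch locally instead of in the presubmit build.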
Co-authored-by: Yuki Iwai <[email protected]>
Nice catch @tenzen-y!
@googlebot I consent.
@googlebot I consent.
@eliaskoromilas Please can you comment
/lgtm
@googlebot I consent.
Thanks everyone for help on this PR!
As we discussed at the AutoML community meeting before, we should refactor our example structure to be more user-friendly.
This PR proposes the following folder structure for our examples:

- `argo` - Katib with Argo integration.
- `early-stopping` - Early Stopping examples.
- `fpga` - Katib with FPGA integration.
- `hp-tuning` - HP Tuning examples. What do you think about the naming (`<algorithm-name>.yaml`)?
- `kind-cluster` - Get Started example to run Katib from a local laptop. This was a popular request from users.
- `kubeflow-pipelines` - Katib with KFP.
- `kubeflow-training-operator` - Katib with Training Operators.
- `metrics-collector` - Different metrics collector examples.
- `nas` - NAS examples.
- `resume-experiment` - Examples with the resume experiment feature.
- `sdk` - Katib Python SDK examples.
- `tekton` - Katib with Tekton integration.

We should discuss the following directories (they might not be clear to users):

- `trial-settings` - Examples with the various Trial template specifications.
- `trial-training-containers` - Training container examples.

I updated the docs/links with the new examples.
Please give your feedback on this PR, since it is a very significant change and we have to be clear for our users.
/cc @kubeflow/wg-training-leads @kimwnasptd @jbottum @shannonbradshaw @anencore94 @tenzen-y @c-bata @g-votte @eliaskoromilas @jstamel @knkski