Add resume policy instructions for Katib experiments #2324
- For a detailed instruction of the Katib Configuration file,
read the [Katib config page](/docs/components/hyperparameter-tuning/katib-config/).
- Read about [Katib Configuration (Katib config)](/docs/components/katib/katib-config/).
for a11y:
- Read about [Katib Configuration (Katib config)](/docs/components/katib/katib-config/).
- Check the [Katib configuration (Katib config)](/docs/components/katib/katib-config/) page.
Suggestion data can be retained in the volume.
When you restart the experiment, suggestion's deployment and service are created and
suggestion statistics can be recovered from the volume.
After the experiment has succeeded, the suggestion's deployment and
After the experiment has succeeded, the suggestion's deployment and
After the experiment is successful, the suggestion's deployment and
or maybe "has finished"? "Successful" can be subjective, IMHO
See the
[from volume policy example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/from-volume-resume.yaml#L18).
for a11y:
See the
[from volume policy example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/from-volume-resume.yaml#L18).
Check the
[`from-volume-resume.yaml`](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/from-volume-resume.yaml#L18)
example to learn more.
WDYT? It's more precise, since it's not a "tutorial" example, just a "hands-on" YAML file.
and [service](https://kubernetes.io/docs/concepts/services-networking/service/)
are deleted and you can't restart the experiment.
Read more about Katib concepts in [overview guide](/docs/components/hyperparameter-tuning/overview/#katib-concepts).
Read more about Katib concepts in the
For a11y:
Read more about Katib concepts in the
Learn more about Katib concepts in the
and [service](https://kubernetes.io/docs/concepts/services-networking/service/)
are deleted and you can't restart the experiment.
Read more about Katib concepts in [overview guide](/docs/components/hyperparameter-tuning/overview/#katib-concepts).
Read more about Katib concepts in the
[overview guide](/docs/components/hyperparameter-tuning/overview/#katib-concepts).

See the [never resume policy example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/never-resume.yaml#L20).
For a11y:
See the [never resume policy example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/never-resume.yaml#L20).
Check the [`never-resume.yaml`](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/never-resume.yaml#L20)
example for more details.
WDYT?

## Resume succeeded experiment

To control various resume policies, you can specify `.spec.resumePolicy` for the experiment.
To control various resume policies, you can specify `.spec.resumePolicy`
for the experiment.
See the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).
See the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).
(Refer to the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).)
While the experiment is running you are able to change trial count parameters.
For example, if you want to decrease the maximum number of
hyperparameter sets that are trained parallel.
This may appear like an incomplete sentence because of the "if" statement. Let's try the following:
While the experiment is running you are able to change trial count parameters.
For example, if you want to decrease the maximum number of
hyperparameter sets that are trained parallel.
While the experiment is running you are able to change trial count parameters.
For example, you can decrease the maximum number of
hyperparameter sets that are trained in parallel.
Note: hyperparam sets can be trained "in parallel" not "parallel", I think.
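To make the trial count change discussed above concrete, here is a minimal sketch of the relevant part of an Experiment spec. It assumes the Katib `v1beta1` field names `parallelTrialCount`, `maxTrialCount`, and `maxFailedTrialCount`. The numeric values are purely illustrative and are not taken from this PR.

```yaml
# Fragment of a Katib Experiment spec; all other fields are omitted.
# The values below are illustrative assumptions.
spec:
  parallelTrialCount: 2    # lowered so fewer hyperparameter sets train in parallel
  maxTrialCount: 12
  maxFailedTrialCount: 3
```

One way to apply such a change to a running experiment is probably `kubectl edit experiment <experiment-name> -n <experiment-namespace>`, with the placeholders filled in for your cluster.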
This page describes in detail how to modify running experiment
and restart succeeded experiment. Follow this guide to know more
about changing the experiment execution process and use various
- "a running experiment" or "running experiments"
- "page" -> "guide" (everything can be a page?)
- "succeeded"? -> "completed" (success in ML experiments can be subjective, but completion (it's over) is more objective, I think)
- Talk to the reader - "you will learn more..."
This page describes in detail how to modify running experiment
and restart succeeded experiment. Follow this guide to know more
about changing the experiment execution process and use various
This guide describes how to modify running experiments
and restart completed experiments. You will learn
about changing the experiment execution process and use various
Follow this guide to known more about changing experiment execution process and use various
This page describes in detail how to modify running experiment
and restart succeeded experiment. Follow this guide to know more
about changing the experiment execution process and use various
resume policies for the Katib experiment.

For details of how to configure and run your experiment, see the guide to
- Grammar
- a11y
For details of how to configure and run your experiment, see the guide to
For details on how to configure and run your experiment, check the guide on
Also, please check the links (e.g. `/docs/components/hyperparameter-tuning/experiment/`), since some may be moving to `.../katib/`, I think?
@@ -142,7 +142,8 @@ These are the fields in the experiment configuration spec:
* **resumePolicy**: Experiment resume policy. Can be one of `LongRunning`, `Never` or `FromVolume`.
Default value is `LongRunning`.
See the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).
See the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).
(Refer to the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).)
For the other Go files referenced in the doc, can you please apply the same logic, if it makes sense to you? It's mainly for a11y, and it also tells the reader why they should "see" the file, since no reason is given originally. I think it can be good practice.
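As an illustration of the `resumePolicy` field described in the hunk above, a minimal sketch of where it sits in an Experiment manifest might look like the following. The experiment name and namespace are hypothetical, and the remaining required fields are omitted. The linked `from-volume-resume.yaml` and `never-resume.yaml` examples remain the authoritative references.

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow          # hypothetical namespace
  name: resume-policy-sketch   # hypothetical name
spec:
  # One of LongRunning (the default), Never, or FromVolume.
  resumePolicy: FromVolume
  # objective, algorithm, parameters, and trialTemplate are omitted in this sketch.
```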
9c71d2e to e1fc105
[APPROVALNOTIFIER] This PR is APPROVED.
This pull-request has been approved by: andreyvelich.
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This PR is ready.
6b8a0b7 to 49856e2
LGTM 👍
/lgtm
Please do a rebase.
49856e2 to 391bfd6
/hold
@gaocegege It's done.
LGTM 👍
Fixes: kubeflow/katib#1292.
Blocked by: #2312.
I've added docs about restarting Katib experiments; please take a look.
/assign @johnugeorge @gaocegege
/cc @RFMVasconcelos @8bitmp3