
Add resume policy instructions for Katib experiments #2324

Merged

Conversation

andreyvelich
Member

Fixes: kubeflow/katib#1292.
Blocked by: #2312.

I've added a doc about restarting Katib experiments; please take a look.

/assign @johnugeorge @gaocegege
/cc @RFMVasconcelos @8bitmp3



- For a detailed instruction of the Katib Configuration file,
read the [Katib config page](/docs/components/hyperparameter-tuning/katib-config/).
- Read about [Katib Configuration (Katib config)](/docs/components/katib/katib-config/).
Contributor

for a11y:

Suggested change
- Read about [Katib Configuration (Katib config)](/docs/components/katib/katib-config/).
- Check the [Katib configuration (Katib config)](/docs/components/katib/katib-config/) page.

Suggestion data can be retained in the volume.
When you restart the experiment, suggestion's deployment and service are created and
suggestion statistics can be recovered from the volume.
After the experiment has succeeded, the suggestion's deployment and
Contributor

Suggested change
After the experiment has succeeded, the suggestion's deployment and
After the experiment is successful, the suggestion's deployment and

or maybe "has finished"? "Successful" can be subjective, IMHO

Comment on lines 123 to 124
See the
[from volume policy example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/from-volume-resume.yaml#L18).
Contributor

for a11y:

Suggested change
See the
[from volume policy example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/from-volume-resume.yaml#L18).
Check the
[`from-volume-resume.yaml`](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/from-volume-resume.yaml#L18)
example to learn more.

WDYT? It's more precise, since it's not a "tutorial" example, just a "hands-on" YAML file.

and [service](https://kubernetes.io/docs/concepts/services-networking/service/)
are deleted and you can't restart the experiment.
Read more about Katib concepts in [overview guide](/docs/components/hyperparameter-tuning/overview/#katib-concepts).
Read more about Katib concepts in the
Contributor

For a11y:

Suggested change
Read more about Katib concepts in the
Learn more about Katib concepts in the

and [service](https://kubernetes.io/docs/concepts/services-networking/service/)
are deleted and you can't restart the experiment.
Read more about Katib concepts in [overview guide](/docs/components/hyperparameter-tuning/overview/#katib-concepts).
Read more about Katib concepts in the
[overview guide](/docs/components/hyperparameter-tuning/overview/#katib-concepts).

See the [never resume policy example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/never-resume.yaml#L20).
Contributor

For a11y:

Suggested change
See the [never resume policy example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/never-resume.yaml#L20).
Check the [`never-resume.yaml`](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/resume-experiment/never-resume.yaml#L20)
example for more details.

WDYT?


## Resume succeeded experiment

To control various resume policies, you can specify `.spec.resumePolicy` for the experiment.
To control various resume policies, you can specify `.spec.resumePolicy`
for the experiment.
See the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).
Contributor

Suggested change
See the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).
(Refer to the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).)

Comment on lines 20 to 22
While the experiment is running you are able to change trial count parameters.
For example, if you want to decrease the maximum number of
hyperparameter sets that are trained parallel.
Contributor

This may appear like an incomplete sentence because of the "if" statement. Let's try the following:

Suggested change
While the experiment is running you are able to change trial count parameters.
For example, if you want to decrease the maximum number of
hyperparameter sets that are trained parallel.
While the experiment is running you are able to change trial count parameters.
For example, you can decrease the maximum number of
hyperparameter sets that are trained in parallel.

Note: hyperparam sets can be trained "in parallel" not "parallel", I think.
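
The trial-count change discussed in this comment could be scripted roughly as follows. This is a minimal sketch, not taken from the PR: it assumes the parallel limit is the `parallelTrialCount` field under `.spec`, that the experiment is edited with a `kubectl patch` merge patch, and that the experiment name and namespace shown are placeholders. The helper names are hypothetical.

```python
import json
import shlex


def trial_count_patch(parallel_trial_count: int) -> dict:
    """Build a merge-patch body that changes only the parallel trial limit.

    Assumption: the limit lives at `.spec.parallelTrialCount`.
    """
    if parallel_trial_count < 1:
        raise ValueError("parallel_trial_count must be at least 1")
    return {"spec": {"parallelTrialCount": parallel_trial_count}}


def kubectl_patch_command(name: str, namespace: str, patch: dict) -> str:
    """Render a shell-safe `kubectl patch` command line for the given patch."""
    return " ".join(
        [
            "kubectl", "patch", "experiment", shlex.quote(name),
            "-n", shlex.quote(namespace),
            "--type=merge",
            "-p", shlex.quote(json.dumps(patch)),
        ]
    )


# Placeholder experiment name and namespace, for illustration only.
patch = trial_count_patch(2)
print(kubectl_patch_command("example-experiment", "kubeflow", patch))
```

Building the patch as a dictionary first keeps the change limited to one field, so the rest of the running experiment's spec is left untouched.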

Comment on lines 8 to 10
This page describes in detail how to modify running experiment
and restart succeeded experiment. Follow this guide to know more
about changing the experiment execution process and use various
Contributor

  • "a running experiment" or "running experiment"
  • "page" -> "guide" (everything can be a page?)
  • "succeeded"? -> "completed" (success in ML experiments can be subjective, but completion (it's over) is more objective, I think)
  • Talk to the reader - "you will learn more..."
Suggested change
This page describes in detail how to modify running experiment
and restart succeeded experiment. Follow this guide to know more
about changing the experiment execution process and use various
This guide describes how to modify running experiments
and restart completed experiments. You will learn
about changing the experiment execution process and use various

Follow this guide to known more about changing experiment execution process and use various
This page describes in detail how to modify running experiment
and restart succeeded experiment. Follow this guide to know more
about changing the experiment execution process and use various
resume policies for the Katib experiment.

For details of how to configure and run your experiment, see the guide to
Contributor

  • Grammar
  • a11y
Suggested change
For details of how to configure and run your experiment, see the guide to
For details on how to configure and run your experiment, check the guide on

Contributor

Also, please check the links (e.g. /docs/components/hyperparameter-tuning/experiment/), since some may be moving to .../katib/.. I think?

@@ -142,7 +142,8 @@ These are the fields in the experiment configuration spec:
* **resumePolicy**: Experiment resume policy. Can be one of `LongRunning`, `Never` or `FromVolume`.
Default value is `LongRunning`.
See the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).
Contributor

Suggested change
See the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).
(Refer to the [`ResumePolicy` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L54).)

Contributor

For other Go files in the doc, can you please apply this logic, if it makes sense to you? It's mainly for a11y, and it also tells the reader why they should "see" the file, since no reason is given originally. I think it can be good practice.
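
For readers following the `resumePolicy` field description quoted in this thread, the allowed values and the default can be illustrated with a small check. This is a minimal sketch under stated assumptions: the helper and constant names are hypothetical and not part of Katib, and the manifest is a placeholder that assumes the `kubeflow.org/v1beta1` API version suggested by the example paths above.

```python
# Hypothetical helper for illustration; not part of Katib or its SDK.
ALLOWED_RESUME_POLICIES = ("LongRunning", "Never", "FromVolume")
DEFAULT_RESUME_POLICY = "LongRunning"  # default stated in the field description


def effective_resume_policy(experiment: dict) -> str:
    """Return the resume policy an experiment manifest asks for.

    Falls back to the documented default when `.spec.resumePolicy` is absent,
    and rejects any value outside the three documented ones.
    """
    policy = experiment.get("spec", {}).get("resumePolicy", DEFAULT_RESUME_POLICY)
    if policy not in ALLOWED_RESUME_POLICIES:
        raise ValueError(f"unsupported resumePolicy: {policy!r}")
    return policy


# Placeholder manifest; only the fields needed for the check are shown.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",  # assumed API version
    "kind": "Experiment",
    "metadata": {"name": "example-experiment"},
    "spec": {"resumePolicy": "FromVolume"},
}

print(effective_resume_policy(experiment))
```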

@andreyvelich andreyvelich force-pushed the issue-1292-resume-experiment-doc branch from 9c71d2e to e1fc105 Compare November 11, 2020 15:26
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich


@andreyvelich andreyvelich changed the title from "[WIP] Add resume policy instructions for Katib experiments" to "Add resume policy instructions for Katib experiments" Nov 11, 2020
@andreyvelich
Member Author

This PR is ready.
/cc @8bitmp3 @gaocegege @johnugeorge

Member

@gaocegege gaocegege left a comment

LGTM 👍
/lgtm

@gaocegege
Member

Please do a rebase.

@andreyvelich
Member Author

/hold

@andreyvelich
Member Author

@gaocegege It's done.
/hold cancel

@gaocegege
Member

LGTM 👍
/lgtm

@k8s-ci-robot k8s-ci-robot merged commit 8c7a5d3 into kubeflow:master Nov 14, 2020

Successfully merging this pull request may close these issues.

[Documentation] Resume Experiment feature
6 participants