
Save Suggestion state in persistent volume #1250

Closed
andreyvelich opened this issue Jun 30, 2020 · 15 comments · Fixed by #1275

Comments

@andreyvelich
Member

andreyvelich commented Jun 30, 2020

/kind feature

To continue the idea proposed in #1062:
We would like to attach a persistent volume to every deployed Suggestion, so that the Suggestion state is saved after the corresponding pod is deleted.
To implement this we should follow these steps:

  1. Extend ResumePolicyType with a new type. My idea is to name it VolumeSource; any other ideas?

  2. Add a new Storage Class YAML to the Katib deployment manifests (see the sketch after this list). I am not sure what the default provisioner should be for us, since Kubernetes doesn't support dynamic volume provisioning for local storage (https://kubernetes.io/blog/2019/04/04/kubernetes-1.14-local-persistent-volumes-ga/#limitations-of-ga).
    We can be specific to GKE and use Persistent Disks (https://kubernetes.io/docs/concepts/storage/storage-classes/#gce-pd), or use a third-party local path provisioner (https://github.com/rancher/local-path-provisioner), which requires an additional controller.
    What do you think?

  3. Implement new logic in the controller:

  • Create the PVC when the user submits an Experiment.
  • Attach the PVC to the Suggestion deployment.
  • Delete the Suggestion resources when the Experiment has succeeded.
  • Restore the Suggestion resources when the Experiment is resumed.
  4. Extend katib-config with new parameters for the Suggestion PVC.
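For illustration, here is a minimal sketch of what such a Storage Class could look like if we go with the third-party local path provisioner; the class name is a placeholder, and the provisioner itself must be deployed separately:

```yaml
# Hypothetical sketch only: a StorageClass backed by the Rancher
# local-path provisioner (https://github.com/rancher/local-path-provisioner).
# The name "katib-suggestion" is a placeholder, not a decided value.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: katib-suggestion
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```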

What do you think @gaocegege @johnugeorge ?

/cc @sperlingxx @c-bata @jlewi
/priority p0


@jlewi
Contributor

jlewi commented Jul 6, 2020

Why use pickling to store data as opposed to YAML?
I'm lacking some background. What precisely does persisting a suggestion mean? Is a suggestion just a set of hyperparameters?

If you use PDs are you going to have to manage disks?

Could you persist to object store or some other easily managed datastore?

Would it make sense to treat this data as metadata and store in metadata store?

@johnugeorge
Member

johnugeorge commented Jul 6, 2020

Regarding 1, we will need consistent naming with the other values.
@gaocegege Regarding 2 (adding the new Storage Class YAML), how is it handled?

The rest looks good to me.

@andreyvelich
Member Author

Thanks for the comment @jlewi.

Why use pickling to store data as opposed to YAML?
I'm lacking some background. What precisely does persisting a suggestion mean? Is a suggestion just a set of hyperparameters?

A Suggestion is just a Kubernetes deployment running a script with an HP or NAS algorithm that produces new Trial parameters (in the case of HP tuning, hyperparameters) from the search space.

Currently, when a user submits an Experiment, the controller creates this Suggestion deployment. When the Experiment is finished, the Suggestion deployment can either be deleted or kept always running (if the user wants to resume the Experiment later).
To avoid wasting resources on an always-running Kubernetes deployment, we want to save the Suggestion script's state in a Persistent Volume after the Experiment is finished.
Some HP algorithms have internal variables that need to be retained before resuming an experiment. Algorithm maintainers can use this PV if they want to support resuming experiments in their Suggestion.

I think one mechanism to save the Suggestion Python script's state could be pickling the executable class.

If you use PDs are you going to have to manage disks?

I am not sure that we want to manage them, because we should not be specific to GCP. The question is, what should the default structure for the Storage Class and PVCs be?

Could you persist to object store or some other easily managed datastore?

Do you have any ideas about what that could be on Kubernetes?

Would it make sense to treat this data as metadata and store in metadata store?

Can we save serialized objects to the metadata store?

@jlewi
Contributor

jlewi commented Jul 7, 2020

Some HP algorithms have internal variables that need to be retained before resuming an experiment. Algorithm maintainers can use this PV if they want to support resuming experiments in their Suggestion.

Why is the experiment manager managing the internal storage of individual HP algorithms? Why not adopt a microservice architecture? What if different algorithms require different types of internal storage?

e.g. suppose one algorithm needs to store a couple of meta-parameters, so a YAML file works well, vs. another algorithm that needs to store time series using a time-series database.

Why can't HP tuner algorithm authors configure their own storage backend? e.g. the algorithm author provides a kustomize package to deploy their algorithm, and it is parameterized depending on the storage they accept, e.g. PVC, S3/GCS URL, SQL database, etc.

If you don't want to waste resources when the service isn't being used, can't we use autoscaling for that? e.g. deploy the suggestion service for that algorithm using Knative?
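For illustration, a minimal sketch of what a scale-to-zero Suggestion could look like as a Knative Service; the name, image, and port are placeholders, and Katib does not deploy Suggestions this way today:

```yaml
# Hypothetical sketch only: a suggestion service deployed with Knative
# Serving so that it scales to zero when idle. Names and image are
# placeholders; this is not how Katib currently deploys Suggestions.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: suggestion-random
  namespace: kubeflow
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "0"  # allow scale to zero
    spec:
      containers:
        - image: docker.io/kubeflowkatib/suggestion-hyperopt
          ports:
            - name: h2c  # gRPC over HTTP/2
              containerPort: 6789
```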

I am not sure that we want to manage them, because we should not be specific to GCP. The question is, what should the default structure for the Storage Class and PVCs be?

This isn't specific to GCP. PVCs imply volumes, which in many cases map to some form of "disk", as opposed to a network or cloud filesystem.

@andreyvelich
Member Author

Why can't HP tuner algorithm authors configure their own storage backend? e.g. the algorithm author provides a kustomize package to deploy their algorithm, and it is parameterized depending on the storage they accept, e.g. PVC, S3/GCS URL, SQL database, etc.

This is exactly what we want to do, but using katib-config for it. The user can specify various parameters for the storage backend. First, we can start with different PVCs (e.g. local storage, GCEPersistentDisk, etc.). Later we can support other storage backends, like an SQL database, if needed.

The question with PVCs is what the default technique should be if the user doesn't want to make any configuration changes but still wants to use the Resume Experiment feature.

If you don't want to waste resources when the service isn't being used, can't we use autoscaling for that? e.g. deploy the suggestion service for that algorithm using Knative?

Can you please show some examples from Knative projects where this is used?

@jlewi
Contributor

jlewi commented Jul 10, 2020

@andreyvelich I would suggest talking to the KFServing folks to better understand Knative autoscaling.

@jlewi
Contributor

jlewi commented Jul 10, 2020

@andreyvelich Is katib-config providing the config for the suggestion microservices? Did you consider a microservice architecture? e.g. for each suggestion service, have a set of YAML files describing its configuration (e.g. Deployment, ConfigMap, PVCs if needed).

The other parts of Katib, e.g. katib-config, could then just take the URL of the suggestion endpoint.

@andreyvelich
Member Author

@andreyvelich Is katib-config providing the config for the suggestion microservices? Did you consider a microservice architecture? e.g. for each suggestion service, have a set of YAML files describing its configuration (e.g. Deployment, ConfigMap, PVCs if needed).

Yes, it was created to give the user additional control over the Suggestion service/deployment: https://www.kubeflow.org/docs/components/hyperparameter-tuning/katib-config/#suggestion-settings.
And we can easily extend it with settings for the different storage backends that we would like to support; see the sketch below.
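To make this concrete, here is a hypothetical shape for such an extension of the suggestion settings in the katib-config ConfigMap; the volumeMountPath and persistentVolumeClaimSpec fields are illustrative only, not an agreed API:

```yaml
# Hypothetical sketch only: extending katib-config suggestion settings
# with storage parameters. The added field names below are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  suggestion: |-
    {
      "random": {
        "image": "docker.io/kubeflowkatib/suggestion-hyperopt",
        "volumeMountPath": "/opt/katib/data",
        "persistentVolumeClaimSpec": {
          "accessModes": ["ReadWriteOnce"],
          "resources": {
            "requests": {
              "storage": "1Gi"
            }
          }
        }
      }
    }
```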

@jlewi
Contributor

jlewi commented Jul 13, 2020

Yes, it was created to give the user additional control over the Suggestion service/deployment:

Why doesn't the operator (the person deploying the suggestion services) just configure the service/deployment directly? What's the point of creating a layer of abstraction around that with a ConfigMap?

e.g. suppose I have two suggestion services:

  • NAS
  • GridSearch

Each of these may have different backend requirements. Let's suppose NAS uses an SQL DB and GridSearch uses an object store. Each of them would then have YAML manifests for the K8s resources (Deployment, Service, etc.) that they need. The operator would just customize them as needed.

@andreyvelich
Member Author

e.g. suppose I have two suggestion services:

  • NAS
  • GridSearch

Each of these may have different backend requirements. Let's suppose NAS uses an SQL DB and GridSearch uses an object store. Each of them would then have YAML manifests for the K8s resources (Deployment, Service, etc.) that they need. The operator would just customize them as needed.

@jlewi We do not give users the ability to define the whole YAML manifest for the Suggestion resource. This keeps it very easy to submit a Katib Experiment and get results: the controller creates the K8s Deployment and Service for the Suggestion automatically.

For users who would like to modify the default Suggestion installation, we provide a few settings, e.g. the Service Account name.

And my thought is to add another setting that represents the volume technique. As I said, we can start with various Storage Class provisioners.

@andreyvelich
Member Author

A few thoughts after the discussion at the Katib meeting:

  1. As @gaocegege mentioned, Kubernetes HPA and Knative autoscaling are designed for stateless services, but most Katib Suggestion services are stateful.
    Some Suggestions have their own DB to store internal data. Because of that, having external storage for the Suggestion service state makes sense.

  2. To avoid problems with supporting different provisioners for the Storage Class (e.g. persistent disks), we want to manually deploy a PVC and PV with local storage for the Suggestion in every Experiment (see the sketch below). We are not sharing a PV across multiple Suggestions, for security purposes in multi-user infrastructures.

If the user wants to use this feature, the controller creates the PV and PVC and binds them to the Experiment's Suggestion deployment.

Later, we can add functionality for the user to specify a Storage Class name in the Katib config for the Suggestion, because some Kubernetes clusters don't support creating PVs manually.
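As a rough sketch, the per-Experiment resources the controller would create could look like the following; all names, sizes, and paths are placeholders, and the Storage Class would be a no-provisioner class for manually created local PVs:

```yaml
# Hypothetical sketch only: a manually created local PV and its PVC,
# one pair per Experiment's Suggestion. The names, node hostname,
# host path, size, and class name are all placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-experiment-random
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: katib-suggestion
  local:
    path: /tmp/katib/suggestions/my-experiment
  nodeAffinity:  # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-experiment-random
  namespace: kubeflow
spec:
  storageClassName: katib-suggestion
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```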

What do you think @gaocegege @johnugeorge @jlewi ?


@gaocegege
Member

LGTM

@andreyvelich
Member Author

/assign @andreyvelich
