
[Discussion] Including a PV for MySQL in the Katib standalone manifests #1479

Closed
andreyvelich opened this issue Mar 17, 2021 · 6 comments

Comments

@andreyvelich
Member

/kind feature

See discussion: #1464 (comment).
Currently, we exclude the PV for MySQL in the katib-with-kubeflow installation and include the PV in the katib-standalone installation.

Our users have faced many problems when using custom volume provisioners: #1415, #1156, #1212.

We should discuss the easiest and most stable way for users to install Katib.

/cc @anneum @hjkkhj123 @Utkagr

@heilerich

One of your users here.

I don't think you have a problem with volume provisioners at all. What's causing all the issues you linked is this overzealous liveness probe:

livenessProbe:
  exec:
    command:
      - "/bin/bash"
      - "-c"
      - "mysqladmin ping -u root -p${MYSQL_ROOT_PASSWORD}"
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5

In environments that have slightly slower storage, the mysql container gets killed in its initialisation phase and never recovers from that. I would suggest increasing the initialDelaySeconds or creating an additional startupProbe as per the official kubernetes docs (see Protect slow starting containers with startup probes). Doing that should resolve all your users' current problems.

See [1], [2]. I can also confirm from personal experience that the empty root password issue (#1212) with the mysql container is caused by the container being killed before it has gotten rid of the temporary initialisation database.

Regarding providing an "easy and stable" way for your users to deploy katib: you should seriously reconsider supplying a static hostPath PV under any circumstance. As I understand it, the mysql database does not store ephemeral data in this case. Supplying a static hostPath PV creates a bunch of problems:

  • Scalability/Resource consumption: The database cannot be evicted/migrated to another host without data loss
  • Availability: If the database node goes down for maintenance or dies, katib will be unavailable
  • Safety: If the database node is recycled or dies, all data will be lost

So unless I am mistaken here and the database indeed stores only ephemeral data, supplying unsuspecting users with a static hostPath PV for non-ephemeral data seems like a recipe for disaster.

@andreyvelich
Member Author

Thank you so much for sharing your thoughts @heilerich!
A couple of questions:

I would suggest increasing the initialDelaySeconds or creating an additional startupProbe as per the official kubernetes docs

What is the value we should specify for the initialDelaySeconds to cover such cases?

You should seriously reconsider supplying a static hostPath PV in any circumstance.

Do we have any other PV type that satisfies any K8s environment, whether it is GCP, AWS, on-prem, etc.?

In general, I agree with you @heilerich. Setting up the proper PV is part of the Kubernetes infrastructure environment, not Katib's responsibility.

We can provide the manifest for the MySQL PV without including it in the Kustomize install. In the documentation, we have to explain that the user's Kubernetes cluster should have a provisioner, or that the user can deploy the PV manifest themselves.

What do you think @gaocegege @johnugeorge @yanniszark ?

@heilerich

What is the value we should specify for the initialDelaySeconds to cover such cases?

My preferred solution would be to add a startupProbe with the same command as the livenessProbe as per the aforementioned kubernetes docs. Maybe with periodSeconds: 15 and failureThreshold: 60. That should guarantee a safe container startup in almost any setup. Additionally, as opposed to just increasing the initialDelaySeconds, you don't have to compromise on delaying the restart of containers that already initialised successfully, but crashed again afterwards. I can't really come up with any downside to giving a database initialisation a generous timeframe to finish with this method.
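
For illustration, such a startupProbe might look roughly like this (a sketch only, reusing the mysqladmin command from the existing livenessProbe and the values suggested above):

startupProbe:
  exec:
    command:
      - "/bin/bash"
      - "-c"
      - "mysqladmin ping -u root -p${MYSQL_ROOT_PASSWORD}"
  # 60 failures x 15s period gives the container up to 15 minutes
  # to finish its initialisation before the kubelet restarts it.
  periodSeconds: 15
  failureThreshold: 60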

Do we have any other PV type that satisfies any K8s environment, whether it is GCP, AWS, on-prem, etc. ?

I don't think you should be including any PV in the default installation manifests. It is common practice to just include the PVC and rely on the dynamic provisioning process to handle the rest. Dynamic provisioning is available from all big cloud providers and comes with many on-premise distributions nowadays. In fact, I have not encountered a PV bundled by default in any software that I have deployed on kubernetes for a long time (of course excluding situations where it is explicitly necessary, like installing drivers through a DaemonSet or physically tying hardware configuration to a node).
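
For illustration, a PVC-only approach could look roughly like this (a sketch with placeholder name and size, not the actual Katib manifest); omitting storageClassName lets the cluster's default StorageClass, or an admin-provisioned PV, satisfy the claim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: katib-mysql          # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi          # placeholder size
  # No storageClassName here: dynamic provisioning via the cluster's
  # default StorageClass (or a manually created PV) binds the claim.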

To me, including a PV with a specific storage flavour feels like actively circumventing the storage considerations that the administrators made for their environment. Which is fine if it is an optional setting, but might be dangerous if it is the default and can be overlooked when deploying katib.

@andreyvelich
Member Author

My preferred solution would be to add a startupProbe with the same command as the livenessProbe as per the aforementioned kubernetes docs. Maybe with periodSeconds: 15 and failureThreshold: 60. That should guarantee a safe container startup in almost any setup. Additionally, as opposed to just increasing the initialDelaySeconds, you don't have to compromise on delaying the restart of containers that already initialised successfully, but crashed again afterwards. I can't really come up with any downside to giving a database initialisation a generous timeframe to finish with this method.

Makes sense, I agree with that. As you said, let's just include failureThreshold in the livenessProbe and startupProbe and make initialDelaySeconds less than 30 seconds.
Users can simply increase failureThreshold if that does not leave enough time to deploy MySQL.
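
For example, the adjusted livenessProbe could look roughly like this (illustrative values only, to be combined with a startupProbe like the one sketched above):

livenessProbe:
  exec:
    command:
      - "/bin/bash"
      - "-c"
      - "mysqladmin ping -u root -p${MYSQL_ROOT_PASSWORD}"
  initialDelaySeconds: 15   # below the current 30s
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 5       # users can raise this if MySQL needs more time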

To me, including a PV with a specific storage flavour feels like actively circumventing the storage considerations that the administrators made for their environment. Which is fine if it is an optional setting, but might be dangerous if it is the default and can be overlooked when deploying katib.

I agree. We should probably also reconsider our default configuration for resuming an Experiment from a volume: https://www.kubeflow.org/docs/components/katib/resume-experiment/#resume-policy-fromvolume.
Currently, we deploy PV + PVC for the Suggestion if it is needed.

It would be great if we could discuss this issue in the next AutoML WG meeting.

cc @gaocegege @johnugeorge @yanniszark

@maanur
Contributor

maanur commented Apr 15, 2021

According to Kubernetes concepts,

A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes.

There is no katib StorageClass definition among the installation manifests, and no related dynamic PV provisioning controller either, so specifying .spec.storageClassName="katib" in the PVC blocks its automatic provisioning.
So only the "cluster administrator" part of the concept remains applicable. But, speaking from experience, if you want to use PVCs in a cluster you are administering, the first step you take is either manually configuring some PVs or deploying a dynamic provisioning controller. In that case, as cluster admin, I do not want anyone to define PVs as they see fit - I've got prepared configurations for that, and in extreme cases I'll check the particular PVC to decide the best way to provision it manually (I'm the cluster administrator, after all).

By the way, the sole requirement I stumbled upon when running MySQL on a PV is setting the PVC's .spec.volumeMode="Filesystem".
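
For reference, a minimal claim with that field set explicitly might look like this (placeholder name and size; Filesystem is also the Kubernetes default when the field is omitted):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: katib-mysql        # placeholder name
spec:
  volumeMode: Filesystem   # explicit, though Filesystem is the default
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi        # placeholder size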

@andreyvelich
Member Author

This has been fixed by #1527 and #1552.
