💡[feat] support local NAS checkpoint storage type #7619

Open
Violin9906 opened this issue Aug 13, 2023 · 1 comment
Labels
feature Feature requests

Comments

@Violin9906

Describe the problem

I'm deploying Determined.ai on our local self-hosted k8s cluster. There is a NAS inside the cluster used as shared storage. However, at the moment, Determined doesn't seem to support using the local NAS directly for checkpoint storage. I have to manually mount the NAS on each node in the cluster using NFS and then set the checkpoint storage type for Determined to shared_fs.

Describe the solution you'd like

Support one of the common NAS protocols for checkpoint storage, such as NFS, SMB, or WebDAV.

Describe alternatives you've considered

For a k8s cluster configured with NFS as a storage class (see this), using a Persistent Volume as checkpoint storage is an alternative.

Additional context

No response

@Violin9906 Violin9906 added the feature Feature requests label Aug 13, 2023
@rb-determined-ai
Member

I think this is a good feature request, and it's not the first time we've heard it. I'll see if I can get somebody on our k8s team to take a closer look.

> I have to manually mount the NAS on each device in the cluster using nfs and then set the checkpoint storage for determined to shared_fs.

In the meantime, I think there is a better workaround than what you're currently doing. If you configure the pod to mount your NAS/SMB/WebDAV/whatever storage at /determined_shared_fs/my_storage (the my_storage name is arbitrary, but /determined_shared_fs is not), you can then configure the shared_fs checkpoint storage in your experiment to be something like:

checkpoint_storage:
  type: shared_fs
  host_path: /any/valid/directory/at/all
  storage_dir: my_storage

Under that configuration, our system will mount each node's /any/valid/directory/at/all at /determined_shared_fs. Inside that directory, k8s will have mounted your NAS/SMB/WebDAV/whatever storage at /determined_shared_fs/my_storage, and our Python libraries will automatically use the full /determined_shared_fs/my_storage path for all checkpoint information, so the extra mount from the node's filesystem is effectively ignored.
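
For illustration, a full experiment config along these lines might look roughly like the sketch below. It assumes the pod is customized through environment.pod_spec, that the task container is named determined-container, and that the NAS is reachable over NFS; the server address, export path, and volume name are made-up placeholders, not values from this thread.

```yaml
# Sketch only: server, path, and volume name are hypothetical placeholders.
environment:
  pod_spec:
    apiVersion: v1
    kind: Pod
    spec:
      containers:
        - name: determined-container   # assumed name of the task container
          volumeMounts:
            - name: nas-checkpoints
              # my_storage is arbitrary; /determined_shared_fs is not
              mountPath: /determined_shared_fs/my_storage
      volumes:
        - name: nas-checkpoints
          nfs:
            server: nas.example.internal   # hypothetical NAS address
            path: /exports/checkpoints     # hypothetical export path
checkpoint_storage:
  type: shared_fs
  host_path: /any/valid/directory/at/all
  storage_dir: my_storage   # must match the last component of mountPath
```

The nfs volume is only one option here; any volume type k8s can mount into the pod (for example a persistentVolumeClaim backed by an NFS storage class) should work the same way, as long as it ends up at /determined_shared_fs/my_storage.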
