Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: directory checkpoint storage [DET-9594] #8255

Merged
merged 3 commits into from
Oct 31, 2023
Merged

Conversation

ioga
Copy link
Contributor

@ioga ioga commented Oct 26, 2023

Description

Introduce a new checkpoint storage type directory which writes the checkpoints and tensorboard files to a local directory.

Intended use cases are 1) using k8s PVC for ckpt storage 2) better representing detached mode storage paradigm

See included docs for the usage details.

Test Plan

Automated test for bind-mounted storage on agent RM is included:

  1. bind mount a directory into the experiment
  2. configure the same directory for storage using the new option
  3. make sure the data is saved to the mounted directory

k8s pod spec setup:

  1. make a PVC inside a k8s cluster, e.g. on GKE.
  2. mount it into an experiment using the pod_spec.
  3. make sure the data is on the PVC

For example, using a GKE cluster, make a yaml for PVC and create it using kubectl create -f <filename>

# pvc-demo.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard-rwo

Run an experiment with the following extra content in the config, which'll mount and use the PVC. Since our PVC is a simple GKE PVC, it can only be mounted on a single node, so this must be a single slot experiment (or multi-slot but single-node, if you're brave enough).

# appended to your exp config yaml
checkpoint_storage:
  type: directory
  container_path: /tmp/somepath2/

environment:
  pod_spec:
    apiVersion: v1
    kind: Pod
    spec:
      volumes:
        - name: pvc-demo-vol
          persistentVolumeClaim:
            claimName: pvc-demo
      containers:
        - name: determined-container
          volumeMounts:
            - name: pvc-demo-vol
              mountPath: /tmp/somepath2

Then, run a notebook using the same PVC to check it's there:

# pvc-notebook.yaml
environment:
  pod_spec:
    apiVersion: v1
    kind: Pod
    spec:
      volumes:
        - name: pvc-demo-vol
          persistentVolumeClaim:
            claimName: pvc-demo
      containers:
        - name: determined-container
          volumeMounts:
            - name: pvc-demo-vol
              mountPath: /tmp/somepath2

det notebook start --config resoureces.slots=0 --config-file pvc-notebook.yaml

In the notebook in /tmp/somepath2, you should see a bunch of uuids for the checkpoints.
Inside the notebook, you can run det e list-checkpoints <experiment id>, find a checkpoint uuid, and then det checkpoint download <checkpoint uuid>. this should "download" the checkpoint into the current directory.

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@cla-bot cla-bot bot added the cla-signed label Oct 26, 2023
@netlify
Copy link

netlify bot commented Oct 26, 2023

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 03ec7aa
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/654136459fa021000852c8b4
😎 Deploy Preview https://deploy-preview-8255--determined-ui.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@determined-ci determined-ci requested a review from a team October 26, 2023 19:02
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Oct 26, 2023
@ioga ioga changed the title feat: directory checkpoint storage. feat: directory checkpoint storage [DET-9594] Oct 26, 2023
@ioga ioga marked this pull request as ready for review October 26, 2023 22:50
@ioga ioga requested review from a team as code owners October 26, 2023 22:50
Copy link
Contributor

@stoksc stoksc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backend +1

@mpkouznetsov mpkouznetsov requested review from mpkouznetsov and removed request for tayritenour October 26, 2023 23:20
@@ -154,7 +156,7 @@ def test_start_tensorboard_for_multi_experiment(tmp_path: Path, secrets: Dict[st


@pytest.mark.e2e_cpu
def test_start_tensorboard_with_custom_image(tmp_path: Path) -> None:
def test_start_tensorboard_with_custom_image() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mpkouznetsov mpkouznetsov self-requested a review October 26, 2023 23:56
Copy link
Contributor

@mpkouznetsov mpkouznetsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


- ``containerPath``: The file system path inside the task pods to use. The checkpoints and
tensorboard data will be stored under this path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tensorboard should be TensorBoard in all instances in the text


When downloading checkpoints (e.g., using ``det checkpoint download``), we assume the same
directory is present locally at the same ``container_path``.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. warning::

When downloading checkpoints (e.g., using det checkpoint download), Determined assumes the same
directory is present locally at the same container_path.


When downloading checkpoints (e.g., using ``det checkpoint download``), we assume the same
directory is present locally at the same ``container_path``.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. warning::

When downloading checkpoints (e.g., using det checkpoint download), Determined assumes the same
directory is present locally at the same container_path.

Copy link
Member

@tara-det-ai tara-det-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggested edits

@determined-ci determined-ci requested a review from a team October 31, 2023 16:46
@ioga
Copy link
Contributor Author

ioga commented Oct 31, 2023

merging over a lint-react flake

@ioga ioga merged commit f43acd8 into main Oct 31, 2023
70 of 82 checks passed
@ioga ioga deleted the ckpt-storage-directory branch October 31, 2023 17:52
rb-determined-ai pushed a commit that referenced this pull request Oct 31, 2023
@dannysauer dannysauer added this to the 0.26.3 milestone Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants