feat: directory checkpoint storage [DET-9594] #8255
b818980 to 5fa8f88
backend +1
@@ -154,7 +156,7 @@ def test_start_tensorboard_for_multi_experiment(tmp_path: Path, secrets: Dict[st
 @pytest.mark.e2e_cpu
-def test_start_tensorboard_with_custom_image(tmp_path: Path) -> None:
+def test_start_tensorboard_with_custom_image() -> None:
👍
LGTM
- ``containerPath``: The file system path inside the task pods to use. The checkpoints and
  tensorboard data will be stored under this path.
tensorboard should be TensorBoard in all instances in the text
When downloading checkpoints (e.g., using ``det checkpoint download``), we assume the same
directory is present locally at the same ``container_path``.
.. warning::
   When downloading checkpoints (e.g., using ``det checkpoint download``), Determined assumes the same
   directory is present locally at the same ``container_path``.
suggested edits
9b99693 to 03ec7aa
merging over a lint-react flake
Description
Introduce a new checkpoint storage type, `directory`, which writes checkpoints and TensorBoard files to a local directory. Intended use cases are 1) using a k8s PVC for checkpoint storage and 2) better representing the detached mode storage paradigm.
See included docs for the usage details.
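As a rough illustration of what the new storage type looks like in an experiment config, here is a minimal Python sketch that builds the relevant fragment. The helper name `directory_checkpoint_storage` is hypothetical, and the key names (`checkpoint_storage`, `type`, `container_path`) are assumed from the docs text quoted in the review above; the included docs remain the authoritative reference.

```python
import json


def directory_checkpoint_storage(container_path: str) -> dict:
    """Build a config fragment for the `directory` checkpoint storage type.

    Hypothetical helper for illustration only; the key names are assumed
    from the docs text in this PR, not taken from a verified schema.
    """
    return {
        "checkpoint_storage": {
            "type": "directory",
            # Path inside the task container where checkpoints and
            # TensorBoard files are written.
            "container_path": container_path,
        }
    }


# /tmp/somepath2 is the path used in the test plan of this PR.
config_fragment = directory_checkpoint_storage("/tmp/somepath2")
print(json.dumps(config_fragment, indent=2))
```

The fragment would be merged into the rest of the experiment config; mounting the directory itself (bind mount or PVC) is configured separately.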
Test Plan
Automated test for bind-mounted storage on agent RM is included:
k8s pod spec setup:
For example, using a GKE cluster, make a yaml for the PVC and create it using `kubectl create -f <filename>`.
Run an experiment with the following extra content in the config, which will mount and use the PVC. Since our PVC is a simple GKE PVC, it can only be mounted on a single node, so this must be a single-slot experiment (or multi-slot but single-node, if you're brave enough).
Then, run a notebook using the same PVC to check it's there:
det notebook start --config resources.slots=0 --config-file pvc-notebook.yaml
In the notebook, in `/tmp/somepath2`, you should see a bunch of UUIDs for the checkpoints. Inside the notebook, you can run `det e list-checkpoints <experiment id>`, find a checkpoint UUID, and then run `det checkpoint download <checkpoint uuid>`. This should "download" the checkpoint into the current directory.
Commentary (optional)
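The "download" step in the test plan can be pictured as a plain directory copy, since the checkpoint already sits on a locally visible path. The sketch below is an illustration only, not Determined's implementation: `fetch_checkpoint` is a hypothetical helper, and it assumes each checkpoint is a directory named by its UUID directly under the storage path (as the listing in `/tmp/somepath2` suggests) and that the copy lands in a subdirectory named after the UUID.

```python
import shutil
import tempfile
from pathlib import Path


def fetch_checkpoint(container_path: Path, checkpoint_uuid: str, dest: Path) -> Path:
    """Copy one checkpoint directory out of directory-type storage.

    Hypothetical helper; assumes the layout <container_path>/<uuid>/...
    """
    src = container_path / checkpoint_uuid
    if not src.is_dir():
        # This is the failure mode the docs warning describes: the storage
        # directory is not present locally at the expected path.
        raise FileNotFoundError(f"no checkpoint directory at {src}")
    target = dest / checkpoint_uuid
    shutil.copytree(src, target)
    return target


# Demo against throwaway directories standing in for the mounted storage
# path and the current working directory.
storage = Path(tempfile.mkdtemp())
workdir = Path(tempfile.mkdtemp())
fake_uuid = "00000000-0000-0000-0000-000000000000"  # placeholder, not a real checkpoint
(storage / fake_uuid).mkdir()
(storage / fake_uuid / "state.bin").write_bytes(b"weights")

copied = fetch_checkpoint(storage, fake_uuid, workdir)
print(sorted(p.name for p in copied.iterdir()))
```

If the storage directory is missing on the machine running the command, the copy fails, which is why the docs warn that the same directory must be present locally at the same `container_path`.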
Checklist
docs/release-notes/
.See Release Note for details.
Ticket