
Zarr DirectoryStore with temporary directory-based transaction #247

Open
jakirkham opened this issue Mar 22, 2018 · 3 comments

Comments

@jakirkham
Member

On some file systems, operations like writing and deleting can be slow and sometimes fail. In these cases it would be nice to use a temporary directory for intermediate steps. For instance, chunks could be written in a temporary directory, with each chunk moved into place only once its write completes. Similarly, deleting could simply move the content into a temporary directory and perform the actual deletion there. As rename is atomic on POSIX systems (within a single file system), the DirectoryStore is never left in an incomplete state by a write or a delete. Deletion could also be performed in parallel more easily, since the content already appears deleted even though it has merely been moved to a temporary directory where it is still being cleaned up.
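A minimal sketch of the idea, using only the Python standard library rather than any existing Zarr API. The function names (`atomic_write_chunk`, `detach_for_deletion`, `purge`) and the `.ztmp` directory name are hypothetical, and the sketch assumes the temporary directory lives on the same file system as the store so that the rename is atomic.

```python
import os
import shutil
import tempfile
import uuid


def atomic_write_chunk(store_root, key, data, tmp_name=".ztmp"):
    """Write `data` (bytes) to store_root/key via a staged file and a rename."""
    # Hypothetical layout: the staging directory sits inside the store root,
    # so it is on the same file system as the final destination.
    tmp_dir = os.path.join(store_root, tmp_name)
    os.makedirs(tmp_dir, exist_ok=True)
    dest = os.path.join(store_root, key)
    os.makedirs(os.path.dirname(dest), exist_ok=True)

    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # The chunk becomes visible under its final name in one step.
        os.replace(tmp_path, dest)
    except BaseException:
        # Do not leave a partial file behind if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def detach_for_deletion(store_root, key, tmp_name=".ztmp"):
    """Move store_root/key out of the store; return the path still to be purged."""
    tmp_dir = os.path.join(store_root, tmp_name)
    os.makedirs(tmp_dir, exist_ok=True)
    trash = os.path.join(tmp_dir, "trash-" + uuid.uuid4().hex)
    os.rename(os.path.join(store_root, key), trash)
    return trash


def purge(trash_path):
    """Slow part of a delete; can run later, in the background or in parallel."""
    if os.path.isdir(trash_path):
        shutil.rmtree(trash_path)
    else:
        os.remove(trash_path)
```

After `detach_for_deletion` returns, the key no longer exists in the store, and `purge` can be handed to a separate worker or job.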

@alimanfoo
Member

alimanfoo commented Mar 23, 2018 via email

@jakirkham
Member Author

Yeah, on NFS deletion is really slow, at least in my experience. Currently we move stuff to a temporary directory on the same filesystem and submit separate jobs to do the deletion in the background to work around this issue. Unfortunately, it's actually that bad.

Was coming around to the same idea of having a Zarr-local temporary directory (e.g. .ztmp 😉) as a staging ground.

@shoyer
Contributor

shoyer commented Jun 24, 2021

A use-case for this feature has come up for us when writing to Zarr from jobs that may occasionally be pre-empted, e.g., using pre-emptible VMs on Google Cloud. This is generally very cost effective, but does mean that jobs need to be robust to being restarted at any time. Atomic writes for individual files are helpful, but ideally we would also like to either write the full Zarr store, or nothing at all.

This might be done with context managers, e.g.,

with zarr.transaction():
    dataset.to_zarr(...)  # using xarray

Or alternatively, perhaps via a custom store class, e.g.,

store = AtomicWriteStore(...)
dataset.to_zarr(store)
store.commit()
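Neither `zarr.transaction()` nor `AtomicWriteStore` is an existing API as far as this thread shows; both are proposals. As a rough sketch of the all-or-nothing behaviour for a directory-backed store, the hypothetical `directory_transaction` helper below stages everything in a sibling temporary directory and renames it into place only if the body finishes without error. It assumes the final path does not exist yet and that the staging directory is on the same file system.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def directory_transaction(final_path):
    """Yield a staging directory; rename it to `final_path` only on success."""
    parent = os.path.dirname(os.path.abspath(final_path))
    os.makedirs(parent, exist_ok=True)
    # Stage next to the destination so the final rename stays on one file system.
    staging = tempfile.mkdtemp(prefix=".ztmp-", dir=parent)
    try:
        yield staging
        # Assumption: final_path does not exist yet. Renaming onto an
        # existing non-empty directory raises OSError.
        os.rename(staging, final_path)
    except BaseException:
        # Failed or interrupted body: discard the partial store.
        shutil.rmtree(staging, ignore_errors=True)
        raise


# Hypothetical usage with xarray, which accepts a path as the store:
#
#     with directory_transaction("out.zarr") as staging:
#         dataset.to_zarr(staging)
```

If the process is killed outright (as with a pre-empted VM), the cleanup branch never runs, so a leftover `.ztmp-*` directory may remain; the final path still either holds the complete store or does not exist.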
