`to_zarr()` is extremely slow writing to high latency store (#277)
Many, many ideas for improvements. The Zarr backend we wrote was really meant to be an MVP; it absolutely needs some work. Here's my diagnosis:
My approach to (2) is to rethink the Zarr-Python API for creating hierarchies. You may be interested in the discussion here: zarr-developers/zarr-python#1569
Awesome, thanks for the info! I imagine (1) would require reimplementing a good chunk of
In the meantime, this is plenty fast for the small data case:

```python
from tempfile import TemporaryDirectory

def to_zarr(dt, path):
    # Write to a local temporary directory first, then bulk-upload the
    # result; `fs` is assumed to be an fsspec filesystem for the remote store.
    with TemporaryDirectory() as tmp_path:
        dt.to_zarr(tmp_path)
        fs.put(tmp_path, path, recursive=True)
```

Takes 1s on my example above instead of 3m.
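Much of the speedup from `fs.put(..., recursive=True)` comes from uploading the many small Zarr files concurrently rather than one request at a time. A minimal sketch of that pattern (the `put_one` callable is hypothetical, standing in for a single-object upload; this is not fsspec's internal code):

```python
import pathlib
from concurrent.futures import ThreadPoolExecutor

def upload_dir(local_dir, put_one, max_workers=32):
    """Upload every file under local_dir concurrently.

    put_one is assumed to upload a single file (e.g. one Zarr chunk
    or metadata document); a thread pool overlaps the per-request
    latency so total wall time is no longer latency * file count.
    """
    files = [p for p in pathlib.Path(local_dir).rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # map drains the iterator, waiting for all uploads to finish
        list(ex.map(put_one, files))
    return len(files)
```

Since a small DataTree store is dominated by tiny metadata files, overlapping requests like this is what makes the temp-dir workaround fast.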
@slevang would you mind performing the same test with
Looks like things are better but still very slow. The example in the OP now takes just over a minute on latest versions writing to GCS. I've done a little profiling, and the fundamental problem is still that we're synchronously creating each group via a separate

To make this significantly better, unfortunately I think we need to drop the reliance on

I'll do a little more digging and reopen on xarray.
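The synchronous-per-group problem described above can be illustrated with a small, purely hypothetical sketch (plain asyncio; these names are not datatree's or zarr-python's actual API): issuing all group creations concurrently means paying the store round-trip latency once, not once per node.

```python
import asyncio

async def create_group(path):
    # Stand-in for one store round-trip that creates a Zarr group
    # (assumption: the operation is latency-bound, not CPU-bound).
    await asyncio.sleep(0)
    return path

async def create_all(paths):
    # Fire all group creations concurrently instead of one by one;
    # gather preserves input order in its results.
    return await asyncio.gather(*(create_group(p) for p in paths))

# 13 nodes, matching the tree in the OP
paths = [f"node{i}" for i in range(13)]
created = asyncio.run(create_all(paths))
```

With real network latency, the concurrent version completes in roughly one round-trip time rather than thirteen.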
Unbearably so, I would say. Here is an example with a tree containing 13 nodes and negligible data, trying to write to S3/GCS with `fsspec`:

Gives:
I suspect one of the culprits may be that we're having to reopen the store without consolidated metadata on writing each node:

datatree/datatree/io.py, lines 205 to 223 at 433f78d
Any ideas for easy improvements here?
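One easy-looking direction for the reopen problem is to cache the opened store handle so each node write reuses it instead of paying a fresh (unconsolidated) open. A minimal sketch of the idea (the `open_store` function is hypothetical, not datatree's actual code):

```python
from functools import lru_cache

OPEN_CALLS = 0

@lru_cache(maxsize=None)
def open_store(url):
    # Stand-in for an expensive remote open; without consolidated
    # metadata this can mean many round-trips (assumption).
    global OPEN_CALLS
    OPEN_CALLS += 1
    return {"url": url}

# Writing 13 nodes now reuses the single cached store handle.
for node in range(13):
    store = open_store("gs://bucket/tree.zarr")

print(OPEN_CALLS)  # 1
```

Whether caching is safe depends on how the real store object tracks write state, so this is only a sketch of the direction, not a drop-in fix.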