
xcube to write sparse zarrs #688

Closed
AliceBalfanz opened this issue May 18, 2022 · 6 comments · Fixed by #729

Comments

@AliceBalfanz
Contributor

AliceBalfanz commented May 18, 2022

Is your feature request related to a problem? Please describe.
When I create an xcube dataset in Zarr format, all chunks are written to disk, even those that consist entirely of NaNs. After writing the cube, I apply xcube prune to get rid of the empty chunks and save disk space. Zarr introduced in a recent release (https://zarr.readthedocs.io/en/stable/release.html#release-2-11-0) an option so that NaN chunks are not written to disk at all. xcube should use that right away to save disk space and (user) time.
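
For illustration (not part of the original report), a minimal sketch of the Zarr 2.11+ behaviour, written directly against zarr-python; the path, shape and chunking are made up. With write_empty_chunks=False, chunks that contain only the fill value (NaN here) are not stored at all:

import numpy as np
import zarr

# Create an array whose fill value is NaN and tell Zarr not to store
# chunks that are uniformly equal to that fill value (new in Zarr 2.11).
z = zarr.open(
    "example.zarr", mode="w",
    shape=(4, 1000, 1000), chunks=(1, 1000, 1000),
    dtype="f8", fill_value=float("nan"),
    write_empty_chunks=False,
)
z[0] = np.random.rand(1000, 1000)  # only this chunk gets a file on disk
# z[1:] stays all-NaN, so no chunk files are written for those slices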

To reproduce:
I have an xcube env with zarr version 2.11.3

  1. I write a cube in which I know whole chunks filled with NaNs exist.
  2. I make a list of all files in the zarr directory.
  3. I apply xcube prune to the newly created zarr.
  4. I make a list of the zarr directory again and compare it to the previous list. The list after pruning is much shorter than the initial one, which shows that xcube does not yet actively use zarr's enhancement (see the sketch below).
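
A rough sketch of these steps (the dataset, variable names and paths are made up, and the exact xcube prune invocation may differ):

import os
import subprocess

import numpy as np
import xarray as xr

# 1. Write a cube in which whole chunks are known to consist only of NaNs.
data = np.full((4, 180, 360), np.nan)
data[0] = 1.0  # only the first time step carries real values
cube = xr.Dataset({"var": (("time", "lat", "lon"), data)}).chunk({"time": 1})
cube.to_zarr("cube.zarr", mode="w")

# 2. List all files in the zarr directory.
before = sorted(os.path.join(r, f) for r, _, fs in os.walk("cube.zarr") for f in fs)

# 3. Apply xcube prune to the newly created zarr (CLI options may differ).
subprocess.run(["xcube", "prune", "cube.zarr"], check=True)

# 4. List the files again and compare: the second list is much shorter, because
#    the all-NaN chunks were written by to_zarr() and only removed by prune.
after = sorted(os.path.join(r, f) for r, _, fs in os.walk("cube.zarr") for f in fs)
print(len(before), len(after))
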
@elliot-lvs

Hi @AliceBalfanz,
I am experiencing the same problem. I have also noticed that with the new rasterize_features algorithm, writing to disk takes much longer than necessary, and the resulting cube contains NaN chunks that are written to disk unnecessarily.
Adopting the library's solution should fix at least part of the bug.

Is it normal that it takes so long for improvements introduced in the underlying libraries to be adopted?

@forman
Member

forman commented Aug 23, 2022

@AliceBalfanz and @elliot-lvs, can you please explain which function you use to write your datasets to disk?

@elliot-lvs

raster = rasterize_features(...)
raster.to_zarr(path, mode='w')

@forman
Member

forman commented Aug 26, 2022

I have also noticed that with the new rasterize_features algorithm, writing to disk takes much longer than necessary

Can you explain "necessary", please? Do you mean it has been faster before?

Is it normal that it takes so long for improvements introduced in the underlying libraries to be adopted?

There is no special code in xcube that should prevent xarray's to_zarr() method from dropping NaN chunks. We'll have a look ASAP to find out what's going wrong.

@elliot-lvs

"Necessary" just referred to the timings obtained relative to the small amount of data being written (even on an SSD).

I honestly don't know whether it was faster before, but when I don't use rasterize_features, I get the impression that more data is written in the same amount of time. I'm working with a 2-3 MB Zarr: rasterize_features() takes less than a second, while writing the result to the SSD takes more than 2 minutes.

@forman
Member

forman commented Sep 12, 2022

@AliceBalfanz @elliot-lvs

Just checked: the problem is that xarray does not yet exploit the new Zarr feature, introduced in Zarr 2.11, of not writing empty chunks. When I force the related Zarr encoding option write_empty_chunks=False, I get an error:

encodings = {
    var_name: {**var.encoding, "write_empty_chunks": False}
    for var_name, var in dataset.data_vars.items()
}
dataset.to_zarr(path, mode="w", encoding=encodings)
ValueError: unexpected encoding parameters for zarr backend:  ['write_empty_chunks']
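
Not part of this thread, just a possible interim workaround while xarray rejects the option: re-open each written variable with zarr itself, enable write_empty_chunks=False there, and rewrite its data; Zarr then deletes chunk files that are uniformly equal to the fill value. A rough sketch (the helper name and path are made up, and it re-reads whole arrays into memory, so it is only illustrative for small cubes):

import zarr

def drop_all_nan_chunks(zarr_path):
    # Hypothetical helper: re-open every top-level array with the Zarr 2.11+
    # option and rewrite its data in place; chunks that are uniformly equal
    # to the fill value (e.g. NaN) then have their chunk files deleted.
    group = zarr.open_group(zarr_path, mode="r+")
    for name, _ in group.arrays():
        arr = zarr.open_array(zarr_path, path=name, mode="r+",
                              write_empty_chunks=False)
        arr[...] = arr[...]  # inefficient, but triggers Zarr's empty-chunk check

drop_all_nan_chunks("cube.zarr")  # path is illustrative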
