
xcube to write sparse zarrs #688

Closed
AliceBalfanz opened this issue May 18, 2022 · 6 comments · Fixed by #729

Comments

@AliceBalfanz
Contributor

AliceBalfanz commented May 18, 2022

Is your feature request related to a problem? Please describe.
When I create an xcube dataset in Zarr format, all chunks are written to disk, even those that consist entirely of NaNs. After writing the cube, I apply xcube prune to get rid of the empty chunks and save disk space. Zarr introduced in a recent release (https://zarr.readthedocs.io/en/stable/release.html#release-2-11-0) an option so that NaN chunks are not written to disk at all. xcube should use that right away to save disk space and (user) time.
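
For illustration (not part of the original report), a minimal sketch of the Zarr 2.11+ behaviour, written directly against zarr-python; the path, shape and chunking are made up. With write_empty_chunks=False, chunks that contain only the fill value (NaN here) are not stored at all:

import numpy as np
import zarr

# Create an array whose fill value is NaN and tell Zarr not to store
# chunks that are uniformly equal to that fill value (new in Zarr 2.11).
z = zarr.open(
    "example.zarr", mode="w",
    shape=(4, 1000, 1000), chunks=(1, 1000, 1000),
    dtype="f8", fill_value=float("nan"),
    write_empty_chunks=False,
)
z[0] = np.random.rand(1000, 1000)  # only this chunk gets a file on disk
# z[1:] stays all-NaN, so no chunk files are written for those slices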

To reproduce:
I have an xcube env with zarr version 2.11.3

  1. I write a cube in which I know whole chunks filled with NaNs exist.
  2. I make a list of all files in the zarr directory.
  3. I apply xcube prune to the newly created zarr.
  4. I make a list of the zarr directory again and compare it to the previous list. The list after pruning is much shorter than the initial one, which shows that xcube does not yet actively use zarr's enhancement (see the sketch below).
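
A rough sketch of these steps (the dataset, variable names and paths are made up, and the exact xcube prune invocation may differ):

import os
import subprocess

import numpy as np
import xarray as xr

# 1. Write a cube in which whole chunks are known to consist only of NaNs.
data = np.full((4, 180, 360), np.nan)
data[0] = 1.0  # only the first time step carries real values
cube = xr.Dataset({"var": (("time", "lat", "lon"), data)}).chunk({"time": 1})
cube.to_zarr("cube.zarr", mode="w")

# 2. List all files in the zarr directory.
before = sorted(os.path.join(r, f) for r, _, fs in os.walk("cube.zarr") for f in fs)

# 3. Apply xcube prune to the newly created zarr (CLI options may differ).
subprocess.run(["xcube", "prune", "cube.zarr"], check=True)

# 4. List the files again and compare: the second list is much shorter, because
#    the all-NaN chunks were written by to_zarr() and only removed by prune.
after = sorted(os.path.join(r, f) for r, _, fs in os.walk("cube.zarr") for f in fs)
print(len(before), len(after))
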
@elliot-lvs

Hi @AliceBalfanz,
I am experiencing the same problem. I have also noticed that with the new rasterize_features algorithm, writing to disk takes much longer than necessary, and the resulting cube contains NaN chunks that are written to disk unnecessarily.
Adopting the library's solution should fix at least part of the bug.

Is it normal that it takes so long for improvements introduced in the underlying libraries to be adopted?

@forman
Member

forman commented Aug 23, 2022

@AliceBalfanz and @elliot-lvs, can you please explain which function you use to write your datasets to disk?

@elliot-lvs

raster = rasterize_features(...)
raster.to_zarr(path, mode='w')

@forman
Member

forman commented Aug 26, 2022

I have also noticed that with the new rasterize_features algorithm, writing to disk takes much longer than necessary

Can you explain "necessary", please? Do you mean it has been faster before?

Is it normal that it takes so long for improvements introduced in the underlying libraries to be adopted?

There is no special code in xcube that should prevent xarray's to_zarr() method from dropping NaN chunks. We'll have a look ASAP to find out what's going wrong.

@elliot-lvs

"Necessary" just referred to the timings obtained relative to the small amount of data being written (even on an SSD).

I honestly don't know whether it was faster before, but when I don't use rasterize_features, I get the impression that more data is written in the same amount of time. I'm working with a 2-3 MB Zarr: rasterize_features() takes less than a second, while writing the result to the SSD takes more than 2 minutes.

@forman
Member

forman commented Sep 12, 2022

@AliceBalfanz @elliot-lvs

Just checked: the problem is that xarray does not yet exploit the new Zarr feature, introduced in Zarr 2.11, of not writing empty chunks. When I force the related Zarr encoding option write_empty_chunks=False, I get an error:

encodings = {
    var_name: {**var.encoding, "write_empty_chunks": False}
    for var_name, var in dataset.data_vars.items()
}
dataset.to_zarr(path, mode="w", encoding=encodings)
ValueError: unexpected encoding parameters for zarr backend:  ['write_empty_chunks']
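
Not part of this thread, just a possible interim workaround while xarray rejects the option: re-open each written variable with zarr itself, enable write_empty_chunks=False there, and rewrite its data; Zarr then deletes chunk files that are uniformly equal to the fill value. A rough sketch (the helper name and path are made up, and it re-reads whole arrays into memory, so it is only illustrative for small cubes):

import zarr

def drop_all_nan_chunks(zarr_path):
    # Hypothetical helper: re-open every top-level array with the Zarr 2.11+
    # option and rewrite its data in place; chunks that are uniformly equal
    # to the fill value (e.g. NaN) then have their chunk files deleted.
    group = zarr.open_group(zarr_path, mode="r+")
    for name, _ in group.arrays():
        arr = zarr.open_array(zarr_path, path=name, mode="r+",
                              write_empty_chunks=False)
        arr[...] = arr[...]  # inefficient, but triggers Zarr's empty-chunk check

drop_all_nan_chunks("cube.zarr")  # path is illustrative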
