-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zstd compression does not encode content size in header #182
Comments
Here's an illustration of saving a zarr array with tensorstore using zstd compression. python-zstandard is unable to open the chunk unless In [1]: import tensorstore as ts, zstandard as zstd
In [2]: ds = ts.open({
...: 'driver': 'zarr',
...: 'kvstore': {
...: 'driver': 'file',
...: 'path': 'tmp/zarr_zstd_dataset',
...: },
...: 'metadata': {
...: 'compressor': {
...: 'id': 'zstd',
...: 'level': 3,
...: },
...: 'shape': [1024, 1024],
...: 'chunks': [64, 64],
...: 'dtype': '|u1',
...: }
...: }).result()
In [3]: ds[:] = 5
In [4]: with open("tmp/zarr_zstd_dataset/0/0", "rb") as f:
...: src = f.read()
...:
In [5]: zstd.backend_c.frame_content_size(src)
Out[5]: -1
In [6]: zstd.ZstdDecompressor().decompress(src)
---------------------------------------------------------------------------
ZstdError Traceback (most recent call last)
Cell In[6], line 1
----> 1 zstd.ZstdDecompressor().decompress(src)
ZstdError: could not determine content size in frame header
In [7]: zstd.ZstdDecompressor().decompress(src, max_output_size=1024*1024)
Out[7]: b'\x05\x05\x05\x05\x05\x05\x05\x05\x05 [...] \x05\x05\x05 ' For an example of being unable to open the dataset with zarr-python see zarr-developers/zarr-python#2056 |
Thanks for investigating this so thoroughly! We can probably ensure that tensorstore includes the uncompressed size in the header in this case, but in general there could be multiple variable-output-size codecs chained and it is desirable to be able to do streaming encoding. Therefore in addition to that, other implementations should still support decoding without the size in the header. |
The Zstd writer implemented here is based on the Zstd streaming API. When encoding a chunk, set pledged size is not used. This means the frame content size is not encoded in the Zstd frame header.
Some Zstd decoding implementations such as numcodecs.js and Zarr numcodecs rely upon
ZSTD_getFrameContentSize()
to allocate a decompression buffer.https://github.com/zarr-developers/numcodecs/blob/main/numcodecs%2Fzstd.pyx#L182-L184
As written here, chunks encoded by Zstd via Tensorstore will return
ZSTD_CONTENTSIZE_UNKNOWN
fromZSTD_getFrameContentSize()
.xref: google/neuroglancer#625
The text was updated successfully, but these errors were encountered: