Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zstd compression does not encode content size in header #182

Open
mkitti opened this issue Jul 26, 2024 · 2 comments
Open

Zstd compression does not encode content size in header #182

mkitti opened this issue Jul 26, 2024 · 2 comments

Comments

@mkitti
Copy link

mkitti commented Jul 26, 2024

The Zstd writer implemented here is based on the Zstd streaming API. When encoding a chunk, set pledged size is not used. This means the frame content size is not encoded in the Zstd frame header.

Some Zstd decoding implementations such as numcodecs.js and Zarr numcodecs rely upon ZSTD_getFrameContentSize() to allocate a decompression buffer.

https://github.com/zarr-developers/numcodecs/blob/main/numcodecs%2Fzstd.pyx#L182-L184

As written here, chunks encoded by Zstd via Tensorstore will return ZSTD_CONTENTSIZE_UNKNOWN from ZSTD_getFrameContentSize().

xref: google/neuroglancer#625

@mkitti mkitti changed the title Zstd compression does not encode content size in Zstd compression does not encode content size in header Jul 26, 2024
@mkitti
Copy link
Author

mkitti commented Jul 26, 2024

Here's an illustration of saving a zarr array with tensorstore using zstd compression. python-zstandard is unable to open the chunk unless max_output_size is provided.

In [1]: import tensorstore as ts, zstandard as zstd

In [2]: ds = ts.open({
   ...:     'driver': 'zarr',
   ...:     'kvstore': {
   ...:         'driver': 'file',
   ...:         'path': 'tmp/zarr_zstd_dataset',
   ...:     },
   ...:     'metadata': {
   ...:         'compressor': {
   ...:             'id': 'zstd',
   ...:             'level': 3,
   ...:         },
   ...:         'shape': [1024, 1024],
   ...:         'chunks': [64, 64],
   ...:         'dtype': '|u1',
   ...:     }
   ...: }).result()

In [3]: ds[:] = 5

In [4]: with open("tmp/zarr_zstd_dataset/0/0", "rb") as f:
   ...:     src = f.read()
   ...: 

In [5]: zstd.backend_c.frame_content_size(src)
Out[5]: -1

In [6]: zstd.ZstdDecompressor().decompress(src)
---------------------------------------------------------------------------
ZstdError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 zstd.ZstdDecompressor().decompress(src)

ZstdError: could not determine content size in frame header

In [7]: zstd.ZstdDecompressor().decompress(src, max_output_size=1024*1024)
Out[7]: b'\x05\x05\x05\x05\x05\x05\x05\x05\x05 [...] \x05\x05\x05 '

For an example of being unable to open the dataset with zarr-python see zarr-developers/zarr-python#2056

@jbms
Copy link
Collaborator

jbms commented Jul 26, 2024

Thanks for investigating this so thoroughly!

We can probably ensure that tensorstore includes the uncompressed size in the header in this case, but in general there could be multiple variable-output-size codecs chained and it is desirable to be able to do streaming encoding.

Therefore in addition to that, other implementations should still support decoding without the size in the header.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants