Kerchunk and Zarr V3 #235

Open
jhamman opened this issue Oct 18, 2022 · 3 comments
Comments

@jhamman
Contributor

jhamman commented Oct 18, 2022

The Zarr V3 spec is now undergoing public review and testing. This issue raises the question of how Kerchunk should integrate with the new spec.

Key changes in the V3 spec that are relevant particularly to Kerchunk (zarr-developers/zarr-specs#149):

  1. change to chunk and metadata key names
  2. introduction of storage transformer extensions
  3. likely introduction of a Sharding Storage Transformer extension (see "Review of the ZEP2 spec - Sharding storage transformer", zarr-developers/zarr-specs#152)

Questions:

  1. Has any work been done to produce kerchunk references that align with the v3 storage key conventions?
  2. Could Kerchunk be thought of as a storage transformer in v3?
  3. The Sharding proposal (linked above) includes some references to putting shards in hdf5 files. Could Kerchunk extend that spec?
@martindurant
Member

Some of this I'll have to think about, but some things I can answer immediately.

  • Kerchunk could produce v3 reference sets right now, and indeed convert v2<->v3 no problem, since it's only a rearrangement of paths. I don't think this would come with any benefit, though. No work has been done.
  • I am not sure kerchunk can be a storage transformer rather than a storage provider. If yes, I don't see why it would be beneficial in itself. There would need to be more done in that transformer to be worth it.
  • Yes, there is a thought to providing shards via kerchunk (concatenating files #134 and preffs) in a manner similar to but independent of the sharding spec.
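
To illustrate the "rearrangement of paths" in the first point, here is a minimal sketch that rewrites only the chunk keys of a v2-style reference set. It assumes the v2 keys use the `.` dimension separator (`temp/0.1`) and that the target is the `c/0/1` default chunk key layout of the v3 core spec; the key layout in the draft under review may differ. The function names are hypothetical and not part of kerchunk. Metadata entries are passed through unchanged, although a real conversion would also need to rewrite them, and zero-dimensional arrays are not handled.

```python
# Names of v2 metadata documents; these are not chunk keys.
V2_META_NAMES = {".zgroup", ".zarray", ".zattrs", ".zmetadata"}


def v2_chunk_key_to_v3(key: str, separator: str = ".") -> str:
    """Map a v2 chunk key such as 'temp/0.1' to 'temp/c/0/1' (assumed v3 layout)."""
    prefix, _, name = key.rpartition("/")
    indices = name.split(separator)
    parts = ([prefix] if prefix else []) + ["c"] + indices
    return "/".join(parts)


def remap_chunk_refs(refs: dict) -> dict:
    """Return a new reference mapping with chunk keys renamed.

    Values (inline data or [url, offset, length] triples) are copied as-is,
    so the original files are never touched.
    """
    out = {}
    for key, value in refs.items():
        name = key.rpartition("/")[2]
        if name in V2_META_NAMES:
            out[key] = value  # metadata conversion is out of scope for this sketch
        else:
            out[v2_chunk_key_to_v3(key)] = value
    return out


refs_v2 = {
    ".zgroup": '{"zarr_format": 2}',
    "temp/.zarray": "{}",  # placeholder metadata
    "temp/0.1": ["s3://example-bucket/file.nc", 100, 50],  # hypothetical URL
}
refs_v3_like = remap_chunk_refs(refs_v2)
```

Only the keys change here; the byte ranges stay the same, which is why the conversion is cheap in either direction.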

I also want to mention that kerchunk should be useful for more than just zarr, so I will tend to favour things being coded in the storage layer rather than zarr-specific extensions. For example, reordering and selecting parquet files without touching the originals is something that kerchunk can do now. If you wanted full tabular iceberg compatibility using kerchunk/referenceFS, one could implement that now without too much trouble.

Here is the simplest non-zarr idea for CSVs: #66 (and, more generally, random access of delimited/block compressed data).
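
As a rough illustration of random access into delimited data (not necessarily what #66 proposes), the sketch below indexes an uncompressed CSV into blocks of rows and records each block as a `[url, offset, length]` triple, the same shape used for kerchunk references. The function name, the block naming, and the `data.csv` URL are assumptions for the example, and the code assumes no newlines inside quoted fields.

```python
import io


def index_csv_blocks(data: bytes, rows_per_block: int, url: str = "data.csv") -> dict:
    """Return {name: [url, offset, length]} for the header and each block of rows."""
    stream = io.BytesIO(data)
    header = stream.readline()
    refs = {"header": [url, 0, len(header)]}

    block, start, rows = 0, stream.tell(), 0
    while True:
        line = stream.readline()
        if not line:
            break
        rows += 1
        if rows == rows_per_block:
            end = stream.tell()
            refs[f"block-{block}"] = [url, start, end - start]
            block, start, rows = block + 1, end, 0

    # Final partial block, if any rows remain.
    if rows:
        end = stream.tell()
        refs[f"block-{block}"] = [url, start, end - start]
    return refs


csv_bytes = b"a,b\n1,2\n3,4\n5,6\n"
csv_refs = index_csv_blocks(csv_bytes, rows_per_block=2)
```

A reader holding such an index could fetch the header plus any single block by byte range, without scanning the rest of the file.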

@mkitti
Contributor

mkitti commented Jul 22, 2024

An application for this would be backporting Zarr v3 shards for availability via Zarr v2.

@martindurant
Member

? I thought one of the main reasons for having a V3 at all was so that we could have new things like sharding?

That is presumably why my working variable-chunking implementation for v2 was not given consideration.


3 participants