Add Sharding v1.0 Spec & Storage Transformers to Zarr v3.0 #134

Conversation

jstriebel
Member

This PR adds a specification for the sharding protocol, as discussed in issue #127 and previously explored in Python implementations [1]. From those discussions the presented sharding design emerged as a reasonable candidate that we hope to include with zarr v3. We (scalable minds) will also provide a PR based on this spec for sharding support in zarr-python v3, once those changes are incorporated into the zarr spec. This work is funded by CZI through the EOSS program.

This PR adds the sharding & storage transformer concepts to the zarr v3 specs.
The following parts are copied from the spec in this PR:

Sharding

This specification defines an implementation of the Zarr
storage transformer protocol for sharding.

Sharding co-locates multiple chunks within a storage object, bundling them in shards.

Motivation

In many cases it becomes inefficient or impractical to store a large number of chunks as
single files or objects due to design constraints of the underlying storage,
for example the file block size and maximum inode count of typical file systems.

Increasing the chunk size works only up to a certain point, since chunks need to stay small
to meet read-efficiency requirements, for example when streaming data to browser-based visualization software.

Therefore, chunks may need to be smaller than the minimum size of one storage key.
In those cases, objects must be stored at a coarser granularity than chunks are read at.
Sharding solves this by allowing multiple chunks to be stored in one storage key, which is called a shard:
[Figure: sharding diagram, multiple chunks bundled into one shard per storage key]
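
To make the idea concrete, here is a minimal sketch (not part of the spec text; the chunks-per-shard value and the helper function are purely illustrative) of how a chunk coordinate could map to a shard and to a position within that shard:

    # Minimal sketch of the chunk-to-shard mapping idea; the configuration and
    # function names are illustrative only, the normative details are in the
    # spec diff of this PR.
    chunks_per_shard = (2, 2)  # hypothetical configuration

    def locate(chunk_coord):
        """Return (shard_coord, position_within_shard) for a chunk coordinate."""
        shard = tuple(c // s for c, s in zip(chunk_coord, chunks_per_shard))
        within = tuple(c % s for c, s in zip(chunk_coord, chunks_per_shard))
        return shard, within

    print(locate((5, 3)))  # ((2, 1), (1, 1)): chunk (5, 3) lives in shard (2, 1)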

Storage transformer

A Zarr storage transformer allows the zarr-compatible data to be changed before it is stored.
The stored, transformed data is restored to its original state whenever data is requested
by the Array. Storage transformers can be configured per array via the
storage_transformers name in the array metadata.
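
As a rough illustration (all key names and the URL below are placeholders I made up, not the normative metadata layout, which is defined in the diff of this PR), a per-array configuration could look something like this:

    # Hypothetical illustration of per-array storage transformer configuration.
    # Key names and the extension URL are placeholders, not the normative spec.
    array_metadata = {
        "shape": [10000, 10000],
        "chunk_grid": {"type": "regular", "chunk_shape": [100, 100]},
        "storage_transformers": [
            {
                "extension": "https://example.org/zarr/sharding/1.0",  # placeholder URL
                "type": "indexed",
                "configuration": {"chunks_per_shard": [10, 10]},
            }
        ],
    }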

A storage transformer serves the same abstract store interface as the store.
However, it should not persistently store any information necessary to restore the original data,
but instead propagate it to the next storage transformer or the final store.
From the perspective of an Array or a preceding transformer, both store and storage transformer follow the same
protocol and can be interchanged with respect to that protocol. Their behaviour can still differ,
e.g. requests may be cached or the form of the underlying data may change.
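
A minimal sketch of that idea (assuming a store interface with get/set methods; this is illustrative, not the spec's normative interface definition):

    # Illustrative sketch: a storage transformer exposes the same get/set
    # interface as a store and forwards (possibly transformed) requests to the
    # next layer, without persisting anything itself.
    class PassThroughTransformer:
        def __init__(self, inner):
            self.inner = inner          # the next transformer or the final store

        def get(self, key):
            return self.inner.get(key)  # a real transformer may translate keys/values here

        def set(self, key, value):
            self.inner.set(key, value)

    # Transformers can be stacked; the Array only talks to the outermost one:
    # store = PassThroughTransformer(PassThroughTransformer(base_store))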

Storage Transformers may be stacked to combine different functionalities:

    graph LR
      Array --> t1
      subgraph stack [Storage transformers]
        t1[Transformer 1] --> t2[...] --> t3[Transformer N]
      end
      t3 --> Store
Sequence diagram with storage transformer:
  sequenceDiagram
    actor User
    participant Codec as Codec (& Filters)
    participant Array
    participant ST as Storage Transformers
    participant Store
  
    User->>Array: a[10:20]
    Array->>Array: index_to_chunks
    Array->>ST: get(…)
    loop for all transformers
      ST->>ST: get(…)
    end
    ST->>Store: get(…)
    Store-->>ST: bytes
    loop for all transformers
      ST-->>ST: return bytes
    end
    ST-->>Array: bytes
    Array->>Codec: decode(bytes)
    Codec-->>Array: decoded_bytes
    loop for all filters
      Array->>Codec: decode(bytes)
      Codec-->>Array: decoded_bytes
    end
    Array->>Array: interpret_as_array
    Array-->>User: data

The spec contains many more details; please see the diff of this PR for a full overview.
To see the rendered spec in an auto-reloading server, I ran

    ## optionally set up a venv
    # python3 -m venv .venv
    # . .venv/bin/activate
    pip install -r docs/requirements
    pip install sphinx-autobuild
    sphinx-autobuild docs docs/_build/html

We'd be very happy to get feedback on this PR, high-level as well as concrete suggestions and questions.
Two high-level questions that came up directly during the design:

  • Should we introduce the concept of Storage Transformers here? This concept seems to be a great point for plugging in other future extensions, and quite the right abstraction for sharding. An alternative would be to incorporate sharding as one specific mechanism without adding the whole transformer concept. Implementation-wise this would behave just like one specific transformer, but the logic could also be implemented closer to the Array or Store instead. [2]
  • Should storage transformers be formulated as extensions? Then the storage_transformer key for each transformer would instead be called extension, and the extension would presumably need to be listed in the extensions key of the array metadata. This would be helpful if optional storage transformers were added. One such example might be a caching storage transformer (not actually transforming anything, just caching).

@alimanfoo In the community call two weeks ago there was the suggestion that you might be a great candidate for the first main review, since there's quite some overlap with issue #82. What do you think?

Thanks everyone for the helpful discussions so far! We hope to get the design finalized via this spec and look forward to getting this into zarr production 🎉

cc @joshmoore @d-v-b @jbms @jakirkham @MSanKeys963 @rabernat @shoyer @d70-t @chris-allen @mkitti @FrancescAlted @martindurant @DennisHeimbigner @constantinpape

Footnotes

  1. Previous explorations regarding sharding with the Python implementations:
    https://github.com/zarr-developers/zarr-python/issues/877: original issue for the Python implementation, containing many discussions around the sharding protocol design
    https://github.com/zarr-developers/zarr-python/issues/876: initial prototype PR
    https://github.com/zarr-developers/zarr-python/issues/947: another prototype where the translation logic happens purely in the Array class
    https://github.com/alimanfoo/zarrita/pull/40: PR against the clean-room implementation zarrita, in the spirit of prototype I but simplified as much as possible

  2. Sidenote: I think storage transformers serve a similar purpose on the store side as filters do on the encoding side: they provide a clear interface for adding store-agnostic functionality, just as filters do for encodings. See also the collapsed sequence diagram with storage transformer above.

@jbms
Contributor

jbms commented Feb 24, 2022

Should we introduce the concept of Storage Transformers here? This concept seems to be a great point for plugging in other future extensions, and quite the right abstraction for sharding. An alternative would be to incorporate sharding as one specific mechanism without adding the whole transformer concept. Implementation-wise this would behave just like one specific transformer, but the logic could also be implemented closer to the Array or Store instead.

One thing that I think would really help address this question is if we consider what other storage transformers might be useful, and how they would stack before and after sharding.

Two ideas I've heard mentioned are:

  1. content-addressable storage (CAS): the store adds an additional level of indirection and deduplicates based on content hash.
  2. independent of CAS, something that adds chunksums in order to detect data corruption

It seems to me that 1 (content-addressable storage) could be stacked after sharding, though since it would deduplicate entire shards it would not be particularly useful (and actually verifying the checksum of partial reads would require a more complicated tree-of-checksums approach). It is less clear how stacking it before sharding would work; probably the sharding transformer would need to be modified to just pass through the hash keys (so the shards would store only the keys of each chunk, not the chunks themselves), but then it would fail to accomplish the original objective of sharding, which is to reduce the number of keys in the underlying store.

It seems that (2) could be stacked before or after sharding, though if stacked after sharding it would need a more elaborate tree-of-checksums approach in order to allow verifying the checksums of partial reads. If stacked before sharding, it could also be implemented as a codec.
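
For example, a chunksum layer applied before sharding could be as simple as appending a checksum to each chunk's bytes on write and verifying it on read. A hedged sketch (the framing with a trailing CRC32 is my own choice, not something proposed in this PR):

    import struct
    import zlib

    # Append a CRC32 to each chunk on encode, verify and strip it on decode.
    def add_checksum(chunk_bytes: bytes) -> bytes:
        return chunk_bytes + struct.pack("<I", zlib.crc32(chunk_bytes))

    def check_and_strip(stored: bytes) -> bytes:
        payload, crc = stored[:-4], struct.unpack("<I", stored[-4:])[0]
        if zlib.crc32(payload) != crc:
            raise ValueError("chunk checksum mismatch, data may be corrupted")
        return payload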

The other point, not directly relevant to the spec, is that it would be useful to expose in a convenient way on the array class itself something that indicates the optimal chunk shape for writing (i.e. the shape of each shard); previously users would just use the chunk shape for this purpose, but with sharding a separate attribute is needed.
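
A hypothetical sketch of such an attribute (all names here are made up for illustration; nothing like this is defined in the spec or any implementation yet):

    from dataclasses import dataclass
    from typing import Optional, Tuple

    # Hypothetical "optimal write shape" attribute; purely illustrative.
    @dataclass
    class ArrayInfo:
        chunk_shape: Tuple[int, ...]            # shape of a single (inner) chunk
        shard_shape: Optional[Tuple[int, ...]]  # full shard shape in array elements, if sharded

        @property
        def preferred_write_shape(self) -> Tuple[int, ...]:
            """Smallest region that can be written without read-modify-write of a shard."""
            return self.shard_shape or self.chunk_shape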

@jstriebel
Member Author

One thing that I think would really help address this question is if we consider what other storage transformers might be useful, and how they would stack before and after sharding.

Agreed! That's very important for assessing whether storage transformers are a helpful concept here. We might not need to "fix" compatibility of sharding with possible future storage transformers yet, but it's very useful to explore.

My understanding of the CAS proposal is that the hash keys are stored at the original data location, and that the value entries are stored under a different prefix, e.g. content/…. With 1 (CAS) stacked after sharding, my understanding is that the keys are then grouped per shard, which solves half the problem in this case, while content keys are still per chunk. Deduplicating chunks and combining chunks into regular, grid-sized shards seem to me to be contrary concepts at first glance, but the content keys could be combined with a different strategy (e.g. some tree).

Besides the two possible future proposals you mentioned, I found two storage transformers already in the current v2 zarr-python implementation, which are implemented as stores wrapping another store (basically the same thing as a storage transformer, without the formalization); see the sketch after the list below:

  • ConsolidatedMetadataStore (https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ConsolidatedMetadataStore):

    A layer over other storage, where the metadata has been consolidated into
    a single key.
    The purpose of this class, is to be able to get all of the metadata for
    a given array in a single read operation from the underlying storage.
    See zarr.convenience.consolidate_metadata for how to create this
    single metadata key.
    This class loads from the one key, and stores the data in a dict, so that
    accessing the keys no longer requires operations on the backend store.
    This class is read-only, and attempts to change the array metadata will
    fail, but changing the data is possible. If the backend storage is changed
    directly, then the metadata stored here could become obsolete, and
    zarr.convenience.consolidate_metadata should be called again and the class
    re-invoked. The use case is for write once, read many times.

  • LRUStoreCache:

    Storage class that implements a least-recently-used (LRU) cache layer over
    some other store. Intended primarily for use with stores that can be slow to
    access, e.g., remote stores that require network communication to store and
    retrieve data.

Also, compatibility layers might be implemented at the storage transformer level, to be able to transform metadata directly.

What I already quite like about the storage transformer stack concept is that it makes it easy to wrap your head around and communicate the exact behavior (e.g. what X before Y vs. Y before X in the stack means), since it brings a clear interface and abstraction. However, I think it's also perfectly fine to "just" introduce sharding as a core mechanism, without the additional transformer concept.

@alimanfoo
Member

Hi @jstriebel, great to see this and very happy to review.

One small thought, in approaching this it might be useful to distinguish between storage transformers, which do actually make some change to how the data is organised within the base store, and other types of storage layer such as caching, which are not changing how data are organised but offer some other advantage such as performance.

The obvious reason for this distinction is that if a true storage transformer has been used, then a zarr implementation that's trying to read the data needs to know about it, and so it needs to be included in the zarr metadata so the implementation can discover it. That is equally true regardless of which zarr implementation is being used, and what type of base storage is being used.

Other storage layers such as caches are configured at runtime depending on the application and/or the zarr implementation and/or the base store type, and so should not be included in zarr metadata.

@shoyer

shoyer commented Feb 25, 2022

I wonder about the efficiency of the indexed shard format. Depending on shard/chunk size, reading the index could potentially be moderately expensive. In the worst case (without caching), there would be twice the number of IO operations required to read a chunk.

The alternative would be some form of higher level index, e.g., at the level of an entire array. For moderate numbers of chunks (e.g., <1e7) this might be desirable.

Do we have a sense of what shard/chunk sizes would be desired in practice? That might help inform these design decisions.

@jbms
Contributor

jbms commented Feb 25, 2022

The desired shard size probably depends on a lot of factors, such as the relative cost/overhead per key/value pair in the underlying store, the throughput of the underlying store, and the desired write pattern. But I think for GCS or similar cloud storage I would probably target 1 GB shards normally, perhaps containing several thousand chunks. For a very large array and a fast network one could possibly go larger, ~10 GB, but then the shard index starts to become rather large.
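
To put rough numbers on the index overhead (assuming the proposed indexed format stores a 64-bit offset and a 64-bit length per chunk, i.e. 16 bytes per index entry):

    # Back-of-the-envelope shard index size, assuming 16 bytes per chunk entry
    # (uint64 offset + uint64 length) as in the proposed indexed shard format.
    BYTES_PER_ENTRY = 16

    def index_size(chunks_per_shard: int) -> int:
        return chunks_per_shard * BYTES_PER_ENTRY

    print(index_size(4_096))   # 65536  -> ~64 KiB extra read per uncached shard access
    print(index_size(40_000))  # 640000 -> ~625 KiB index for a very large shard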

A single shard index for the entire array means it is challenging to write the array in parallel without additional coordination, which I think is a major drawback.

Additionally, it seems like in the uncached case it is actually worse than a separate index per shard -- you would still be reading the entire shard index. However it is true that if you first read a complete index of all chunks and cache it, then it would be more efficient to be able to read that index with a single read rather than one read per shard.

@shoyer

shoyer commented Feb 25, 2022

But I think for GCS or similar cloud storage I would probably target 1GB shards normally, perhaps containing several thousand chunks.

This sounds about right to me.

A single shard index for the entire array means it is challenging to write the array in parallel without additional coordination, which I think is a major drawback.

I agree, I was imagining this as more similar to consolidated metadata -- something that is written after the fact.

In any case, I guess this is something that could be layered on later, if desired.

@martindurant
Member

Interesting connection here to what happens in kerchunk: you can already create a virtual dataset over many zarr datasets, where each of those may live in different buckets or even different backing stores. This would correspond to the idea earlier in this thread of moving sharding into the storage layer, but it also comes with the benefit of consolidating the metadata. I'm not saying it's a replacement, but perhaps a complementary way of achieving something similar, one that already works in v2.

@jstriebel
Member Author

One small thought, in approaching this it might be useful to distinguish between storage transformers, which do actually make some change to how the data is organised within the base store, and other types of storage layer such as caching, which are not changing how data are organised but offer some other advantage such as performance.

@alimanfoo Good point, absolutely agree! I think runtime-configured storage transformers could mostly be appended or prepended to the stack. Do you already have further feedback regarding the specs? I'm on vacation for a week, but happy to apply any feedback afterwards.

Interesting connection here to what happens in kerchunk: you can already create a virtual dataset over many zarr datasets, where each of those may live in different buckets or even different backing stores. This would correspond to the idea earlier in this thread of moving sharding into the storage layer, but it also comes with the benefit of consolidating the metadata. I'm not saying it's a replacement, but perhaps a complementary way of achieving something similar, one that already works in v2.

@martindurant Agreed! I think there's value in having this in the zarr standard, but it's interesting to see how this would be done in kerchunk. This would require configuring each shard in kerchunk, right?

@martindurant
Member

Interesting connection here to what happens in kerchunk: you can already create a virtual dataset over many zarr datasets, where each of those may live in different buckets or even different backing stores. This would correspond to the idea earlier in this thread of moving sharding into the storage layer, but it also comes with the benefit of consolidating the metadata. I'm not saying it's a replacement, but perhaps a complementary way of achieving something similar, one that already works in v2.

@martindurant Agreed! I think there's value in having this in the zarr standard, but it's interesting to see how this would be done in kerchunk. This would require configuring each shard in kerchunk, right?

Each chunk of each shard is referenced by the collected reference set, so kerchunk scans the inputs just once and collates this information. "Configuring" sounds rather heavy; I'm not sure what you mean - it's pretty simple.
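
For context, a kerchunk/fsspec reference set is essentially a mapping from chunk keys to byte ranges in the underlying files, roughly like this (the bucket, file names, and offsets below are made up):

    # Rough sketch of a kerchunk-style reference set: each chunk key points at a
    # (url, offset, length) byte range in some backing object.
    references = {
        "version": 1,
        "refs": {
            "data/0.0": ["s3://example-bucket/shard-0.bin", 0, 40000],
            "data/0.1": ["s3://example-bucket/shard-0.bin", 40000, 40000],
        },
    }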

@jstriebel
Member Author

@alimanfoo It would be great to get your review on this proposal. Could you maybe give an estimate of when you'll have time for it?

@rabernat
Contributor

rabernat commented Apr 8, 2022

@jstriebel - thanks a lot for all of your hard work and patience. 🙏

This PR has revealed that our governance still lacks the ability to make major decisions in a timely way. I recognize that this must be frustrating for a contributor such as yourself who has put a lot of work into Zarr. The situation reflects the fact that we are still in transition away from a BDFL governance model towards a more community-based one.

I am very optimistic that the ZEP process (currently being refined in zarr-developers/governance#16) will give us a good framework to decide on this PR. I hope we will conclude the ZEP discussion and accept it by the end of next week. After that, we can reframe this PR as a ZEP, with your top-level comment basically serving as the text of the ZEP. If my suggestion in zarr-developers/governance#16 (review) is accepted, approval would require:

  • Unanimous approval of the Zarr Steering Committee
  • Majority approval of the Zarr Implementations Committee. Approval by an implementation indicates that it plans to implement the new spec features. Not all implementations must implement all features, but a majority is required.
  • No vetos from the Zarr Implementations Committee. Each implementation has the right to veto a ZEP if it would cause severe problems for that implementation.

I will personally commit to doing whatever I can over the next weeks to ensure that this process evolves in a timely way. Because this work touches the spec so deeply, you have inadvertently become a 🐹 (guinea pig, not hamster!) in our nascent spec governance process. I know that was probably not what you signed up for. But I am very optimistic that this effort will bootstrap a much more efficient and sustainable framework for evolving the Zarr spec. So thank you again!

@joshmoore
Member

A brief update from the rousing community call yesterday evening (my time): the tendency seems to be towards simplifying the v3 spec and getting it out, and then layering this proposal on top in the least breaking way possible.

@joshmoore mentioned this pull request Apr 29, 2022
@joshmoore merged commit cae3ad3 into zarr-developers:core-protocol-v3.0-dev May 6, 2022
@joshmoore
Member

As discussed during recent community meetings and the steering council, this proposal has been merged into the dev branch as a common basis for discussion. The final list of features to be included in v3.0 is still to be decided.
