Allow stores to be smarter when accessing multiple chunks #547

Open
bilts opened this issue Mar 22, 2020 · 1 comment

bilts commented Mar 22, 2020

I'd like to help with this, but I would appreciate input on the approach to increase the likelihood that it gets merged.

The gist is that I'd like some way to allow (but not require) stores to be smarter about batching or parallelizing chunk access. Stores may know things about how chunks are arranged that zarr doesn't, and may be able to optimize access better than the zarr library can.

Concretely, #535 will allow Zarr to read remote NetCDF4 / HDF5 files using Range GET operations. In the wild, these files often require many, many chunk reads of (near-)contiguous bytes for common operations. A smarter store could merge contiguous byte ranges and supply multiple ranges in a single request, as the Range header allows.
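To make that concrete, here is a rough sketch of the kind of byte-range coalescing a store could do. Everything in it is illustrative rather than part of #535: the merge_ranges / fetch_ranges helpers and the 1 KiB gap threshold are made up, and parsing the multipart/byteranges response is omitted.

```python
import requests  # assumed HTTP client; anything that can set headers works


def merge_ranges(ranges, max_gap=1024):
    """Merge (offset, length) pairs separated by at most max_gap bytes."""
    merged = []
    for offset, length in sorted(ranges):
        start, end = offset, offset + length - 1
        if merged and start - merged[-1][1] <= max_gap + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def fetch_ranges(url, ranges):
    """Issue one GET whose Range header covers all merged spans."""
    spans = merge_ranges(ranges)
    header = "bytes=" + ",".join(f"{start}-{end}" for start, end in spans)
    # The server responds with multipart/byteranges; parsing is omitted here.
    return requests.get(url, headers={"Range": header})
```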

The core issue is that Zarr's current implementation asks a store for chunks one key at a time, synchronously. I see three possible ways to fix this, each with pros and cons, and it's possible I'm missing others:

Option 1: Allow users to request concurrency

PR #534 implements this. A sufficiently smart store could collect concurrent requests before sending them out and merge them as appropriate (see the sketch after the cons list below).

Pros:

  • Already largely implemented
  • Low impact on current interfaces
  • Broad benefits to existing stores with no additional changes to the stores themselves

Cons:

  • Incurs the overhead and limitations of threads. Imagine an array split into 10,000 chunks: that becomes 10,000 async operations that need to be scattered only to be immediately gathered in a thread-safe way.
  • Requires users to ask for concurrency when there is no downside to the store always providing it
  • Requires the backing store to do a fair amount of complex thread-aware work, possibly waiting a very brief time before sending out any request so that it gathers concurrent requests appropriately
  • Loses potentially helpful sequencing information, since chunk keys arrive in a random order
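To illustrate what those last two cons mean in practice, here is a rough sketch of the gathering pattern a store would need under option 1. The fetch_many(keys) -> dict callable and the 5 ms collection window are assumptions for illustration, not an existing zarr or store interface.

```python
import threading
from concurrent.futures import Future


class GatheringStore:
    """Sketch: gather concurrent single-key reads into one batched fetch."""

    def __init__(self, fetch_many, window=0.005):
        self._fetch_many = fetch_many  # assumed: callable(keys) -> {key: bytes}
        self._window = window          # brief wait to collect more requests
        self._pending = {}
        self._lock = threading.Lock()
        self._timer = None

    def __getitem__(self, key):
        fut = Future()
        with self._lock:
            self._pending.setdefault(key, []).append(fut)
            if self._timer is None:
                # First request in a batch starts the collection window.
                self._timer = threading.Timer(self._window, self._flush)
                self._timer.start()
        return fut.result()  # block this worker until the batch is serviced

    def _flush(self):
        with self._lock:
            pending, self._pending = self._pending, {}
            self._timer = None
        results = self._fetch_many(list(pending))  # one merged request
        for key, futures in pending.items():
            for fut in futures:
                fut.set_result(results[key])
```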

Option 2: Allow stores to request concurrency

This is almost the same as option 1, but instead of users saying they want concurrency, the store can specify that it wants it via some property, interface, etc. (one possible shape is sketched below).
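One hypothetical shape for such a flag; the attribute name is invented for illustration, and zarr has no such interface today:

```python
class MyRemoteStore:
    # Hypothetical attribute: the store declares that concurrent chunk access
    # is both safe and beneficial, so zarr could enable it by default.
    supports_concurrent_reads = True

    def __getitem__(self, key):
        raise NotImplementedError("fetch a single chunk, as today")


def wants_concurrency(store):
    # How zarr might consult the flag before parallelizing reads.
    return getattr(store, "supports_concurrent_reads", False)
```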

Pros (vs option 1):

  • Avoids issues when used on stores that are not thread-safe
  • Can be automatic / default

Cons (vs option 1):

  • There may be cases where users would want to be explicit about whether to use concurrency and, if so, which type.

Option 3: Zarr optionally requests a list of keys rather than one at a time

The idea here would be to allow stores to optionally implement a method, say get_items, that accepts a list of keys and returns an iterable of the corresponding values. Stores not implementing this interface would fall back to the current one-key-at-a-time behavior. A sketch of what this could look like follows.
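This is a minimal sketch only: the get_items name comes from this proposal, and the load_chunks dispatcher is an assumption about how zarr could call it, not existing zarr code.

```python
class MultiGetStore(dict):
    """Toy store: a dict for storage plus an optional batched accessor."""

    def get_items(self, keys):
        # A real store could merge near-contiguous byte ranges here and
        # issue one request instead of len(keys) separate requests.
        return {key: self[key] for key in keys}


def load_chunks(store, keys):
    """How zarr could dispatch: batched path when available, else per-key."""
    if hasattr(store, "get_items"):
        return store.get_items(keys)
    return {key: store[key] for key in keys}


# Usage: a plain MutableMapping still works via the per-key fallback.
store = MultiGetStore({"0.0": b"a", "0.1": b"b"})
print(load_chunks(store, ["0.0", "0.1"]))
```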

Pros:

  • Much simpler implementation for both Zarr and stores implementing the method
  • More efficient, requiring no pauses or unhelpful threads
  • Can provide additional efficiency if Zarr guarantees, say, that it will pass keys in ascending order by axis.

Cons:

  • While stores could still just implement MutableMapping, this would introduce an additional interface method that is non-standard
  • May clash with behavior from option 1: what do you do when the user has asked for concurrent execution and the store also accepts a list of keys? Note: this does not clash with option 2.
  • Less benefit to stores without this access need. It would allow those stores an easy way to provide their own concurrency, but it wouldn't be as turnkey as the prior two options.

Option 4: ???

I'm not an expert on Zarr. Am I missing something?

Please let me know if this is worth pursuing and how you'd like me to proceed.

alimanfoo (Member) commented

Just to say thanks for raising this; it's an important discussion that relates to others also happening elsewhere (e.g., #536). Will follow up with more comments ASAP.
