Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Concurrent block reads and writes #534

Closed
wants to merge 2 commits into from

Conversation

llllllllll
Copy link
Contributor

This commit adds support for optional concurrent executors to allow concurrently
reading or writing blocks to the underlying store. For stores where the IO is
expensive, for example an S3Map, allowing concurrent reads can be a massive
performance improvement.

The API is designed around Executor objects to allow safe composition. An
executor may be threaded to all of the places where concurrency is desired, both
inside of Zarr itself and in the user's code, to ensure that a shared thread
pool is used and no more than the desired number of threads are launched at the
same time. Executors would allow users to safely read or write blocks
concurrently from different Zarr Array objects at the same time, without
worrying about accidentally spawning too many threads.

I am opening this PR early before adding tests or more documentation to get feedback this proposed API, and discuss the implications of allowing concurrent access to blocks in a zarr array. I didn't want to invest too much time before getting your feedback to make sure this is something that the zarr developers are interested in.

This API doesn't necessarily allow users to do something new, because they could add concurrency on top of the existing interfaces, but adding this interface makes it much easier to align your reads and writes on chunk boundaries without needing to know as much about the implementation of Zarr. This is especially true for integer indexers and boolean mask indexers.

One major downside of this API is that it makes it easier for a user to attempt to access a non thread safe store in a multithreaded context. A user may naively use a ThreadPoolExecutor when it is not safe to do so and then get unexpected crashes. This could cause hard to debug (or even notice) data corruption depending on the given store. Another issue is that users may open many duplicate issues about this which adds a maintenance burden for the zarr developers.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

Joe Jevnik added 2 commits February 3, 2020 21:08
This commit adds support for optional concurrent executors to allow concurrently
reading or writing blocks to the underlying store. For stores where the IO is
expensive, for example an S3Map, allowing concurrent reads can be a massive
performance improvement.

The API is designed around Executor objects to allow safe composition. An
executor may be threaded to all of the places where concurrency is desired, both
inside of Zarr itself and in the user's code, to ensure that a shared thread
pool is used and no more than the desired number of threads are launched at the
same time. Executors would allow users to safely read or write blocks
concurrently from different Zarr Array objects at the same time, without
worrying about accidentally spawning too many threads.
@bilts
Copy link

bilts commented Mar 22, 2020

I left some comments in #547 about a use case that may be useful to consider when implementing this. The gist is that it'd be very nice to allow stores to make their own concurrency or other optimizations, but that requires getting them more information.

I think my specific need is implementable with what you have, but it may not be ideal in many ways.

@rabernat rabernat mentioned this pull request Mar 30, 2020
@jakirkham jakirkham mentioned this pull request Aug 15, 2022
@jhamman
Copy link
Member

jhamman commented Dec 7, 2023

This has obviously gone stale (appreciate the effort here @llllllllll) but I would point those following it to take a look at #1583 where we layout a design for concurrent actions across the library.

@jhamman jhamman closed this Dec 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants