Indicate maturity of implementations? #66

Open
jeromekelleher opened this issue Dec 12, 2023 · 7 comments

Comments

@jeromekelleher
Member

I've spent the last few days going through the various Zarr implementations trying to create a simple read-oriented benchmark, and have had a pretty frustrating experience. Most of the implementations seem to be in a pretty early proof-of-concept phase, and I think it would be helpful to indicate how feature-complete the implementations are, and whether they have usable documentation etc.

I've almost got a Java implementation going based on JZarr, but it seems to lack any form of support for reading in an efficient chunk-aware manner, and the API support for getting at the ND array values is pretty limited.

(This is probably not the forum, but some advice on the best way to make such a benchmark without using zarr-python, or advice on where I might ask for such advice, would be much appreciated!)

@d-v-b

d-v-b commented Dec 12, 2023

@jeromekelleher sorry to hear about these issues. Unfortunately, because the different Zarr implementations are independent entities, I don't think it's easy to gather exhaustive, accurate information about their feature-completeness in one place (e.g., this repo). We could get part of the way there by periodically checking out the source code for those repos, building it, and running it against some benchmark suite. I think this could be really cool, if someone has the time to set that pipeline up. But even a library that passes these kinds of tests could still have the API issues you are experiencing with JZarr.

What's the goal of your benchmark? You might be interested in a lively discussion over in zarr-python about how to speed up that library: zarr-developers/zarr-python#1479

@d-v-b

d-v-b commented Dec 12, 2023

In particular, it looks like this repo is missing regular test runs, and a place to display the output of the test results, e.g. a docs page.

@jeromekelleher
Member Author

I agree having exhaustive feature completeness scores would be quite a chore, and very hard to keep up to date. What's not obvious from the current list though is that most of these implementations are really just proofs of concept, and not actually intended for other people to build real applications on.

Even just an "intended for production use" tick would be really helpful and would have saved me a lot of time. The page on the website (https://zarr.dev/implementations/) gives the impression that all these implementations are on the same footing as zarr-python, which ultimately isn't helpful because people might randomly try a few implementations and come to the (false!) conclusion that the entire Zarr ecosystem is half-baked.

> What's the goal of your benchmark?

I'm writing a paper about sgkit, which is (essentially) trying to bring the pydata ecosystem to the analysis of genetic variation data. We use Zarr to store the data, which I would like to emphasise is independent of sgkit and Python, as I feel that Zarr provides practical and pragmatic solutions to fundamental problems that large-scale genomics is currently struggling with. To make this point, I want to do a simple benchmark which essentially just reads through a terabyte-scale dataset, doing some very simple calculations on it. The people I most want to reach here tend to be a little Python-sceptical, so I would like to do the benchmark in a language that is not Python.
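
For readers unfamiliar with what "chunk-aware" reading means for a benchmark like this, here is a minimal sketch of the access pattern. It is written in Python only for brevity; the logic is language-neutral and does not use any Zarr library. The names `chunk_slices`, `scan`, `read_block` and `reduce_block` are hypothetical helpers invented for illustration, not part of any implementation discussed in this thread.

```python
from itertools import product
from typing import Callable, Iterator, Sequence, Tuple


def chunk_slices(
    shape: Sequence[int], chunks: Sequence[int]
) -> Iterator[Tuple[slice, ...]]:
    """Yield one tuple of slices per chunk, in row-major chunk order.

    Each tuple selects a region aligned to the chunk grid, so a reader
    that fetches it needs to decode each stored chunk only once.
    """
    if len(shape) != len(chunks):
        raise ValueError("shape and chunks must have the same number of dimensions")
    per_dim = [
        # The last slice along a dimension is clipped to the array bounds.
        [slice(start, min(start + step, size)) for start in range(0, size, step)]
        for size, step in zip(shape, chunks)
    ]
    return product(*per_dim)


def scan(
    read_block: Callable[[Tuple[slice, ...]], object],
    shape: Sequence[int],
    chunks: Sequence[int],
    reduce_block: Callable[[object], float],
) -> float:
    """Read the array one chunk-aligned block at a time and sum a per-block value."""
    total = 0
    for selection in chunk_slices(shape, chunks):
        total += reduce_block(read_block(selection))
    return total
```

Here `read_block` stands in for whatever call a given implementation provides for reading a rectangular region, and `reduce_block` for the "very simple calculation" done on each block.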

@keller-mark

> Even just an "intended for production use" tick

Perhaps a simpler solution would be to add the current version next to the name, with the assumption that implementations below v1.0.0 are less mature. The table could be ordered according to version as well.

@d-v-b

d-v-b commented Feb 15, 2024

> Perhaps a simpler solution would be to add the current version next to the name, with the assumption that implementations below v1.0.0 are less mature. The table could be ordered according to version as well.

Even if this suggested change were made, given the rate of activity on this repo (the last commit was 2 years ago) it would quickly become out of date. I think the bigger problem here is a demographic one -- nobody is actually working to keep this information up to date.

@keller-mark

keller-mark commented Feb 15, 2024

I was referring to the table at https://github.com/zarr-developers/zarr-developers.github.io/blob/main/implementations/index.md, though I agree it would become out of date if done manually. Perhaps a GitHub Actions workflow could be developed.

@keller-mark

A pre-v1 vs post-v1 distinction could also work and would be less prone to becoming out of date.
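
As a rough sketch of how such a label might be derived from a version string, assuming versions follow a `MAJOR.MINOR.PATCH` scheme with an optional leading `v`: the function name `maturity_label` and the example version strings are hypothetical, and a real table would need to get each implementation's version from its own release metadata.

```python
import re


def maturity_label(version: str) -> str:
    """Classify a version string as 'pre-v1' or 'post-v1' by its major number.

    Returns 'unknown' when the string does not start with a number
    (optionally prefixed by 'v'), e.g. a branch name.
    """
    match = re.match(r"v?(\d+)", version.strip())
    if match is None:
        return "unknown"
    return "post-v1" if int(match.group(1)) >= 1 else "pre-v1"


# Made-up version strings, purely for illustration.
for example in ["0.4.2", "v1.0.0", "main"]:
    print(example, "->", maturity_label(example))
```

Because the label only changes when a project crosses v1.0.0, it would need far fewer updates than a full version number in the table.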
