Indicate maturity of implementations? #66

jeromekelleher · 2023-12-12T17:04:29Z

I've spent the last few days going through the various Zarr implementations trying to create a simple read-oriented benchmark, and have had a pretty frustrating experience. Most of the implementations seem to be in a pretty early proof-of-concept phase, and I think it would be helpful to indicate how feature-complete implementations are, and whether they have usable documentation etc.

I've almost got a Java implementation going based on JZarr, but it seems to lack any form of support for reading in an efficient chunk-aware manner, and the API support for getting at the ND array values is pretty limited.

(This is probably not the forum, but some advice on the best way to make such a benchmark without using zarr-python, or on where I might ask for such advice, would be much appreciated!)

d-v-b · 2023-12-12T17:35:44Z

@jeromekelleher sorry to hear about these issues. Unfortunately, because the different Zarr implementations are independent entities, I don't think it's easy to gather exhaustive, accurate information about their feature-completeness in one place (e.g., this repo). We could get part of the way there by periodically checking out the source code for those repos, building the code, and running the code against some benchmark suite. I think this could be really cool, if someone has the time to set that pipeline up. But even a library that passes these kinds of tests could still have the API issues you are experiencing with JZarr.

What's the goal of your benchmark? You might be interested in a lively discussion over in zarr-python about how to speed up that library: zarr-developers/zarr-python#1479

d-v-b · 2023-12-12T17:43:48Z

In particular, it looks like this repo is missing regular test runs, and a place to display the output of the test results, e.g. a docs page.

jeromekelleher · 2023-12-13T09:59:15Z

I agree having exhaustive feature-completeness scores would be quite a chore, and very hard to keep up to date. What's not obvious from the current list, though, is that most of these implementations are really just proofs of concept, and not actually intended for other people to build real applications on.

Even just an "intended for production use" tick would be really helpful and would have saved me a lot of time. The page on the website (https://zarr.dev/implementations/) gives the impression that all these implementations are on the same footing as zarr-python, which ultimately isn't helpful because people might randomly try a few implementations and come away with the (false!) impression that the entire Zarr ecosystem is half-baked.

> What's the goal of your benchmark?

I'm writing a paper about sgkit, which is (essentially) trying to bring the pydata ecosystem to the analysis of genetic variation data. We use Zarr to store the data, which I would like to emphasise is independent of sgkit and Python, as I feel that Zarr provides practical and pragmatic solutions to fundamental problems that large-scale genomics is currently struggling with. To make this point, I want to do a simple benchmark which essentially just reads through a terabyte-scale dataset, doing some very simple calculations on it. The people I most want to reach here tend to be a little Python-sceptical, so I would like to do the benchmark in a language that is not Python.

keller-mark · 2024-02-15T13:57:41Z

> Even just a "intended for production use" tick

Perhaps a simpler solution would be to add the current version next to the name, with the assumption that implementations below v1.0.0 are less-mature. The table could be ordered according to version as well.

d-v-b · 2024-02-15T14:03:08Z

> Perhaps a simpler solution would be to add the current version next to the name, with the assumption that implementations below v1.0.0 are less-mature. The table could be ordered according to version as well.

Even if this suggested change were made, given the rate of activity on this repo (the last commit was 2 years ago) it would quickly become out of date. I think the bigger problem here is a demographic one -- nobody is actually working to keep this information up to date.

keller-mark · 2024-02-15T14:06:09Z

I was referring to the table at https://github.com/zarr-developers/zarr-developers.github.io/blob/main/implementations/index.md. I agree it would become out of date if done manually, though. Perhaps a GitHub Actions workflow could be developed.

keller-mark · 2024-02-15T14:07:13Z

A pre-v1 vs post-v1 label could also work, and would be less prone to becoming out of date.
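As a rough illustration of the pre-v1 vs post-v1 idea, here is a minimal Python sketch of how such a label could be derived from a version string and used to order a table. Everything in it is an assumption for illustration: the function names (`major_version`, `maturity_label`, `version_key`) are hypothetical, the implementation names and version numbers are placeholders rather than real data, and nothing here reflects how the implementations page is actually built.

```python
import re


def major_version(version: str) -> int:
    """Return the leading major number of a string like 'v0.4.2' or '2.16.1'."""
    match = re.match(r"v?(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1))


def maturity_label(version: str) -> str:
    """Coarse label: anything below major version 1 counts as 'pre-v1'."""
    return "post-v1" if major_version(version) >= 1 else "pre-v1"


def version_key(version: str) -> tuple:
    """Numeric sort key; pre-release tags such as 'rc1' are not handled specially."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


# Placeholder entries (hypothetical names and versions, not real data).
implementations = [
    ("impl-a", "v0.4.2"),
    ("impl-b", "2.16.1"),
    ("impl-c", "1.0.0"),
]

# Order the table from highest to lowest version and attach the label.
table = [
    (name, version, maturity_label(version))
    for name, version in sorted(
        implementations, key=lambda item: version_key(item[1]), reverse=True
    )
]

for name, version, label in table:
    print(f"{name:8} {version:8} {label}")
```

The coarse label only changes when an implementation crosses v1.0.0, which is why it would go stale less often than a full version number. It still says nothing about documentation quality or API coverage.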
