
zarr validation and consistency checking #912

Closed
satra opened this issue Dec 16, 2021 · 13 comments
Labels
enhancement (New features or improvements) · help wanted (Issue could use help from someone with familiarity on the topic)


@satra

satra commented Dec 16, 2021

i could not find any explicit reference in the documentation to validating a zarr store, hence opening this issue.

we are supporting zarr nested directory stores as a file type for our data archive and are looking to validate and inspect the structure of the input before upload. some questions have come up that i am posting here:

  • is it sufficient to simply open a zarr store with the zarr reader: if it opens, the store is valid, and if not, it will raise an exception?
  • if someone accidentally puts extra files in the directory, is there a way for us to detect these files as not relevant to the store?
  • does the zarr python library already have a consistency check util? and can it detect which underlying directory elements differ, for example through some form of tree hash?
@jwodder

jwodder commented Dec 16, 2021

I've found the answer to the first question already: an invalid zarr will not always raise an error immediately upon opening. For example, if a chunk file is malformed, this won't be detected until you actually try to use the containing array.
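A minimal sketch of this behavior (zarr v2 API; the paths and the corruption step are contrived for illustration):

```python
import zarr

# Create a small array and write all chunks.
z = zarr.open("example.zarr", mode="w", shape=(100,), chunks=(10,), dtype="i4")
z[:] = 42

# Simulate on-disk corruption of one chunk file.
with open("example.zarr/0", "wb") as f:
    f.write(b"not a valid compressed chunk")

z2 = zarr.open("example.zarr", mode="r")  # opens fine: only metadata is read
data = z2[:]                              # the error only surfaces here
```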

@rabernat
Contributor

Welcome and thanks for the questions.

For example, if a chunk file is malformed, this won't be detected until you actually try to use the containing array.

This is definitely deliberate behavior. Zarr arrays can be petabytes in size with millions of chunks! Individually checking each chunk on opening would not be the right default behavior. Missing chunks are valid in Zarr as well--they represent missing data.
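A minimal sketch of the missing-chunk behavior (zarr v2 API, hypothetical path):

```python
import zarr

# Only the first chunk is ever written; the rest have no file on disk.
z = zarr.open("sparse.zarr", mode="w", shape=(100,), chunks=(10,),
              dtype="f8", fill_value=-1.0)
z[0:10] = 1.0

print(z[50:60])  # reads as fill_value: [-1. -1. -1. ...]
```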

  • is it sufficient to simply open a zarr store with the zarr reader: if it opens, the store is valid, and if not, it will raise an exception?

That depends on what you mean by "valid". If you are asking whether the store can be opened by Zarr, then yes, this is sufficient. If you are asking whether your data have been corrupted, then no. You may consider using array.hexdigest to verify data integrity.
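For example (zarr v2 API; note that this reads every chunk, so it can be expensive on large arrays):

```python
import zarr

z = zarr.open("example.zarr", mode="r")
print(z.hexdigest())          # SHA-1 digest over array metadata and all chunk data
print(z.hexdigest("sha256"))  # any hashlib algorithm name is accepted
```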

  • if someone accidentally puts extra files in the directory, is there a way for us to detect these files as not relevant to the store?

Zarr will just ignore those files. I don't think they'll break anything.

  • does the zarr python library already have a consistency check util? and can it detect which underlying directory elements differ, for example through some form of tree hash?

See comments above about hexdigest. Also ongoing discussions in #877.

@satra
Author

satra commented Dec 16, 2021

@rabernat - thank you.

ah, hexdigest would apply as an overall checksum. we can compute it, but it could potentially be a very expensive operation; still, good to know it exists. we are (at least for our backend on s3) working on a tree-hash scheme to store checksums associated with every file and "directory" in the tree.

if zarr ignores any irrelevant files we may even consider computing and storing the checksums locally or in some zipped checksum store (to prevent inode explosion) and if this works we may propose a tree hash scheme for diff detection. if you already have any conversations on diff detection, would love to know.
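for illustration, a hypothetical sketch of such a tree hash (a Merkle-style scheme, not the actual dandi implementation): a file hashes its bytes, a directory hashes its children's sorted (name, hash) pairs, so any local change propagates up to the root and a diff can be narrowed to a subtree.

```python
import hashlib
from pathlib import Path

def tree_hash(path: Path) -> str:
    h = hashlib.sha256()
    if path.is_file():
        h.update(path.read_bytes())
    else:
        # directory hash covers child names and child hashes, in sorted order
        for child in sorted(path.iterdir(), key=lambda p: p.name):
            h.update(child.name.encode())
            h.update(tree_hash(child).encode())
    return h.hexdigest()

# two stores differ iff their root hashes differ; recursing into children
# with mismatched hashes localizes the difference
print(tree_hash(Path("example.zarr")))
```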

sharding support would be fantastic and would really help optimize the nested directory structure to minimize the number of files. i'm hoping this would be transparent to end users and won't break any xarray-type access when it's implemented. given the datasets we are handling, the current recommended chunk size is 64**3, and that results in about a million files per zarr store.
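for a sense of scale (the 4096**3 volume here is hypothetical):

```python
import math

shape, chunks = (4096,) * 3, (64,) * 3
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
print(n_chunks)  # 262144 chunk files for a single array; a multiscale
                 # pyramid of several such arrays approaches a million files
```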

@rabernat
Contributor

we are (at least for our backend on s3) working on a tree-hash scheme to store checksums associated with every file and "directory" in the tree.

In that case you may be interested in the conversation in #392 (comment) and zarr-developers/zarr-specs#82. IPFS solves this problem very elegantly, and a lot of us are interested in plugging Zarr into IPFS.

@satra
Author

satra commented Dec 16, 2021

In that case you may be interested in the conversation in #392 (comment) and zarr-developers/zarr-specs#82. IPFS solves this problem very elegantly, and a lot of us are interested in plugging Zarr into IPFS.

i love ipfs (at least the concept), but the efficiency is not quite there yet for practical use. yes, ipfs would solve several of these things. we have a bottleneck in that ipfs would require a client running in front of it, and since we run a public dataset program, we have some constraints on how we can support it. we are indeed considering ipfs (or its variants) as part of an institutional infrastructure across universities. i'll check in on those conversations.

@joshmoore
Member

joshmoore commented Dec 16, 2021

Hi @satra,

A few quick answers while we see if anyone else in the community has built anything.

  • is it sufficient to simply open a zarr store with the zarr reader: if it opens, the store is valid, and if not, it will raise an exception?

In terms of the metadata, I'd believe so. zarr-python tends to be fairly lenient about the chunks until access (and missing chunks are considered legitimate).

  • if someone accidentally puts extra files in the directory, is there a way for us to detect these files as not relevant to the store?

The files that are relevant to the store are quite limited. If you exclude everything but ^\d+$ and ^[.]z.*$, you should be alright.
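A sketch of that check for a nested directory store (the two regexes are from the sentence above; everything else is an assumption):

```python
import os
import re

# chunk files are digit-named in a nested store; metadata files are .zarray,
# .zgroup, .zattrs, .zmetadata
RELEVANT = (re.compile(r"^\d+$"), re.compile(r"^[.]z.*$"))

def extraneous_files(root):
    """Yield paths of files that zarr would not treat as part of the store."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not any(p.match(name) for p in RELEVANT):
                yield os.path.join(dirpath, name)

print(list(extraneous_files("example.zarr")))
```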

  • does the zarr python library already have a consistency check util? and can it detect which underlying directory elements differ, for example through some form of tree hash?

Not that I know of. See also #392.

Edit: interesting! I didn't see any of the previous responses when I was responding...

@joshmoore added the "enhancement" and "help wanted" labels Dec 21, 2022
@jakirkham
Member

It looks like this is now being addressed by zarr_checksum. Is that right @satra?

Also worth noting is digest and hexdigest in Zarr.

@satra
Author

satra commented Feb 4, 2023

@jakirkham - indeed, that's a tree hash algorithm we implemented for our needs, and we are using that digest for files in dandi. it's a pure object-based hash with no semantics. we may in the future also want to consider an isomorphic hash, where the bits can change but the content is the same (e.g. moving from uint8 to uint16).
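a hypothetical sketch of what such an isomorphic hash could look like: hash the decoded values in a canonical dtype rather than the stored bytes, so re-encoding the same values leaves the digest unchanged.

```python
import hashlib
import numpy as np
import zarr

def content_hash(arr, canonical_dtype="f8"):
    # hash the values, not the stored bytes, after casting to one dtype
    values = np.asarray(arr[:], dtype=canonical_dtype)
    return hashlib.sha256(values.tobytes()).hexdigest()

a8 = zarr.array(np.arange(10, dtype="u1"))
a16 = zarr.array(np.arange(10, dtype="u2"))
assert content_hash(a8) == content_hash(a16)  # same content, different bits
```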

also, given the sizes of these file trees, we may want to consider ways to optimize both hash checking and diff detection.

i'll close this for now. i had completely forgotten about this issue, so thank you @jakirkham

@satra satra closed this as completed Feb 4, 2023
@d-v-b
Contributor

d-v-b commented Jul 14, 2023

@satra you might be interested in pydantic-zarr. It's designed to normatively represent zarr hierarchies. I think some of the things you are looking for could be built with this library, and it's very small (right now), so you could easily implement the same functionality in your own tooling without adding it as a dependency.
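A sketch of the idea (the GroupSpec.from_zarr usage follows pydantic-zarr's README; treat the exact names as an assumption):

```python
import zarr
from pydantic_zarr import GroupSpec

# capture each hierarchy's structure (groups, arrays, dtypes, attrs) as a spec
spec_a = GroupSpec.from_zarr(zarr.open_group("store_a.zarr", mode="r"))
spec_b = GroupSpec.from_zarr(zarr.open_group("store_b.zarr", mode="r"))

print(spec_a == spec_b)  # metadata-level equality, ignoring chunk data
```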

@satra
Author

satra commented Jul 14, 2023

thanks @d-v-b - looks nice and would be easy to incorporate since we already have a pydantic-based setup for our schema.

a possibility that we are experimenting with in a few projects is to use linkml, which abstracts the metadata model into a yaml definition and then uses generators to create various toolkits (among them pydantic). there are many little issues at this point, but it has effectively collapsed a lot of the patterns we use across projects into a single markup language + generators.

@d-v-b
Contributor

d-v-b commented Jul 14, 2023

is there anything specific you'd need from zarr-python to make this easier? something on my wishlist is a specification for a JSON-serializable representation of a zarr hierarchy, which would make pydantic-zarr merely one implementation of that spec.

@satra
Author

satra commented Oct 1, 2023

@d-v-b - sorry for the very late response. indeed linkml's data model would allow that and i know some of the linkml folks are in conversation with the NWB folks regarding array data type in linkml as well. here is an intro talk covering basics of linkml: https://zenodo.org/record/7778641

i think it would be a good opportunity to turn the zarr spec into a data model that may fit in with many different worlds of use cases.

@melonora

melonora commented Oct 4, 2023

This is indeed actively being worked on within the LinkML team at the moment. Just tagging @rly, who is currently involved in this.
