Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NCZarr does not support reading of many zarr files #2474

Open
dzenanz opened this issue Aug 8, 2022 · 31 comments
Open

NCZarr does not support reading of many zarr files #2474

dzenanz opened this issue Aug 8, 2022 · 31 comments
Assignees
Milestone

Comments

@dzenanz
Copy link
Contributor

dzenanz commented Aug 8, 2022

I am trying to read sample zarr files provided here: https://github.com/ome/ome-ngff-prototypes#data-availability

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tczyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/zyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h https://s3.embl.de/i2k-2020/ngff-example-data/v0.4#mode=zarr,file
ncdump: https://s3.embl.de/i2k-2020/ngff-example-data/v0.4#mode=zarr,file: NetCDF: Malformed URL
dzenan@corista:~/tester$ ncdump -h https://s3.embl.de/i2k-2020/ngff-example-data/v0.4
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <?xml^ version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>ngff-example-data/v0.4.dds</Key><BucketName>i2k-2020</BucketName><Resource>/i2k-2020/ngff-example-data/v0.4.dds</Resource><RequestId>17096CA8F2A3955C</RequestId><HostId>083e6bd6-5f67-4d7a-bf85-13f52b390395</HostId></Error>
ncdump: https://s3.embl.de/i2k-2020/ngff-example-data/v0.4: NetCDF: file not found
dzenan@corista:~/tester$ ncdump -h s3:https://s3.embl.de/i2k-2020/ngff-example-data/v0.4
ncdump: s3:https://s3.embl.de/i2k-2020/ngff-example-data/v0.4: No such file or directory
dzenan@corista:~/tester$ ncdump -h s3://s3.embl.de/i2k-2020/ngff-example-data/v0.4
ncdump: s3://s3.embl.de/i2k-2020/ngff-example-data/v0.4: NetCDF: Attempt to use feature that was not turned on when netCDF was built.

Am I doing something wrong? I was under impression that NCZarr would read most zarr files. What are the offending features in these examples which break NCZarr?

Ubuntu 22.04.1 LTS, GCC 11.2.0, libnetcdf-dev Version: 1:4.8.1-1.

@dzenanz
Copy link
Contributor Author

dzenanz commented Aug 10, 2022

@DennisHeimbigner do you have some insight?

@DennisHeimbigner
Copy link
Collaborator

Sorry for the delay. It appears to be related to the use of '/' as the
dimension separator. I was trying to track down exactly what i happening.

@DennisHeimbigner
Copy link
Collaborator

I was wrong. The problem is this (its complicated).

  1. array "s0" meta-data specifies this:
    • shape is [4,930,1024]
    • _ARRAY_DIMENSIONS are ["c","y","x"]
  2. array "s1" meta-data specifies this:
    • shape is [4,465,512]
    • _ARRAY_DIMENSIONS are ["c","y","x"]

Since both are in the same group (the root group), this is an inconsistency
because one variable says y=930 and the other says y=465.
You might contact the Xarray community to see how this contradiction
should be resolved.

@dzenanz
Copy link
Contributor Author

dzenanz commented Aug 10, 2022

OME NGFF specifies that "Metadata about an image can be found under the "multiscales" key in the group-level metadata." As far as I can tell, multiple arrays are designed to have different dimensions to represent different levels of the image pyramid. @joshmoore Am I interpreting this correctly?

This might be contrary to NCZarr's constraints? Is there a way to reconcile this? Or is NCZarr a bad choice for Zarr-backend for NGFF implementation?

@DennisHeimbigner
Copy link
Collaborator

I do not know anything about the OME standard, but why do you think
it is generating legal Zarr Version 2 files, never mind NCZarr.

@joshmoore
Copy link

Since both are in the same group (the root group), this is an inconsistency
because one variable says y=930 and the other says y=465.
You might contact the Xarray community to see how this contradiction
should be resolved.

@DennisHeimbigner: but that should only be the case for mode=nczarr, no? There's no such restriction in Zarr itself.

As a side note, @dzenanz, ome/ngff#114 is intended to play nicely with xarray and nczarr.

@DennisHeimbigner
Copy link
Collaborator

It is actually an XArray problem since the _ARRAY_DIMENSIONS attribute
is not recognized in Zarr V2 (or has that changed?).
But it should be the case that the names assigned to shapes
should be consistent across all uses of those dimensions names, correct?

@joshmoore
Copy link

It is actually an XArray problem since the _ARRAY_DIMENSIONS attribute
is not recognized in Zarr V2 (or has that changed?)

It's definitely true that Zarr v2 is agnostic to _ARRAY_DIMENIONS, but why should that impact mode=zarr filesets?

But it should be the case that the names assigned to shapes
should be consistent across all uses of those dimensions names, correct?

Not in Zarr.

@DennisHeimbigner
Copy link
Collaborator

I was speaking more about nczarr and Xarray, than Zarr, because Zarr
has no dimension names at all.
Perhaps the interesting question is: what happens when that file
is read by the Xarray package? What values does Xarray assign to
the dimension names?

@dzenanz
Copy link
Contributor Author

dzenanz commented Aug 10, 2022

I assume that the highest resolution is used by default in Xarray.

@DennisHeimbigner
Copy link
Collaborator

In any case, try changing the access url to include the following:
".......#mode=zarr,noxarray,ZZZ"
where ZZZ is either file or zip or s3 depending on what storage format
you are reading.

@DennisHeimbigner
Copy link
Collaborator

I assume that the highest resolution is used by default in Xarray.

I would be surprised if that was so. It would be seen as violating shape constraints,
I would think.

@dzenanz
Copy link
Contributor Author

dzenanz commented Aug 10, 2022

dzenan@corista:~/tester$ /usr/local/bin/ncdump -h file:///media/xeonatorC/Misc/Tester/CBCT_crop1ms.zarr#mode=zarr,noxarray,file
netcdf CBCT_crop1ms {
dimensions:
	_zdim_56 = 56 ;
	_zdim_63 = 63 ;
	_zdim_67 = 67 ;
	_zdim_28 = 28 ;
	_zdim_31 = 31 ;
	_zdim_33 = 33 ;
	_zdim_14 = 14 ;
	_zdim_15 = 15 ;
	_zdim_16 = 16 ;

group: scale0 {
NetCDF: internal library error; Please contact Unidata support
Location: file ; line 1661

@dzenanz
Copy link
Contributor Author

dzenanz commented Aug 10, 2022

@DennisHeimbigner you are right. xarray.open_zarr() yields:

<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*

@DennisHeimbigner
Copy link
Collaborator

I would suggest two changes to the OME format:

  1. make the dimension-name <-> shape mapping be consistent
  2. Change the "direction" attribute from this:
    "direction": [ [ 1.0, 0.0, 0.0 ], [ 0.0, 1.0, 0.0 ], [ 0.0, 0.0, 1.0 ] ]
    to something like this.
    "direction": {"x": [1.0,0.0,0.0], "y": [0.0, 1.0, 0.0], "z": [0.0, 0.0, 1.0]}

There may be other attributes that also should be changed.

@joshmoore
Copy link

I would suggest two changes to the OME format:

Suggestion noted, thanks, @DennisHeimbigner, but I don't think we should open that conversation here.

Leaving nczarr, s3, and http out of the picture for a moment, I'm still struggling to understand whether you think these file-based pure-zarr examples from the original description should be covered by netcdf-c, @DennisHeimbigner:

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tczyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/zyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

In my mind, they should especially if we are pointing the community to this as the C implementation.

@DennisHeimbigner
Copy link
Collaborator

DennisHeimbigner commented Aug 11, 2022

....these file-based pure-zarr examples from the original description should be covered by netcdf-c,

Speaking for myself, I say no. The fact that they cannot be read by Xarray
makes it clear that having an inconsistent _ARRAY_DIMENSION <-> shape
mapping is just wrong, and needs to be fixed.

As for the case of

"direction": [ [ 1.0, 0.0, 0.0 ], [ 0.0, 1.0, 0.0 ], [ 0.0, 0.0, 1.0 ] ]

that is more a matter of opinion and I would really think that using a dictionary
instead of an array of arrays provides for more meaningful metadata.
GDAL (https://gdal.org/drivers/raster/zarr.html), for instance, has non-standard attribute values whose value is a dictionary.
I think that should be the convention for non-standard attributes.

@WardF WardF added this to the 4.9.1 milestone Aug 19, 2022
DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this issue Aug 28, 2022
* re: Unidata#2278
* re: Unidata#2485
* re: Unidata#2474

This PR subsumes PR Unidata#2278.
Actually is a bit an omnibus covering several issues.

## PR Unidata#2278
Add support for the Zarr string type.
Zarr strings are restricted currently to be of fixed size.
The primary issue to be addressed is to provide a way for user to
specify the size of the fixed length strings. This is handled by providing
the following new attributes special:
1. **_nczarr_default_maxstrlen** &mdash;
This is an attribute of the root group. It specifies the default
maximum string length for string types. If not specified, then
it has the value of 64 characters.
2. **_nczarr_maxstrlen** &mdash;
This is a per-variable attribute. It specifies the maximum
string length for the string type associated with the variable.
If not specified, then it is assigned the value of
**_nczarr_default_maxstrlen**.

This PR also requires some hacking to handle the existing netcdf-c NC_CHAR
type, which does not exist in zarr. The goal was to choose numpy types for
both the netcdf-c NC_STRING type and the netcdf-c NC_CHAR type such that
if a pure zarr implementation read them, it would still work and an
NC_CHAR type would be handled by zarr as a string of length 1.

For writing variables and NCZarr attributes, the type mapping is as follows:
* "|S1" for NC_CHAR.
* ">S1" for NC_STRING && MAXSTRLEN==1
* ">Sn" for NC_STRING && MAXSTRLEN==n

Note that it is a bit of a hack to use endianness, but it should be ok since for
string/char, the endianness has no meaning.

For reading attributes with pure zarr (i.e. with no nczarr
atribute types defined), they will always be interpreted as of
type NC_CHAR.

## Issue: Unidata#2474
This PR partly fixes this issue because it provided more
comprehensive support for Zarr attributes that are JSON valued expressions.
This PR still does not address the problem in that issue where the
_ARRAY_DIMENSION attribute is incorrectly set. Than can only be
fixed by the creator of the datasets.

## Issue: Unidata#2485
This PR also fixes the scalar failure shown in this issue.
It generally cleans up scalar handling.
It also adds a note to the documentation describing that
NCZarr supports scalars while Zarr does not and also how
scalar interoperability is achieved.

## Misc. Other Changes
1. Convert the nczarr special attributes and keys to be all lower case. So "_NCZARR_ATTR" now used "_nczarr_attr. Support back compatibility for the upper case names.
2. Cleanup my too-clever-by-half handling of scalars in libnczarr.
@DennisHeimbigner
Copy link
Collaborator

Partly fixed by #2492

@joshmoore
Copy link

Speaking for myself, I say no. The fact that they cannot be read by Xarray
makes it clear that having an inconsistent _ARRAY_DIMENSION <-> shape
mapping is just wrong, and needs to be fixed.

@DennisHeimbigner, to clarify: is the intent for netcdf-c to be considered a Zarr library (as opposed to an nczarr library)?

Partly fixed by #2492

@dzenanz: will you have a chance to test if the partial fix works for you?

@dzenanz
Copy link
Contributor Author

dzenanz commented Sep 5, 2022

I am finishing up a grant proposal today and tomorrow, I plan to test #2492 later in the week.

@DennisHeimbigner
Copy link
Collaborator

,,,to clarify: is the intent for netcdf-c to be considered a Zarr library (as opposed to an nczarr library)?

The primary goal is to read/write netcdf-4 stored in Zarr format datasets.
But since netcdf-4 does not exactly map to Zarr v2, it is necessary
to add optional extensions to Zarr to support netcdf-4 concepts.

On the other hand, we do not want to cut ourselves off from pure Zarr
datasets. So, a sub-goal is to support mapping a subset of netcdf-4
to pure Zarr. This sub-goal allows us to access pure Zarr datasets thru
the netCDF API.

@joshmoore
Copy link

Thanks for the explanation, @DennisHeimbigner. I definitely understand the primary goal. But is there also a sub-goal of mapping all pure Zarr datasets to some subset of netcdf-4?

@DennisHeimbigner
Copy link
Collaborator

...is there also a sub-goal of mapping all pure Zarr datasets to some subset of netcdf-4?

Short answer is yes. For example, I have just added support for Zarr fixed length
string type. There are a couple of other things I will be adding when there
specification is stable: .zmetadata (aggregate metadata) and sharding.
If there is something you would like to see that I have missed, then let me know.

@joshmoore
Copy link

joshmoore commented Sep 6, 2022

I guess I defer there to @dzenanz' testing since in #2474 (comment) there appear to be Zarr datasets that aren't openable. If it would help to copy them somewhere for testing, happy to do so.

@dzenanz
Copy link
Contributor Author

dzenanz commented Sep 8, 2022

Trying to read C:\Dev\ITKIOOMEZarrNGFF\v0.4\cyx.ome.zarr with updated version of netCDF still produces the same error: NetCDF: NCZarr error when invoking nc_open. Same happens for the other files there (yx.ome.zarr and multi-image.ome.zarr). I am using a slightly modified netCDF: https://github.com/dzenanz/netcdf-c/tree/itkSpecific.

@DennisHeimbigner
Copy link
Collaborator

That is because the OME use of _ARRAY_DIMENSION is still wrong and I
cannot fix that particular problem.

@dzenanz
Copy link
Contributor Author

dzenanz commented Sep 8, 2022

Removing either _ARRAY_DIMENSION attributes or leaving just one scale gets me further: NetCDF: Filter error: undefined filter encountered. That is understandable, since the array has:

"compressor": {
    "blocksize": 0,
    "clevel": 5,
    "cname": "lz4",
    "id": "blosc",
    "shuffle": 1
}

@DennisHeimbigner
Copy link
Collaborator

How did you install netcdf-c? Are you building it yourself?

@dzenanz
Copy link
Contributor Author

dzenanz commented Sep 9, 2022

To have the latest version, I have to build it myself. My build setup is here: InsightSoftwareConsortium/ITKIOOMEZarrNGFF#6
Here, I am trying to add zip support to the build: dzenanz/ITKIOOMEZarrNGFF@0bbea25
the current stumbling block is here: dzenanz/ITKIOOMEZarrNGFF@0bbea25#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR135-R137

@DennisHeimbigner
Copy link
Collaborator

So it looks like you are building with your own CMakeLists.txt and not
using the one provided by netcdf, correct?

@dzenanz
Copy link
Contributor Author

dzenanz commented Sep 9, 2022

I modified netCDF's CMakeLists.txt, in order to point to HDF5 bundled with ITK, and to avoid some duplicate targets with other libraries I use together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Todo
Development

No branches or pull requests

4 participants