Dimension names as core array metadata #73

Closed
alimanfoo opened this issue May 21, 2020 · 28 comments
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec todo pre-rfc

Comments

@alimanfoo
Member

Several domains make use of named dimensions, i.e., for a given array with N dimensions, each of those N dimensions is given a human-readable name.

Given the broad utility of this, should we include this within the core array metadata in the v3 protocol? E.g., add a dimensions property within the array metadata document, whose value should be a list of strings:

{
    "shape": [10000, 1000],
    "dimensions": ["space", "time"],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [1000, 100]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": "NaN",
    "extensions": [],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

One question this raises is how to handle the case where no names are provided, or only some dimensions are named but not others. I.e., dimension names should probably be optional.
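
To make the optionality concrete, here is a minimal sketch of how a reader might normalize the property into one entry per axis. It is only an illustration: the helper `read_dimension_names` is hypothetical, it assumes the property is called `dimensions` as in the example above, and it assumes an unnamed dimension is written as JSON `null`, which this proposal has not settled.

```python
import json
from typing import Optional, Tuple


def read_dimension_names(array_meta: dict) -> Tuple[Optional[str], ...]:
    """Return one name (or None) per axis of the array.

    Assumptions of this sketch: the property is called "dimensions",
    it may be missing entirely, and an unnamed axis is stored as null.
    """
    ndim = len(array_meta["shape"])
    names = array_meta.get("dimensions")
    if names is None:
        # No names provided at all: every axis is unnamed.
        return (None,) * ndim
    if len(names) != ndim:
        raise ValueError(f"expected {ndim} dimension names, got {len(names)}")
    for name in names:
        if name is not None and not isinstance(name, str):
            raise TypeError(f"dimension name must be a string or null: {name!r}")
    return tuple(names)


meta = json.loads('{"shape": [10000, 1000], "dimensions": ["space", "time"]}')
print(read_dimension_names(meta))                      # ('space', 'time')
print(read_dimension_names({"shape": [10000, 1000]}))  # (None, None)
```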

The alternative is that we leave this to the community to define a usage convention to store dimension names in the user attributes, e.g., similar to what xarray currently does using the "_ARRAY_DIMENSIONS" attribute name.

alimanfoo added the core-protocol-v3.0 label (Issue relates to the core protocol version 3.0 spec) on May 21, 2020
@meggart
Member

meggart commented May 25, 2020

I would very much appreciate having an "official" way to define dimension names. Currently I mimic the xarray conventions in my Julia code, but this feels a bit risky: these conventions are not properly versioned, so a future change in how they are handled could lead to unexpected bugs. So I don't mind whether this is in the core protocol or in some extension, as long as there is a clean way to find out programmatically which convention the dimension names follow.

@rabernat
Contributor

rabernat commented Jun 3, 2020

I agree with this proposal.

It seems like we definitely want to synchronize this with whatever @DennisHeimbigner, @WardF, and the rest of the Unidata crew decide to do about dimension names.

@DennisHeimbigner

DennisHeimbigner commented Jun 3, 2020 via email

@alimanfoo
Member Author

I would suggest that, if we support dimension names in the v3 spec, then they are simply string labels for the dimensions of an array. Nothing else is implied. I.e., if two arrays happen to use the same name for a particular dimension, then at the level of the v3 protocol, that does not imply anything. It could mean that the two arrays have a "shared dimension" in the netCDF sense, or it could just be coincidence, at least as far as a vanilla implementation of the v3 protocol is concerned.

A library that supports the full netCDF data model might then choose to treat these dimension names as names for shared dimensions; that would be fine and up to the netCDF layer implementation to manage.

Hope that makes sense.

@DennisHeimbigner

DennisHeimbigner commented Jun 3, 2020 via email

@alimanfoo
Member Author

However, the dimension name and size must be stored in the metadata independent of any variable. So adding a dimension may interfere with asynchronicity.

I may need some help from @rabernat here, there's a few different "dimensions" to this problem (sorry for the very bad pun :-)

Note that in this proposal I am simply proposing a metadata property for giving names to the dimensions (axes) of an array. Perhaps the property should be called dimension_names to make that clear. In any case, there is no implication that these dimensions are shared with any other arrays.

E.g., with this feature I could create an array with shape (10, 5) and name the dimensions ("foo", "bar"). In the zarr protocol, it would be totally fine to create another array with shape (100, 5) and name the dimensions ("foo", "qux"). I.e., creating each of these arrays is an independent operation, and the names are just labels for the axes of the arrays, not necessarily shared.
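
A rough sketch of that example, using a hypothetical `ArrayMeta` class as a stand-in for the array metadata document (neither the class nor the field name `dimension_names` comes from the spec):

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArrayMeta:
    """Hypothetical stand-in for an array metadata document."""
    shape: Tuple[int, ...]
    dimension_names: Tuple[str, ...]

    def __post_init__(self):
        # The only check made at this level: one label per axis.
        if len(self.shape) != len(self.dimension_names):
            raise ValueError("need exactly one name per dimension")


# Two independent arrays. Both use the label "foo" for axis 0, with
# different extents (10 vs. 100), and nothing links or compares them.
a = ArrayMeta(shape=(10, 5), dimension_names=("foo", "bar"))
b = ArrayMeta(shape=(100, 5), dimension_names=("foo", "qux"))

print(a.shape[a.dimension_names.index("foo")])  # 10
print(b.shape[b.dimension_names.index("foo")])  # 100
```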

I.e., a vanilla zarr implementation would just offer the ability to provide names for the dimensions (axes) of an array, and might show those names when providing a visual representation of the array, but that would be it.

Now, a higher-level library implementing the netCDF data model might choose to interpret these as names for shared dimensions, under certain circumstances. I.e., if two arrays within the same group both have the name "foo" for one of their dimensions, then assume they are referring to a shared dimension.
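
As an illustration of what such a higher-level layer might do, the sketch below treats equal names within one group as a single shared dimension and rejects inconsistent sizes. The `Group` structure and the function `infer_shared_dimensions` are made up for this example and are not part of zarr or netCDF.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical in-memory view of one group: array name -> (shape, dimension names).
Group = Dict[str, Tuple[Sequence[int], Sequence[str]]]


def infer_shared_dimensions(group: Group) -> Dict[str, int]:
    """Treat equal names within one group as one shared dimension.

    Returns a mapping from dimension name to size, and raises if two arrays
    in the group disagree about the size of a name.
    """
    sizes: Dict[str, int] = {}
    for array_name, (shape, names) in group.items():
        for name, size in zip(names, shape):
            if sizes.setdefault(name, size) != size:
                raise ValueError(
                    f"dimension {name!r} has size {size} in {array_name!r} "
                    f"but size {sizes[name]} elsewhere in the group"
                )
    return sizes


group = {
    "temp": ((5, 4), ("lat", "lon")),
    "lat": ((5,), ("lat",)),
}
print(infer_shared_dimensions(group))  # {'lat': 5, 'lon': 4}
```

Under this stricter reading, the two arrays named ("foo", "bar") and ("foo", "qux") with shapes (10, 5) and (100, 5) would be rejected if they lived in the same group, which is exactly the extra semantics a vanilla zarr implementation would not impose.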

This is similar to what xarray does currently. The main difference is that xarray uses an attribute called _ARRAY_DIMENSIONS, whereas this proposal offers a standard metadata property called dimensions (or dimension_names) which might be used for that purpose. There is a slight difference though, in that xarray knows that the _ARRAY_DIMENSIONS attribute is always supposed to indicate names for shared dimensions. I.e., there is stronger semantics for _ARRAY_DIMENSIONS than for the proposed dimensions array metadata property.
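
For comparison, a reader that wants to support both styles could look in both places. This is a sketch only: it assumes the user attributes sit under an `attributes` key of the array metadata document, as in the example at the top of this issue, and the lookup order is an arbitrary choice made for illustration.

```python
from typing import List, Optional


def dimension_names(array_meta: dict) -> Optional[List[str]]:
    """Find dimension names in a metadata document, or return None.

    Checks a core property first (either of the two names floated in this
    thread), then falls back to the xarray-style user attribute.
    """
    for key in ("dimension_names", "dimensions"):
        if key in array_meta:
            return list(array_meta[key])
    attrs = array_meta.get("attributes", {})
    if "_ARRAY_DIMENSIONS" in attrs:
        return list(attrs["_ARRAY_DIMENSIONS"])
    return None


core_style = {"shape": [10, 5], "dimensions": ["space", "time"]}
xarray_style = {"shape": [10, 5], "attributes": {"_ARRAY_DIMENSIONS": ["space", "time"]}}
print(dimension_names(core_style))    # ['space', 'time']
print(dimension_names(xarray_style))  # ['space', 'time']
```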

Perhaps it would be easier to avoid potential confusion, and for zarr to not try to cross into the netCDF space, and rather allow that to be dealt with via a set of usage conventions that properly deal with the netCDF semantics, such as the xarray approach or the nczarr approach.

@alimanfoo
Member Author

alimanfoo commented Jun 4, 2020

However, the dimension name and size must be stored in the metadata independent of any variable.

Also noting that IIUC this is not necessarily true, e.g., the xarray approach does not separately store dimension names and sizes. This is different from the nczarr proposal. Note that I have no opinion on which of these two approaches is best, just noting the difference.

@rabernat
Contributor

rabernat commented Jun 4, 2020

Note that in this proposal I am simply proposing a metadata property for giving names to the dimensions (axes) of an array. Perhaps the property should be called dimension_names to make that clear. In any case, there is no implication that these dimensions are shared with any other arrays.

👍 This is how I have been thinking of it. Rather than calling the axes 0, 1, 2, we can call them time, lat, lon. Additional extensions or applications could decide to interpret this in different ways, such as in the netCDF data model.

However, the dimension name and size must be stored in the metadata independent of any variable.

I don't see why. The dimension size is determined by the shape of the array.

@DennisHeimbigner

DennisHeimbigner commented Jun 4, 2020 via email

@joshmoore
Member

Thinking out loud somewhat, I wonder if restricting dimension_names to [a-zA-Z0-9_] for the moment wouldn't be prudent. That would allow nice Python referencing and would leave room for a potential future extension to pathed (/) or dotted (.) nomenclature for looking up named dimensions?
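
A minimal sketch of such a check, assuming the rule is "one or more characters from that set" (the thread does not say whether empty names or a leading digit would be allowed):

```python
import re

# Assumed rule for this sketch: one or more of a-z, A-Z, 0-9, "_".
_NAME_RE = re.compile(r"[a-zA-Z0-9_]+")


def is_valid_dimension_name(name: str) -> bool:
    """Check a candidate dimension name against the restricted character set."""
    return _NAME_RE.fullmatch(name) is not None


for candidate in ["time", "lat_2", "/g1/dim17", "x.y", ""]:
    print(repr(candidate), is_valid_dimension_name(candidate))
```

Note that this character set alone does not guarantee a valid Python identifier: a name such as "2d" passes the regex, but `"2d".isidentifier()` is False because identifiers cannot start with a digit.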

@Carreau
Contributor

Carreau commented Sep 25, 2020

Update RFC to say this is something we'd like input on.

@jbms
Contributor

jbms commented Feb 8, 2022

I would also like to see built-in support for dimension names, and would also suggest that, for simplicity, the zarr specification itself make no assumptions about "shared dimensions" between multiple arrays.

Aside from possible constraints on the allowed characters, I think that empty labels should be allowed (and indicate an unnamed dimension), and non-empty labels must be distinct. Not specifying the dimension names at all would be equivalent to specifying all empty strings as the dimension labels.
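
Those three rules could be sketched as follows. The function name `normalize_labels` is hypothetical, and the sketch deliberately leaves out any character-set restriction.

```python
from typing import List, Optional, Sequence


def normalize_labels(ndim: int, labels: Optional[Sequence[str]] = None) -> List[str]:
    """Apply the rules suggested above.

    - labels not specified -> all dimensions unnamed ("" for each axis)
    - "" marks an unnamed dimension and may appear any number of times
    - non-empty labels must be distinct
    """
    if labels is None:
        return [""] * ndim
    if len(labels) != ndim:
        raise ValueError(f"expected {ndim} labels, got {len(labels)}")
    named = [label for label in labels if label != ""]
    if len(named) != len(set(named)):
        raise ValueError(f"non-empty labels must be distinct: {list(labels)!r}")
    return list(labels)


print(normalize_labels(3))                  # ['', '', '']
print(normalize_labels(3, ["x", "", ""]))   # ['x', '', '']
```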

@d-v-b
Contributor

d-v-b commented Feb 8, 2022

What's the advantage of allowing empty labels?

@jbms
Contributor

jbms commented Feb 8, 2022

Given that dimension names would be optional, it seems natural to me to allow that optionality on a per-dimension basis. E.g. maybe you are computing some sort of multiplication or partial reduction between two zarr arrays A and B, where A has labels and B does not. If the result has some dimensions corresponding to dimensions of A and some dimensions corresponding to dimensions of B, we would like to preserve the dimension labels from A without having to invent fake labels for B.
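
A small sketch of that situation, using an outer product as the simplest case where the result keeps all axes of A followed by all axes of B. The helper name and the use of "" for an unnamed axis are assumptions of this example.

```python
from typing import List, Sequence


def outer_product_labels(a_labels: Sequence[str], b_labels: Sequence[str]) -> List[str]:
    """Labels for an outer product: A's axes followed by B's axes.

    "" marks an unnamed axis, so B's axes need no invented labels.
    """
    result = list(a_labels) + list(b_labels)
    named = [label for label in result if label]
    if len(named) != len(set(named)):
        raise ValueError("operands share a non-empty label; result would be ambiguous")
    return result


# A is labelled, B is not: the result keeps A's labels and B's axes stay unnamed.
print(outer_product_labels(["x", "y"], ["", ""]))  # ['x', 'y', '', '']
```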

However, I don't feel too strongly about allowing empty labels.

@DennisHeimbigner

I assume that this would operate like _ARRAY_DIMENSIONS
in that the size of the named dimension is determined from the corresponding
position in the "shape" key. This of course can lead to inconsistency in the size of a
named dimension. Not surprisingly, I prefer the netcdf approach where the name and size
are declared separately from any variable so that inconsistency is not possible.

@DennisHeimbigner

Another point. Unless you require all dimension names to be "global",
then you will need to use fully qualified names (fqn) for dimension names.
So one might have something like this.

"dimensions": ["/dim1", "/grp1/grp2/dim2"]

@DennisHeimbigner

DennisHeimbigner commented Feb 8, 2022

WRT anonymous dimensions. One approach is to merge the shape and dimension keys
and make dimension names be JSON strings and anonymous dimensions be integers.
This avoids empty labels.
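
One possible reading of this, sketched below, is a single merged list in which a string refers to a named dimension whose size is declared elsewhere (as in netCDF) and an integer is the extent of an anonymous dimension. Both the merged layout and the `declared_sizes` table are assumptions made for illustration; the comment above does not spell out the details.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union


def split_merged_shape(
    merged: Sequence[Union[str, int]],
    declared_sizes: Dict[str, int],
) -> Tuple[List[int], List[Optional[str]]]:
    """Split a merged shape/dimensions list into (shape, names).

    Strings are names looked up in declared_sizes; integers are the sizes
    of anonymous dimensions, which get the name None.
    """
    shape: List[int] = []
    names: List[Optional[str]] = []
    for entry in merged:
        if isinstance(entry, str):
            shape.append(declared_sizes[entry])
            names.append(entry)
        elif isinstance(entry, int) and not isinstance(entry, bool):
            shape.append(entry)
            names.append(None)
        else:
            raise TypeError(f"expected a string or an integer, got {entry!r}")
    return shape, names


print(split_merged_shape(["lat", "lon", 3], {"lat": 5, "lon": 4}))
# ([5, 4, 3], ['lat', 'lon', None])
```

Because JSON distinguishes the string "17" from the number 17, a digit-only dimension name remains unambiguous under this scheme.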

@jbms
Contributor

jbms commented Feb 8, 2022

If we allow anonymous dimensions, then I would say they indeed have to be specified by their index rather than name, but of course named dimensions could also be specified by index.

And in many contexts, e.g. for display to a user, I agree that it would be very natural to display just the index in place of the name for anonymous dimensions.

Although the dimension names could be quoted to avoid ambiguity, it might also be good to disallow dimension names that consist only of digits 0-9.
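
A sketch of index-or-name lookup along these lines, assuming "" marks an anonymous dimension and that digit-only names are rejected. The function names are hypothetical.

```python
from typing import Sequence, Union


def resolve_dimension(labels: Sequence[str], key: Union[int, str]) -> int:
    """Return the axis index for key, which is either an index or a label."""
    if isinstance(key, int):
        if not 0 <= key < len(labels):
            raise IndexError(f"dimension index {key} out of range for {len(labels)} dimensions")
        return key
    if key == "":
        raise ValueError("anonymous dimensions can only be selected by index")
    if key.isascii() and key.isdigit():
        raise ValueError(f"digit-only dimension names are not allowed: {key!r}")
    return list(labels).index(key)  # raises ValueError if the label is absent


def display_name(labels: Sequence[str], index: int) -> str:
    """Show the label, or the index for an anonymous dimension."""
    return labels[index] or str(index)


labels = ["x", "", "c"]
print(resolve_dimension(labels, "c"))               # 2
print(resolve_dimension(labels, 1))                 # 1
print([display_name(labels, i) for i in range(3)])  # ['x', '1', 'c']
```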

However, I'm unclear exactly what you are proposing as far as having dimension names be either strings or integers. Would that just be a concern of a specific implementation, rather than the zarr spec itself?

Also as far as referencing dimensions by path, as far as I can tell nothing in the current spec requires referencing dimensions; I suppose you are thinking from the context of an extension like ome-zarr or a version of netcdf built on top of zarr.

While I agree that the netcdf data model makes a lot of sense in many cases, I'm not sure how well the unique dimension names constraint / consistent size for every named dimension constraint fits with all intended uses of zarr v3. I guess users could always work around that issue by putting each zarr array in a separate zarr repository, but users might wish to get other data organizational advantages of having multiple arrays in a single zarr repository without constraining themselves to the netcdf data model.

@DennisHeimbigner

Although the dimension names could be quoted to avoid ambiguity, it might also be good to disallow dimension names that consist only of digits 0-9.

That is the reason I made the string vs number distinction. And the fact that netcdf allows
dimension names that are all digits.

@DennisHeimbigner

Also as far as referencing dimensions by path, as far as I can tell nothing in the current spec requires referencing dimensions; I suppose you are thinking from the context of an extension like ome-zarr or a version of netcdf built on top of zarr.

I do not understand this comment.
I was referring to a case where we have a variable v1
defined in a group /g1 (i.e just below the root group)
something like this:

"shape": [1, 17],
"dimensions": ["dim1", "dim17"]

Suppose we have another variable v2 in group /g2.

"shape": [17],
"dimensions": ["dim17"]

How do we know that the two dim17's refer to the same dimension?
I would prefer that "dim17" be replaced with "/g1/dim17"
so that it is clear that the same dimension is being used.

Of course, this assumes one wants the shared dimension name semantics
to matter, but that, of course, is the whole point of named dimensions.
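
A sketch of how such fully qualified names might be resolved. The rule used here (an unqualified name is taken to be declared in the variable's own group) is an assumption for illustration, not something netCDF or zarr prescribes in this form.

```python
import posixpath


def qualify_dimension_name(name: str, group_path: str) -> str:
    """Turn a dimension name into a fully qualified name (FQN).

    Names that already start with "/" are kept; other names are taken to be
    declared in group_path (an assumption of this sketch).
    """
    if name.startswith("/"):
        return posixpath.normpath(name)
    return posixpath.join(group_path, name)


# v1 lives in /g1 and uses simple names; v2 lives in /g2 and refers to
# the dimension declared in /g1 by its FQN.
v1_dims = [qualify_dimension_name(n, "/g1") for n in ["dim1", "dim17"]]
v2_dims = [qualify_dimension_name(n, "/g2") for n in ["/g1/dim17"]]
print(v1_dims)  # ['/g1/dim1', '/g1/dim17']
print(v2_dims)  # ['/g1/dim17']
```

With plain "dim17" in /g2 the result would be "/g2/dim17", a different dimension from "/g1/dim17", which is the ambiguity the FQN is meant to remove.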

@jbms
Contributor

jbms commented Feb 8, 2022

It seems like just using a unique dimension name might be more natural than specifying a dimension by reference to another array, but I am not sure.

Certainly netcdf shared dimension semantics are applicable in some applications, but I think there are other applications where dimension names are useful but the constraint that all dimensions with a given name should have the same extent is not useful. For example:

  • multiscale dataset, where you have arrays storing the data at multiple scales. Here the dimension names could indicate the correspondence between the dimensions of the arrays at different scales, but the extents will of course be different.
  • a large collection of images, with dimensions x, y, c, and a convolutional neural network model with input dimensions x, y, c. All of the images may have different x, y dimensions but you want to apply the neural network model to them, and be sure you aren't accidentally transposing x and y.
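
For the second case, a sketch of how names alone (with no shared extents) could guard against an accidental transpose. The function name is hypothetical.

```python
from typing import List, Sequence


def permutation_to_match(source: Sequence[str], target: Sequence[str]) -> List[int]:
    """Return the axis permutation that reorders source names into target order."""
    if len(set(source)) != len(source) or sorted(source) != sorted(target):
        raise ValueError(f"cannot match dimensions {list(source)} to {list(target)}")
    return [list(source).index(name) for name in target]


model_input = ["x", "y", "c"]
print(permutation_to_match(["x", "y", "c"], model_input))  # [0, 1, 2]
print(permutation_to_match(["y", "x", "c"], model_input))  # [1, 0, 2]
```

The returned list follows the convention of `numpy.transpose`, where entry i names the source axis that becomes axis i of the result. The extents of x and y never enter the check, so images of different sizes all pass.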

@DennisHeimbigner

It seems like just using a unique dimension name might be more natural than specifying a dimension by reference to another array, but I am not sure.

In a sense I agree which is why netcdf declares dimensions separately from variables.
But it appears that this community would rather declare the dimensions as part of the
variable declaration.

@DennisHeimbigner

Your examples still prove my point. You are assuming that the dimensions with the same
name are semantically the same. The issue is being able to use the same simple name (e.g. "x")
in multiple places with different extents. But you still need to disambiguate those
multiple declarations and using the fqn is IMHO the best way to do that.

@DennisHeimbigner

I think that coordinate variables are important in this discussion.

Suppose we have the following:

dimensions:  lat=5, lon=4;
variables:
float temp(lat,lon);
float lat(lat);
float lon(lon);

The temp variable represents the temperature at a given latitude and longitude.

The longitude values are, say, -1deg. thru 2deg.
and the latitude values are, say, -0.5deg. thru 1.5deg.
However the lat dimension runs from 0 thru 4 and lon runs from 0 thru 3.
The so-called coordinate variables map the raw indices to
the actual lat and lon values of the coordinates.
So we have:

lat = -0.5, 0.0, 0.5, 1.0, 1.5 ;
lon = -1.0, 0.0, 1.0, 2.0 ;

This concept of coordinate variables is extremely useful but it relies on
the use of shared names to indicate shared semantics.
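
The same example as a small Python sketch, where the link between temp and its coordinate variables rests only on the shared names. The dictionaries `dims` and `coords` and the helper `coordinates_of` are illustrative, not a netCDF API.

```python
# Coordinate variables: one 1-d array per dimension, named after the dimension.
lat = [-0.5, 0.0, 0.5, 1.0, 1.5]
lon = [-1.0, 0.0, 1.0, 2.0]

# Dimension names of each variable, as in the CDL above.
dims = {"temp": ("lat", "lon"), "lat": ("lat",), "lon": ("lon",)}
coords = {"lat": lat, "lon": lon}


def coordinates_of(variable, index):
    """Map a raw index into `variable` to actual coordinate values."""
    return {name: coords[name][i] for name, i in zip(dims[variable], index)}


# temp[4, 0] is the temperature at latitude 1.5 and longitude -1.0.
print(coordinates_of("temp", (4, 0)))  # {'lat': 1.5, 'lon': -1.0}
```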

@jbms
Contributor

jbms commented Feb 9, 2022

I agree that shared names to indicate "shared semantics" in some sense is the point of named dimensions, but I think exactly what those "shared semantics" are depends on the application.

If zarr were to use the netcdf data model, where shared name means shared domain, then how do you propose to deal with the use case of a single zarr repository where the root group contains a collection of arrays named sample0, sample1, ..., sampleN? Each of these samples is a 3-d xyc image, but they don't all have the same x and y dimensions. How would we assign dimension names in this case?

@DennisHeimbigner

In netcdf, you put the various dimensions in different groups (possibly with the relevant
variables).

jbms added a commit to jbms/zarr-specs that referenced this issue May 31, 2022
This adds support for dimension names (zarr-developers#73) and non-zero origins (zarr-developers#122).
@jstriebel
Member

Crosslinking #149 (comment)

@jstriebel
Member

Resolved via #162.
