
Future Implementation based on NetCDF Groups #18

Open · darothen opened this issue Aug 9, 2017 · 2 comments

darothen (Owner) commented Aug 9, 2017

As raised at pydata/xarray#1092, a potential way to resolve the issue where different Cases have common dimensions but different coordinates (e.g., comparing a model with a 2 deg grid vs one with a 1 deg grid) would be to lean on the "Group" idiom implemented as part of the NetCDF CDM. The data model defines a Group such that:

A Group is a container for Attributes, Dimensions, EnumTypedefs, Variables, and nested Groups. The Groups in a Dataset form a hierarchical tree, like directories on a disk. There is always at least one Group in a Dataset, the root Group, whose name is the empty string.

Ignoring the "meta"-ness of this idiom (and its similarity to the on-disk organization that experiment encourages), it enables something very powerful, because variables organized in different groups can have different dimensions. A major issue raised in pydata/xarray#1092 is how this sort of functionality could be extended on top of existing xarray data structures. But selection semantics aren't necessarily the tough part of this; that's just an issue of building a sensible interface.
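
To make that concrete, here is a minimal sketch using the netCDF4-python library (the file, group, and coordinate values are made up for illustration) of two groups in one file that share a dimension name but give it different sizes:

import numpy as np
import netCDF4

# Hypothetical file with two groups, each defining its own "x"
# dimension of a different size (a coarse vs. fine grid).
with netCDF4.Dataset("cases.nc", "w") as nc:
    coarse = nc.createGroup("coarse")
    coarse.createDimension("x", 5)
    xc = coarse.createVariable("x", "f8", ("x",))
    xc[:] = np.linspace(0.0, 10.0, 5)

    fine = nc.createGroup("fine")
    fine.createDimension("x", 10)
    xf = fine.createVariable("x", "f8", ("x",))
    xf[:] = np.linspace(0.0, 10.0, 10)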

The broader idea is to encapsulate an entire collection of numerical model output in the same common framework, to help automate the generation of complex analyses. Using the sample data in experiment and the idiom I use now, that might look something like this:

<xarray.Dataset>
Dimensions:  (time: 10, x: 5, y: 5, param1: 3, param2: 3, param3: 2)
Coordinates:
  * param3   (param3) <U5 'alpha' 'beta'
  * param2   (param2) int64 1 2 3
  * param1   (param1) <U1 'a' 'b' 'c'
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
  * x        (x) float64 0.0 2.5 5.0 7.5 10.0
  * y        (y) float64 0.0 2.5 5.0 7.5 10.0
    numbers  (time) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    precip   (time, x, y) float64 -1.072 -0.124 -0.05287 -0.9504 -1.527 ...
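
For reference, a hedged sketch of how that dense Dataset might be assembled today by nesting xr.concat over the three parameter axes; load_case is a hypothetical stand-in for opening one of the per-case files on disk:

import numpy as np
import pandas as pd
import xarray as xr

param1 = ["a", "b", "c"]
param2 = [1, 2, 3]
param3 = ["alpha", "beta"]

def load_case(p1, p2, p3):
    # Stand-in for opening one of the 18 per-case files on disk.
    return xr.Dataset(
        {"precip": (("time", "x", "y"), np.random.randn(10, 5, 5))},
        coords={
            "time": pd.date_range("2000-01-01", periods=10),
            "x": np.linspace(0.0, 10.0, 5),
            "y": np.linspace(0.0, 10.0, 5),
        },
    )

# Nest concatenations over the parameter axes; a pandas Index supplies
# both the new dimension name and its coordinate values.
dense = xr.concat(
    [xr.concat(
        [xr.concat([load_case(p1, p2, p3) for p3 in param3],
                   dim=pd.Index(param3, name="param3"))
         for p2 in param2],
        dim=pd.Index(param2, name="param2"))
     for p1 in param1],
    dim=pd.Index(param1, name="param1"),
)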

So this corresponds to a dense, 6D tensor or array which is constructed from 3 × 3 × 2 = 18 NetCDF files on disk. But it would be equivalent to a NetCDF file with hierarchical groups, which could potentially look something like this:

<experiment.Experiment>
Dimensions:  (time: 10, x: *, y: *)
Groups: param1: 3 // param2: 3 // param3: 2
Cases:
  + param3   "Parameter 3" <U5 'alpha' 'beta'
  + param2   "Parameter 2" int64 1 2 3
  + param1   "Parameter 1" <U1 'a' 'b' 'c'
Coordinates:
  * time   (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
  * x        (x) float64 [param1, ]
  * y        (y) float64 [param1, ]
    numbers  (time) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    precip   (time, x, y) float64 {min: -10.0, max: 10.0}

What's going on here:

  1. The Cases which comprise this Experiment are recorded, much as they are currently.
  2. We've set a hierarchy for the Cases to match a Group hierarchy. The only invariant is that Cases or Groups containing datasets whose coordinate values may differ (e.g., the 2 deg vs 1 deg grid) should appear first in the hierarchy... although they could also go last; I need to think through the implications of either choice.
  3. Every single dataset has the same set of Dimensions, but their coordinate values might differ. We indicate that in the Experiment's readout in two places: under "Dimensions" and again under the "Coordinates" values.
  4. To accommodate the fact that the underlying arrays may have inconsistent shapes, we do not print out preview values under "Data variables", but instead perhaps a summary statistic or a few key attributes.

Immediately, this gives a way to serialize to disk - this entire data structure translates directly to a NetCDF4 file with Groups. Groups can of course have metadata, which we would specify as part of their Case instantiation. Furthermore, the order/layout of Cases would become important when you initialize an Experiment, since it dictates the structure of the translated netCDF dataset.
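
A hedged sketch of that serialization, reusing the hypothetical load_case and parameter lists from the earlier sketch; it relies on the group= argument that xarray's Dataset.to_netcdf accepts with the netCDF4 backend:

import itertools

# Write each case's Dataset into a nested group path that mirrors the
# Case hierarchy; mode="a" appends later groups to the same file.
mode = "w"
for p1, p2, p3 in itertools.product(param1, param2, param3):
    ds = load_case(p1, p2, p3)
    ds.to_netcdf("experiment.nc", mode=mode,
                 group="/{}/{}/{}".format(p1, p2, p3))
    mode = "a"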

Under the hood, I don't think anything in Experiment has to change. We still use a multi-keyed dictionary to access the different Datasets that get loaded. But the idea of a "master dataset" becomes irrelevant - as long as all the different Datasets are loaded lazily via dask, selecting from the underlying dictionary and performing arithmetic just defers to xarray. The only question is how to apply calculations directly... it seems like we don't gain anything by looping over that dictionary, but we could always apply the calculations asynchronously in parallel via joblib and reserve placeholders using the futures we get back. I mean, xarray itself loops over the Variables in a Dataset... right?
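
As a hedged sketch of that pattern (joblib's Parallel and delayed are real; the container, keys, and total_precip calculation are illustrative, again reusing the hypothetical load_case from above):

import itertools
from joblib import Parallel, delayed

# The Experiment's internal store: a plain dict keyed by case tuples.
cases = {(p1, p2, p3): load_case(p1, p2, p3)
         for p1, p2, p3 in itertools.product(param1, param2, param3)}

def total_precip(ds):
    # An example per-case calculation.
    return ds["precip"].sum(dim=["x", "y"])

# Fan the per-case work out across workers, then re-key the results.
results = dict(zip(
    cases,
    Parallel(n_jobs=4)(delayed(total_precip)(ds) for ds in cases.values()),
))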

The other key issue is that this interface implicitly encodes when dimension mismatches will occur, since we are recording which Groups contain data whose dimensions do not agree. This means we can build in functionality to resolve these inconsistencies implicitly. Similar to the semantics for "aligning" in pandas, we could incorporate crude functionality to automatically re-grid data so that normal array operations work. For the case of bilinear resampling on simple rectangular coordinate systems, that's super easy. The framework would then strongly motivate the broader pangeo-data goal of universal re-gridding tools.
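
A hedged sketch of that crude alignment step, using xarray's Dataset.interp (multilinear interpolation on rectilinear grids; it requires scipy) together with the hypothetical load_case from above:

import numpy as np

# Pretend these two cases came off models with different grids.
coarse = load_case("a", 1, "alpha")
fine = load_case("a", 1, "beta").interp(
    x=np.linspace(0.0, 10.0, 10),  # stand-in for a finer native grid
    y=np.linspace(0.0, 10.0, 10),
)

# Bilinear-style resampling back onto the coarse grid, so that normal
# array arithmetic lines up.
diff = fine.interp(x=coarse["x"], y=coarse["y"]) - coarse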

These are just some early thoughts to consider.

spencerahill commented:

I like this, especially the last part:

Similar to the semantics for "aligning" in pandas, we could incorporate crude functionality to automatically re-grid data so that normal array operations work. For the case of bilinear resampling on simple rectangular coordinate systems, that's super easy. The framework would then strongly motivate the broader pangeo-data goal of universal re-gridding tools.


Having read this in its entirety and skimmed pydata/xarray#1092, an important question seems to be: should this go here or upstream within xarray? There seems to be some desire for an xarray-level solution. Supposing that were implemented, what would remain here?

darothen (Owner, Author) commented:

I think that if a group of us invest time in this angle, we should start beyond the confines of xarray. @benbovy's idea of "a data structure built on top of Dataset" is the way to go imho, with some sort of new syntactic sugar to access the different groups - or to act as a "puppeteer" that coordinates operations across groups.

In that thread, a few folks have been quite resistant to implementing this part of the netCDF data model (or an extension to it) internally within xarray. That doesn't mean it will always be the case :) But I think a proof of concept external to the core library would be a good place to start.
