
Future Implementation based on NetCDF Groups #18

Open · darothen opened this issue Aug 9, 2017 · 2 comments

darothen (Owner) commented Aug 9, 2017

As raised at pydata/xarray#1092, a potential way to resolve the issue where different Cases have common dimensions but different coordinates (e.g., comparing a model with a 2 deg grid vs one with a 1 deg grid) would be to lean on the "Group" idiom implemented as part of the NetCDF CDM. The data model defines a Group such that:

A Group is a container for Attributes, Dimensions, EnumTypedefs, Variables, and nested Groups. The Groups in a Dataset form a hierarchical tree, like directories on a disk. There is always at least one Group in a Dataset, the root Group, whose name is the empty string.

Ignoring the "meta"-ness of this idiom (and its similarity to the on-disk organization that experiment encourages), it enables something very powerful, because variables organized in different groups can have different dimensions. A major issue raised in pydata/xarray#1092 is how this sort of functionality could be extended on top of existing xarray data structures. But selection semantics aren't necessarily the tough part of this; that's just an issue of building a sensible interface.
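
To make that concrete, here is a minimal sketch using the netCDF4-python library (the file, group, and coordinate values are made up for illustration) of two groups in one file that share a dimension name but give it different sizes:

import numpy as np
import netCDF4

# Hypothetical file with two groups, each defining its own "x"
# dimension of a different size (a coarse vs. fine grid).
with netCDF4.Dataset("cases.nc", "w") as nc:
    coarse = nc.createGroup("coarse")
    coarse.createDimension("x", 5)
    xc = coarse.createVariable("x", "f8", ("x",))
    xc[:] = np.linspace(0.0, 10.0, 5)

    fine = nc.createGroup("fine")
    fine.createDimension("x", 10)
    xf = fine.createVariable("x", "f8", ("x",))
    xf[:] = np.linspace(0.0, 10.0, 10)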

The broader idea is to encapsulate an entire collection of numerical model output in the same common framework, to help automate the generation of complex analyses. Using the sample data in experiment and the idiom I use now, that might look something like this:

<xarray.Dataset>
Dimensions:  (time: 10, x: 5, y: 5, param1: 3, param2: 3, param3: 2)
Coordinates:
  * param3   (param3) <U5 'alpha' 'beta'
  * param2   (param2) int64 1 2 3
  * param1   (param1) <U1 'a' 'b' 'c'
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
  * x        (x) float64 0.0 2.5 5.0 7.5 10.0
  * y        (y) float64 0.0 2.5 5.0 7.5 10.0
    numbers  (time) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    precip   (time, x, y) float64 -1.072 -0.124 -0.05287 -0.9504 -1.527 ...
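
For reference, a hedged sketch of how that dense Dataset might be assembled today by nesting xr.concat over the three parameter axes; load_case is a hypothetical stand-in for opening one of the per-case files on disk:

import numpy as np
import pandas as pd
import xarray as xr

param1 = ["a", "b", "c"]
param2 = [1, 2, 3]
param3 = ["alpha", "beta"]

def load_case(p1, p2, p3):
    # Stand-in for opening one of the 18 per-case files on disk.
    return xr.Dataset(
        {"precip": (("time", "x", "y"), np.random.randn(10, 5, 5))},
        coords={
            "time": pd.date_range("2000-01-01", periods=10),
            "x": np.linspace(0.0, 10.0, 5),
            "y": np.linspace(0.0, 10.0, 5),
        },
    )

# Nest concatenations over the parameter axes; a pandas Index supplies
# both the new dimension name and its coordinate values.
dense = xr.concat(
    [xr.concat(
        [xr.concat([load_case(p1, p2, p3) for p3 in param3],
                   dim=pd.Index(param3, name="param3"))
         for p2 in param2],
        dim=pd.Index(param2, name="param2"))
     for p1 in param1],
    dim=pd.Index(param1, name="param1"),
)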

So this corresponds to a dense, 6D tensor or array which is constructed from 3 × 3 × 2 = 18 NetCDF files on disk. But it would be equivalent to a NetCDF file with hierarchical groups, which could potentially look something like this:

<experiment.Experiment>
Dimensions:  (time: 10, x: *, y: *)
Groups: param1: 3 // param2: 3 // param3: 2
Cases:
  + param3   "Parameter 3" <U5 'alpha' 'beta'
  + param2   "Parameter 2" int64 1 2 3
  + param1   "Parameter 1" <U1 'a' 'b' 'c'
Coordinates:
  * time   (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
  * x        (x) float64 [param1, ]
  * y        (y) float64 [param1, ]
    numbers  (time) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    precip   (time, x, y) float64 {min: -10.0, max: 10.0}

What's going on here:

  1. The Cases which comprise this Experiment are recorded, much as they are currently.
  2. We've set a hierarchy for the Cases to match a Group hierarchy. The only invariant is that Cases or Groups containing datasets whose coordinate values may differ (e.g., the 2 deg vs 1 deg grid) should appear first in the hierarchy... although they could also go last; I need to think through the implications of either choice.
  3. Every single dataset has the same set of Dimensions, but their coordinate values might differ. We indicate that in the Experiment's readout in two places: under "Dimensions" and again under the "Coordinates" values.
  4. To accommodate the fact that the underlying arrays may have inconsistent shapes, we do not print out preview values under "Data variables", but instead perhaps a summary statistic or a few key attributes.

Immediately, this gives a way to serialize to disk - this entire data structure translates directly to a NetCDF4 file with Groups. Groups can of course have metadata, which we would specify as part of their Case instantiation. Furthermore, the order/layout of Cases would become important when you initialize an Experiment, since it dictates the structure of the translated netCDF dataset.
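
A hedged sketch of that serialization, reusing the hypothetical load_case and parameter lists from the earlier sketch; it relies on the group= argument that xarray's Dataset.to_netcdf accepts with the netCDF4 backend:

import itertools

# Write each case's Dataset into a nested group path that mirrors the
# Case hierarchy; mode="a" appends later groups to the same file.
mode = "w"
for p1, p2, p3 in itertools.product(param1, param2, param3):
    ds = load_case(p1, p2, p3)
    ds.to_netcdf("experiment.nc", mode=mode,
                 group="/{}/{}/{}".format(p1, p2, p3))
    mode = "a"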

Under the hood, I don't think anything in Experiment has to change. We still use a multi-keyed dictionary to access the different Datasets that get loaded. But the idea of a "master dataset" becomes irrelevant - as long as all the different Datasets are loaded lazily via dask, selecting from the underlying dictionary and performing arithmetic just defers to xarray. The only question is how to apply calculations directly... it seems like we don't gain anything by looping over that dictionary, but we could always apply the calculations asynchronously in parallel via joblib and reserve placeholders using the futures we get back. I mean, xarray itself loops over the Variables in a Dataset... right?
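
As a hedged sketch of that pattern (joblib's Parallel and delayed are real; the container, keys, and total_precip calculation are illustrative, again reusing the hypothetical load_case from above):

import itertools
from joblib import Parallel, delayed

# The Experiment's internal store: a plain dict keyed by case tuples.
cases = {(p1, p2, p3): load_case(p1, p2, p3)
         for p1, p2, p3 in itertools.product(param1, param2, param3)}

def total_precip(ds):
    # An example per-case calculation.
    return ds["precip"].sum(dim=["x", "y"])

# Fan the per-case work out across workers, then re-key the results.
results = dict(zip(
    cases,
    Parallel(n_jobs=4)(delayed(total_precip)(ds) for ds in cases.values()),
))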

The other key issue is that this interface implicitly encodes when dimension mismatches will occur, since we are recording which Groups contain data whose dimensions do not agree. This means we can build in functionality to resolve these inconsistencies implicitly. Similar to the semantics for "aligning" in pandas, we could incorporate crude functionality to automatically re-grid data so that normal array operations work. For the case of bilinear resampling on simple rectangular coordinate systems, that's super easy. The framework would then strongly motivate the broader pangeo-data goal of universal re-gridding tools.
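
A hedged sketch of that crude alignment step, using xarray's Dataset.interp (multilinear interpolation on rectilinear grids; it requires scipy) together with the hypothetical load_case from above:

import numpy as np

# Pretend these two cases came off models with different grids.
coarse = load_case("a", 1, "alpha")
fine = load_case("a", 1, "beta").interp(
    x=np.linspace(0.0, 10.0, 10),  # stand-in for a finer native grid
    y=np.linspace(0.0, 10.0, 10),
)

# Bilinear-style resampling back onto the coarse grid, so that normal
# array arithmetic lines up.
diff = fine.interp(x=coarse["x"], y=coarse["y"]) - coarse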

These are just some early thoughts to consider.

spencerahill commented:

I like this, especially the last part:

Similar to the semantics for "aligning" in pandas, we could incorporate crude functionality to automatically re-grid data so that normal array operations work. For the case of bilinear resampling on simple rectangular coordinate systems, that's super easy. The framework would then strongly motivate the broader pangeo-data goal of universal re-gridding tools.


Having read this in its entirety and skimmed pydata/xarray#1092, an important question seems to be: should this go here or upstream within xarray? There seems to be some desire for an xarray-level solution. Supposing that were implemented, what would remain here?

darothen (Owner, Author) commented:

I think that if a group of us invest time in this angle, we should start beyond the confines of xarray. @benbovy's idea of "a data structure built on top of Dataset" is the way to go imho, with some sort of new syntactic sugar to access the different groups - or to act as a "puppeteer" that coordinates operations across groups.

In that thread, a few folks have been quite resistant to implementing this part of the netCDF data model (or an extension to it) internally within xarray. That doesn't mean it will always be the case :) But I think a proof of concept external to the core library would be a good place to start.
