DataTree access to variables in parent groups #9056

Closed
TomNicholas opened this issue May 30, 2024 · 2 comments
@TomNicholas
Contributor

TomNicholas commented May 30, 2024

Motivation

Accessing variables from parent groups in a tree would be useful. This has come up before in #1982 and xarray-contrib/datatree#297. Here I'm going to summarize some discussion from recent datatree meetings.

A use case is to have common coordinate variables between multiple sub-groups, for example this multi-resolution datatree has a time coordinate that conceptually is common to two groups:

DataTree('None', parent=None)
│   Dimensions:  (time: 4)
│   Coordinates:
│     * time     (time) int64 32B 0 1 2 3
│   Data variables:
│       *empty*
├── DataTree('low')
│       Dimensions:  (x: 3, time: 4)
│       Coordinates:
│         * x        (x) float64 24B 1.0 5.0 9.0
│       Dimensions without coordinates: time
│       Data variables:
│           a        (x, time) int64 96B 0 1 2 3 4 5 6 7 8 9 10 11
└── DataTree('high')
        Dimensions:  (x: 9, time: 4)
        Coordinates:
          * x        (x) float64 72B 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
        Dimensions without coordinates: time
        Data variables:
            a        (x, time) int64 288B 0 1 2 3 4 5 6 7 8 ... 28 29 30 31 32 33 34 35

It would be useful to be able to access the time coordinate variable from either child group, i.e. dt['/high'].time.

Indeed, the CF conventions explicitly describe this type of behaviour, in terms of searching for variables outside of the current group:

Search by proximity

A variable or dimension specified with no path (for example, lat) refers to the variable or dimension of that name, if there is one, in the referring group. If not, the ancestors of the referring group are searched for it, starting from the direct ancestor and proceeding toward the root group, until it is found.
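To make the lookup rule concrete, here is a minimal pure-Python sketch of search by proximity. The `Group` class and `find_by_proximity` function are hypothetical names made up for illustration; they are not part of xarray or any netCDF library.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in for a group in a hierarchy (not an xarray class)."""

    name: str
    variables: dict = field(default_factory=dict)
    parent: Group | None = None


def find_by_proximity(group: Group, name: str):
    """Return (owning_group, value) for the nearest definition of `name`.

    The referring group is checked first, then each ancestor in turn,
    moving toward the root until the name is found.
    """
    node = group
    while node is not None:
        if name in node.variables:
            return node, node.variables[name]
        node = node.parent  # step one level toward the root
    raise KeyError(name)


root = Group("/", {"time": [0, 1, 2, 3]})
high = Group("high", {"x": [1.0, 2.0, 3.0]}, parent=root)

owner, time = find_by_proximity(high, "time")  # found on the root group
```

Here `x` resolves in `high` itself, while `time` is only found after stepping up to the root.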

Problem

We could imagine changing the interface of DataTree to allow users to access any compatible variables on parent groups, where compatible means alignable.

There are three issues with this:

  1. Not all users will want to inherit all such variables,
  2. It would be a breaking change compared to the behaviour of the original datatree package,
  3. Mapping operations (e.g. .mean()) over multiple nodes becomes really confusing, because copies of the same variable would effectively be present in multiple nodes.

Proposal

Let me make a concrete feature proposal for discussion, which has some specific features:

  1. Keep .ds, .__getitem__ etc. on DataTree as-is. This means no breaking of backwards compatibility. This also means that we don't have to wait to implement all the details of this before releasing datatree in xarray main.

  2. A clear definition of "compatible variables" for inheritance. These are alignable variables that exist on a parent (or grandparent, etc.). Q: Should these be just coordinate variables? Or all variables?

  3. Add additional API which allows access to inherited variables, via a new .inherit accessor on DataTree objects. (The name is not great, please feel free to suggest alternatives.)

    • Whilst dt[...] will never give access to inherited vars, dt.inherit[...] would allow __getitem__ access to inherited vars
    • dt.inherit.ds would return a DatasetView of that node with extra inherited variables in it
    • dt.inherit.to_dataset() -> xr.Dataset containing inherited vars
    • Explicit API for propagating / shallow-copying all variables to child nodes?
      • dt.inherit()? -> DataTree
  4. Don't change map_over_subtree (again for backwards compatibility)

    • map_over_inherited_subtree isolates the concept of mapping over a tree with inherited variables
      • issues: e.g. map over and see the same variable multiple times (in its "local" group and in all its child groups)

This will be a new feature, to be done in a separate release (i.e. no blocker right now)
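The "same variable seen multiple times" issue can be illustrated with a toy tree. This is only a sketch: the `tree` dict and the `visible_names` and `times_seen` helpers are hypothetical and do not correspond to map_over_subtree or any real datatree API.

```python
from collections import Counter

# Hypothetical tree: path -> (parent path, names of variables stored locally)
tree = {
    "/": (None, ["time"]),
    "/low": ("/", ["x", "a"]),
    "/high": ("/", ["x", "a"]),
}


def visible_names(path, inherit):
    """Names a mapped function would see at `path`, with or without inheritance."""
    parent, local = tree[path]
    names = list(local)
    while inherit and parent is not None:
        grandparent, parent_names = tree[parent]
        names += [n for n in parent_names if n not in names]
        parent = grandparent
    return names


def times_seen(name, inherit):
    """Count how many nodes expose `name` when mapping over the whole tree."""
    counts = Counter(n for path in tree for n in visible_names(path, inherit))
    return counts[name]


local_only = times_seen("time", inherit=False)  # only the root exposes it
inherited = times_seen("time", inherit=True)    # the root and both children expose it
```

With inheritance, a reduction such as a mean would be applied to `time` once per node that can see it, which is the confusion described above.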

Implementation

dt.inherit returns an InheritedNode, which at construction time creates and caches a mapping of all inherited variables (._inherited_variables). This then acts like a normal DataTree node except that it consults the inherited variables instead of the normal list of variables.

Creating the list of inherited variables is done by walking up the tree from the current node, examining new variables as they are encountered.
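A rough pure-Python sketch of that walk is below. `Var`, `Node` and `collect_inherited` are hypothetical stand-ins rather than xarray classes, and "compatible" is simplified here to "shared dimensions have the same size", which is weaker than real alignment (that would also compare index values). The sketch also assumes the nearest definition of a name shadows any farther one.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Var:
    dims: tuple
    data: list


@dataclass
class Node:
    variables: dict = field(default_factory=dict)
    sizes: dict = field(default_factory=dict)  # dimension name -> length
    parent: Node | None = None


def _compatible(node: Node, ancestor: Node, var: Var) -> bool:
    # Simplified check: any dimension the node also has must agree in size.
    return all(node.sizes.get(d, ancestor.sizes[d]) == ancestor.sizes[d] for d in var.dims)


def collect_inherited(node: Node) -> dict:
    """Walk from the parent toward the root, keeping the nearest definition of each name."""
    inherited = {}
    seen = set(node.variables)  # local names shadow anything further up
    ancestor = node.parent
    while ancestor is not None:
        for name, var in ancestor.variables.items():
            if name in seen:
                continue
            seen.add(name)
            if _compatible(node, ancestor, var):
                inherited[name] = var
        ancestor = ancestor.parent
    return inherited


class InheritedNode:
    """View of a node that also resolves inherited variables (cached at construction)."""

    def __init__(self, node: Node):
        self._node = node
        self._inherited_variables = collect_inherited(node)

    def __getitem__(self, name):
        if name in self._node.variables:
            return self._node.variables[name]
        return self._inherited_variables[name]


root = Node({"time": Var(("time",), [0, 1, 2, 3])}, {"time": 4})
high = Node({"a": Var(("x", "time"), list(range(36)))}, {"x": 9, "time": 4}, parent=root)
odd = Node({}, {"time": 5}, parent=root)  # time has a different length here

view = InheritedNode(high)
```

In this example `view["time"]` resolves to the root's variable, while `odd` inherits nothing because its `time` dimension has a different length.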

Q: Does this design handle coordinate names?

EDIT: Actually there's an even simpler idea: dt.inherit -> DataTree which has a shallow copy of all compatible variables inherited onto that node. Then .ds, .__getitem__ etc. will automatically behave as expected, as you will just have a new DataTree object with more valid keys.
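The simpler idea could look roughly like the following sketch. `Node` and `inherit` are hypothetical stand-ins, not the DataTree API, and the compatibility test is passed in as an assumed `is_compatible` callable that accepts everything by default.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    variables: dict = field(default_factory=dict)
    parent: Node | None = None


def inherit(node: Node, is_compatible=lambda node, name, var: True) -> Node:
    """Return a new node that also holds references to compatible ancestor variables.

    The copy is shallow: variable objects are shared, not duplicated,
    and the original node is left untouched.
    """
    merged = {}
    ancestor = node.parent
    while ancestor is not None:
        for name, var in ancestor.variables.items():
            # The nearest ancestor wins if a name is defined more than once.
            if name not in merged and is_compatible(node, name, var):
                merged[name] = var
        ancestor = ancestor.parent
    merged.update(node.variables)  # local variables take precedence
    return Node(variables=merged, parent=node.parent)


root = Node({"time": [0, 1, 2, 3]})
high = Node({"a": [10, 20]}, parent=root)

full = inherit(high)
```

Because the result is an ordinary node with more keys, normal item access on `full` works without any special accessor.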

Describe alternatives you've considered

  1. Not add any support for inheriting variables

That's what we currently have, and with this proposal we could eventually remove it if it turned out no-one liked it.

  2. Integrate support into the existing API (i.e. change dt.__getitem__ to access inherited variables)

It's not possible to do this without breaking changes. It's also not clear that there is a general one-size-fits-all answer to when variables should or shouldn't be inherited. This proposal provides both behaviours.

  3. Allow users to change behaviour of objects

Some kind of switch (on the specific object instances, globally, or with a context manager) could be used to switch between the two behaviours. But this seems extremely error-prone, and means that user code becomes ambiguous without knowing the state of the switch.

cc @shoyer @keewis @flamingbear @owenlittlejohns @eni-awowale

also @alexamici @benbovy I would love to hear your thoughts too.

@shoyer
Member

shoyer commented May 30, 2024

Let's consider a slight variation on the inheritance proposal: instead of inheriting all variables, I propose that Xarray should only inherit coordinates (and any associated indexes).

This is slightly inconsistent with CF, but I think better captures the spirit of inherited dimensions in the netCDF data model, and I don't think there are use cases for inheriting data variables (these don't get automatically associated with DataArray objects as coordinates, so they might as well be accessed directly).
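As a sketch of what this variation means in practice (hypothetical `Node` and `inherited_coords` names, not the implementation that later landed in xarray), only the coordinates of ancestors are collected and data variables are left where they are:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    coords: dict = field(default_factory=dict)
    data_vars: dict = field(default_factory=dict)
    parent: Node | None = None


def inherited_coords(node: Node) -> dict:
    """Collect coordinates from ancestors; data variables are never inherited."""
    found = {}
    ancestor = node.parent
    while ancestor is not None:
        for name, coord in ancestor.coords.items():
            # Local coordinates and nearer ancestors take precedence.
            if name not in node.coords and name not in found:
                found[name] = coord
        ancestor = ancestor.parent
    return found


root = Node(coords={"time": [0, 1, 2, 3]}, data_vars={"mask": [1, 0, 1, 1]})
child = Node(coords={"x": [1.0, 5.0, 9.0]}, data_vars={"a": list(range(12))}, parent=root)

coords = inherited_coords(child)
```

The child sees `time` from the root, but `mask` would still have to be accessed on the root directly.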

@shoyer
Member

shoyer commented Sep 8, 2024

We implemented coordinate inheritance in #9063.

See #9077 for in-depth discussion.

@shoyer shoyer closed this as completed Sep 8, 2024

2 participants