Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve dimension coordinates from parents #297

Closed
dehorsley opened this issue Dec 19, 2023 · 9 comments
Closed

Resolve dimension coordinates from parents #297

dehorsley opened this issue Dec 19, 2023 · 9 comments

Comments

@dehorsley
Copy link

Basically this issue but for datatree pydata/xarray#1982

We have some hourly and weekly data in different groups, and shared coordinates in the root. For example, I have a netCDF with the following structure:

netcdf test {
dimensions:
        storage = 11 ;
        sample = 780 ;
        case = 1 ;
        generator = 31 ;
        line = 1 ;
        region = 1 ;
        price_scenario = 78 ;
variables:
        string storage(storage) ;
        float sample(sample) ;
        string case(case) ;
        string generator(generator) ;
        string line(line) ;
        string region(region) ;
        int price_scenario(price_scenario) ;

group: weekly {
  dimensions:
        time = 549 ;
  variables:
        int time(time) ;
                time:units = "Seconds since 1970" ;
        float generation(time, generator, sample, case) ;
                generation:units = "GWh" ;
        float storage_volume(time, storage, sample, case) ;
                storage_volume:units = "CMD" ;
        float line_flow(time, line, sample, case) ;
                line_flow:units = "GWh" ;
        float line_flow_back(time, line, sample, case) ;
                line_flow_back:units = "GWh" ;
        float generator_offer(time, generator, sample, case) ;
                generator_offer:units = "$/MWh" ;
  } // group weekly

group: hourly {
  dimensions:
        time = 92232 ;
  variables:
        int time(time) ;
                time:units = "Seconds since 1970" ;
        float region_price(time, region, price_scenario, case) ;
                region_price:units = "$/MWh" ;
  } // group hourly
}

I was hoping datatree would add the shared coordinates variables when I access one of the child groups, but instead if I access one of the Datasets, I have a whole bunch of dimensions without coordinates. Eg:

image

Is this something you think datatree should/will address? Thanks!

@keewis
Copy link
Contributor

keewis commented Jan 3, 2024

the trouble here is that there are many datasets where sharing coordinates is not desired or possible, or it mutates the dataset in a way that makes it invalid. So I don't think we should do this by default, and doing so on variable access just makes the code (and behavior) way too complicated.

However, what I could imagine is adding a method that you can explicitly call to (shallow) copy the variables from the parent nodes to the children.

@dehorsley
Copy link
Author

If I understand correctly, this is how the CF Conventions scoping is supposed to work. So I'd imagine there is a great number of datasets out there where this is how you'd want to resolve coordinates, and certainly something worth making easy for users. I'd personally love to see CF Conventions by default, and optionally disabled for using non-CF datasets.

@marcel-goldschen-ohm
Copy link

marcel-goldschen-ohm commented Feb 7, 2024

I agree with @TomNicholas that this should not be default behavior, but would be a VERY useful option. Here is some code for a tree of xarray Datasets that I'm thinking of abandoning in favor of DataTree, but it does provide some functions for inheriting data or coords from ancestors. See the inherited_data and inherited_coord methods. Might be a good start for what you want.

@TomNicholas
Copy link
Collaborator

TomNicholas commented Feb 27, 2024

So the reason I did not implement this "inheritance" of information from parent groups is that (a) it was considerably simpler not to and (b) I thought it could lead to inconsistent states.

However, today I had a long discussion with @shoyer about this (plus @flamingbear, @etienneschalk and @eni-awowale), and now I think this might be possible.

My original concern was what happens if I inherit a variable with a dimension with the same name but different length to what's already present in this node? i.e.

- node A
| dims: {time: 10}
| vars: [time]
|--  node B
     dims: {time: 20}
     vars: [foo, bar, ...]

where the variables foo, bar have a time dimension of length 20. In this situation calling B.ds['time'] would look upwards to find the length 10 time variable to use, but then it wouldn't be consistent with the time dimensions of foo and bar. It would be attempting to create a group that violated xarray's requirements to be a valid Dataset object (i.e. that variables can't have different lengths along dimensions of the same name).

This difficulty is why I did not implement this feature. But looking again today at the CF conventions for groups, it says

If any dimension of an out-of-group variable has the same name as a dimension of the referring variable, the two must be the same dimension (i.e. they must have the same netCDF dimension ID).

In the xarray model without explicit dimension objects, this translates to "the two dimensions must be the same length". Interpreting this "must" to mean "if this is not the case don't try to inherit this variable", I think this caveat resolves the concern I had.


We could imagine changing the behaviour of DataTree.__getitem__(key) to:

  1. If key exists on this node, return the variable under that name,
  2. If not, look upwards through parent groups until we find a variable of that name,
  3. If found, check that dimensions are compatible with the dimensions of variables on the calling group,
  4. If not found or the only found variables have incompatible-length dimensions, raise KeyError.

(This is essentially the "search by proximity" described in the CF conventions, just without the deprecated "lateral search" stuff.)

The questions now are:

  • How to experiment with implementing this (Stephan suggested using collections.Chainmap like h5netcdf does)
  • Would adding this feature exclude any users or types of files?
  • Can this lead to really weird or counter-intuitive behaviour?
  • How to display the different between an inherited and non-inherited variable in the reprs?
  • Does it conflict with other features of datatree, such as map_over_subtree?
  • Can this be safely serialized back to zarr?

@TomNicholas
Copy link
Collaborator

TomNicholas commented Feb 27, 2024

A practical API question here is whether or not inherited variables are considered to be "part of" the child DataTree node. For example, given the following (which unlike the example in the previous comment is valid and safe because the dimension lengths match):

dt
- node A
| dims: {time: 10}
| vars: [time]
|--  node B
     dims: {time: 10}
     vars: [foo, bar, ...]

should dt['/B'].variables return a collection that includes time or not? Internally dt['/B']._variables probably should not, else updating the variables on the node B would become messy. dt['/B'].ds.variables should definitely contain time, but it's not clear if dt['/B'] should display "local" or "global" information. This is conceptually somewhat similar to the issues in #240.

@shoyer
Copy link

shoyer commented Feb 27, 2024

dt['/B'].ds.variables should definitely contain time, but it's not clear if dt['/B'] should display "local" or "global" information. This is conceptually somewhat similar to the issues in #240.

I think dt['B'] and dt['/B'] should always resolve to the same object, with "global" information. Otherwise the search mechanism starts to impact the API itself.

We could easily have another interface for surfacing only the "local" dataset (e.g., dt['B'].node_dataset).

Incidentally, this would be a reasonable way to organize the contents of a node:

  • A private _dataset attribute on each node with an xarray.Dataset containing the underlying data.
  • A public .local_dataset attribute with a local DatasetView, constructed from the data in _dataset
  • A public .dataset attribute with a global DatasetView, constructed from data at different levels using ChainMap

@TomNicholas
Copy link
Collaborator

TomNicholas commented Feb 27, 2024

I think dt['B'] and dt['/B'] should always resolve to the same object, with "global" information. Otherwise the search mechanism starts to impact the API itself.

dt['/B'] and dt['B'] are the same (if you're at the root), the question is if dt['B'] and dt['B'].ds should display the same variables, or if only the latter should display inherited variables.

Incidentally, this would be a reasonable way to organize the contents of a node:

  • A public .local_dataset attribute with a local DatasetView, constructed from the data in _dataset

I quite like this idea. Providing a .local_dataset might also ease the datatree.DataTree->xarray.DataTree transition for existing users because then they can just change .ds->.local_dataset anywhere that they are worried and it will definitely work as before.

  • A private _dataset attribute on each node with an xarray.Dataset containing the underlying data.

I think I can test the inheritance idea orthogonally from this suggested refactor.

@TomNicholas
Copy link
Collaborator

TomNicholas commented Feb 28, 2024

Does it conflict with other features of datatree, such as map_over_subtree?

If a variable at the root is inherited down through the whole tree, it's now effectively present everywhere. If I then map a function/method over the whole tree with map_over_subtree, it's going to apply that function again and again to the same inherited variable for every single node in the tree. This seems silly... At the very least we wouldn't want to duplicate that work. This might be an argument for having map_over_subtree only map over leaves of the tree (because now the leaves will be populated by the inherited variables). That would be another big change though.

EDIT: Of course inheriting variables downwards + mapping only over the leaves does not add up to give equivalent behaviour to not inheriting + mapping over the whole tree...

@keewis
Copy link
Contributor

keewis commented Aug 13, 2024

closing this in favor of pydata/xarray#9056: we've been discussing this for a while in the datatree meetings, and the version in xarray actually supports inheritance (implemented in pydata/xarray#9063).

@keewis keewis closed this as completed Aug 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants