Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hierarchical coordinates in DataTree #9063

Merged
merged 25 commits into from
Jul 3, 2024
Merged

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Jun 3, 2024

Proof of concept implementation of #9056

Still probably needs more tests, suggestions welcome!

It's also possible that I've implemented this in a rather inelegant fashion, I am still wrapping my head around the tree-node abstraction.

CC @TomNicholas, @keewis, @owenlittlejohns, @flamingbear, @eni-awowale

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Jun 4, 2024
@TomNicholas
Copy link
Member

This is cool to see! And it shows that simply calling align on the data in parent nodes can be neat way to perform the check for inheritability.

However this PR doesn't actually implement the suggestion in #9056, because it places a new strong restriction on the datatree data model. As well as restricting the range of valid DataTree objects, it would actually mean that there are netCDF files that DataTree objects cannot represent:

# using `main`
import xarray as xr
from xarray.core.datatree import DataTree

# create the unalignable tree from your tests
dt = DataTree.from_dict(
    {
        "/": xr.Dataset({"a": (("x",), [1, 2])}),
        "/b": xr.Dataset({"c": (("x",), [3])}),
    }
)

# write a valid netCDF file...
dt.to_netcdf('test.nc')

# switch branch to `datatree-hierarchy`
from xarray.backends.api import open_datatree

dt2 = open_datatree('test.nc')  # will raise an error from align
ValueError: inconsistent alignment between node and parent datasets:

The alternative suggestion I made in #9056 was instead to provide an API which performs inheritance like this, but doesn't apply that restriction to all trees at construction time.

I was working on my own PR in #9056 which does a similar alignment check but only when the user calls a specific .inherit method. The downside of my approach is that the alignment would be checked every time the user calls .inherit, which is really inefficient:

dt.inherit['x']  # calls `align` once
dt.inherit['x']  # calls `align` again!

I think what we actually want is somewhere between the two: we want to calculate all the inheritable coordinates at tree construction time like you've done, but not actually enforce that by raising errors from align. Instead we cache the list of which variables are inheritable (again like you've done), but keep them hidden until we merge them into the node once the user calls .inherit (like I've tried to do).

Comment on lines 107 to 110
def _check_alignment(
node_ds: Dataset,
parent_ds: Dataset | None,
children: Mapping[Hashable, DataTree],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _check_alignment(
node_ds: Dataset,
parent_ds: Dataset | None,
children: Mapping[Hashable, DataTree],
def _find_alignable_coordinates(
node_ds: Dataset,
parent_ds: Dataset | None,
) -> Coordinates:

Then we would store those coordinates on the node at construction time but not use them unless the user calls .inherit.

@TomNicholas
Copy link
Member

However this PR doesn't actually implement the suggestion in #9056, because it places a new strong restriction on the datatree data model.

I withdraw this complaint 😁 - see #9077 for discussion as to whether or not to impose this contraint.


@shoyer this suggestion might make no sense, but IIUC currently in this PR inherited variables & indexes are literally copied onto child nodes, relying on xarray's shallow copying features to not duplicate any underlying data. What I wonder is whether instead we could simply store the list of names of the inherited variables on the child node as ._inherited_var_names: set[str], and anytime we want access to them we simply look back up the tree until we find it. With a strictly-aligned model we can be sure that they definitely do exist further up the tree.

We could hide this detail by having properties like .variables and ._local_variables or similar, so that code which accesses these variables doesn't have to be changed. But then the model is a bit simpler as we're only storing string keys to represent the inherited variables (it's a little more like how coord_names is internally just a set[Hashable]).

The main tricky part I see with this approach is making sure the correct variable is altered when you try to write to an inherited variable, but I think we have to handle that carefully with either approach.

@@ -480,6 +489,7 @@ def to_dataset(self) -> Dataset:
--------
DataTree.ds
"""
# TODO: copy these container objects?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: We definitely need tests to verify the to_dataset() creates an new copy of the node's data. (I suspect that current it does not.)

@shoyer
Copy link
Member Author

shoyer commented Jun 20, 2024

@TomNicholas CI is passing now, so this should be in a reasonable state for you to take a look!

@shoyer
Copy link
Member Author

shoyer commented Jun 20, 2024

@shoyer this suggestion might make no sense, but IIUC currently in this PR inherited variables & indexes are literally copied onto child nodes, relying on xarray's shallow copying features to not duplicate any underlying data. What I wonder is whether instead we could simply store the list of names of the inherited variables on the child node as ._inherited_var_names: set[str], and anytime we want access to them we simply look back up the tree until we find it. With a strictly-aligned model we can be sure that they definitely do exist further up the tree.

Indeed, this is how I implemented it currently. Can you think of use-cases for distinguishing node vs inherited data? I guess this could be a nice option for .to_dataset() and for writing data to Zarr?

@TomNicholas
Copy link
Member

Oh right! I think we might need to be able to distinguish internally when doing map_over_subtree

@shoyer
Copy link
Member Author

shoyer commented Jun 20, 2024

Oh right! I think we might need to be able to distinguish internally when doing map_over_subtree

I agree, it seems pretty clear that we need support for pulling out "node local" versions of datasets, without inheritance.

For this PR, I'll implement a keyword option on DataTree.to_dataset to allow for only pulling out these local variables. This might be spelled local=True or inherited=False, and will ensure that we have the right information in the DataTree data model.

I'll also add a test to verify that to_zarr() works properly, without duplicating coordinates.

@shoyer
Copy link
Member Author

shoyer commented Jun 21, 2024

I implemented to_dataset(local=True). Would love feedback!

One messy thing is that it appears that assignment via .coords is broken on DataTree even at main. No error is raised when you try to set coordinates, but they don't actually get changed, e.g.,

In [1]: from xarray.core.datatree import DataTree
   ...: import xarray as xr
   ...:
   ...: tree = DataTree.from_dict(
   ...:     {
   ...:         "/": xr.Dataset({"e": (("x",), [1, 2])}, coords={"x": [2, 3]}),
   ...:         "/b": xr.Dataset({"f": (("y",), [3])}),
   ...:         "/b/c": xr.Dataset(),
   ...:         "/b/d": xr.Dataset({"g": 4}),
   ...:     }
   ...: )

In [2]: tree
Out[2]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 2)
│   Coordinates:
│     * x        (x) int64 16B 2 3
│   Data variables:
│       e        (x) int64 16B 1 2
└── Group: /b
    │   Dimensions:  (y: 1)
    │   Dimensions without coordinates: y
    │   Data variables:
    │       f        (y) int64 8B 3
    ├── Group: /b/c
    └── Group: /b/d
            Dimensions:  ()
            Data variables:
                g        int64 8B 4

In [3]: tree.coords['y'] = 5

In [4]: tree
Out[4]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 2)
│   Coordinates:
│     * x        (x) int64 16B 2 3
│   Data variables:
│       *empty*
└── Group: /b
    │   Dimensions:  (y: 1)
    │   Dimensions without coordinates: y
    │   Data variables:
    │       f        (y) int64 8B 3
    ├── Group: /b/c
    └── Group: /b/d
            Dimensions:  ()
            Data variables:
                g        int64 8B 4

So I'm not quite sure how to test modifying coordinates yet. Maybe we'll leave that for a follow-up PR.

@TomNicholas
Copy link
Member

One messy thing is that it appears that assignment via .coords is broken on DataTree even at main.

Ah yes - I forgot there's a TODO deep somewhere for that 😅 I left it because it for later because it seemed like it might require changing the DatasetCoordinates class that ds.coords returns. (Surely that class could just be replaced by the new xr.Coordinates class now??)

@TomNicholas
Copy link
Member

TomNicholas commented Jun 25, 2024

What I wonder is whether instead we could simply store the list of names of the inherited variables on the child node as ._inherited_var_names: set[str], and anytime we want access to them we simply look back up the tree until we find it. With a strictly-aligned model we can be sure that they definitely do exist further up the tree.

Indeed, this is how I implemented it currently.

Hmm, looking again I'm not sure this is true. You have the attributes defining the DataTree node as

class DataTree:
    _name: str | None
    _parent: DataTree | None
    _children: dict[str, DataTree]
    _cache: dict[str, Any]  # used by _CachedAccessor
    _local_variables: dict[Hashable, Variable]
    _local_coord_names: set[Hashable]
    _local_dims: dict[Hashable, int]
    _local_indexes: dict[Hashable, Index]
    _variables: dict[Hashable, Variable]
    _coord_names: set[Hashable]
    _dims: dict[Hashable, int]
    _indexes: dict[Hashable, Index]
    _attrs: dict[Hashable, Any] | None
    _encoding: dict[Hashable, Any] | None
    _close: Callable[[], None] | None

i.e. every local variable (and index) are explicitly stored on every node, and every inherited variable (and index) are also stored explicitly on every node. What I was suggesting was something more like

class DataTree
    _name: str | None
    _parent: DataTree | None
    _children: dict[str, DataTree]
    _cache: dict[str, Any]  # used by _CachedAccessor
    _local_variables: dict[Hashable, Variable]  # these 4 attributes we already have, they're  just renamed
    _local_coord_names: set[Hashable]
    _local_dims: dict[Hashable, int]
    _local_indexes: dict[Hashable, Index]
    _inherited_coord_names: set[Hashable]  # note this is now the only new attribute needed compared to current non-inheriting model
    _attrs: dict[Hashable, Any] | None
    _encoding: dict[Hashable, Any] | None
    _close: Callable[[], None] | None

    @property
    def _inherited_coords(self) -> dict[Hashable, Variable]:
        # we know these exist somewhere further up the tree
        return {self._get_inherited_coord_from_above(coord_name) for coord_name in self._inherited_coord_names}

    @property
    def variables(self) -> dict[Hashable, Variable]:
        # we know these are all alignable because we checked it at construction time
        all_variables = {**self._local_variables, **self._inherited_coords}
        return all_variables

    @property
    def _dims(self) -> dict[Hashable, int]:
        """All dims present on local and inherited variables"""
        return ...

where the set of all variables if defined as a property that looks up the tree to fetch the correct inherited variables (so they aren't stored twice).

Basically the point is that the only extra thing I need to know is the names of the variables to inherit, and because I know they definitely exist, I can just generate any combination of local/inherited/all variables dynamically. With this design there are more properties that get built on-the-fly, but there are fewer internal attributes that must be kept consistent with one another.

@shoyer
Copy link
Member Author

shoyer commented Jun 30, 2024

  1. Check alignment between local variables and all the variables on the parent, including inherited variables. Inheritance will effectively populate those empty nodes, solving the issue. However if there the alignment check raises it then becomes really hard to generate a diff showing where the offending variables were defined. (Also not totally sure that this isn't circular logic...)

This is the solution I have currently implemented. I think it's OK -- the inherited coordinate variables are listed on the repr of the parent node -- though indeed it's not as clear as it could be. Ideally we would say exactly at which level all the conflicting coordinates came from.

@shoyer
Copy link
Member Author

shoyer commented Jun 30, 2024

  1. Are we checking alignment against all variables (data and coordinate), but only propagating down coordinate variables?

We're checking alignment against inherited dimensions and indexes, which exactly what xarray.align() checks -- it doesn't worry about data variables at all, or coordinate variables that are not also stored in indexes.

  1. Does it even make sense for _check_alignment to raise a ValueError of the form
    "group {path1} is not aligned with its parent: see {repr(group1)} vs {repr(group2)}"? The alignment criterion is that the current node aligns with the parent node including all coordinates inherited onto that parent node. But those inherited coordinates could have come from multiple different ancestor groups.

Note that the repr for the parent includes inherited variables.

So a single alignment error message could actually have happened due to variables in two different ancestor groups. For example:

tree = DataTree.from_dict(
    {
        "/": xr.Dataset({"x": (("x",), [1, 2])}),
        "/b": xr.Dataset({"y": (("y",), [1])}),
        "/b/c": xr.Dataset({"d": (("x", "y"), [[3, 4]])}),  # d has size {'x': 1, 'y': 2}, so doesn't align with either parent group.
    }
)

Here "d" doesn't align with "x" or "y", but you can't really say that the alignment error is caused by group "/b" misaligning with group "b/c" or by group "/"misaligning with group"/b/c"`, it's caused by both! It would make much more sense to say that group "/b/c" does not align with its parent group including all variables inherited onto the parent.

The current error message (from evaluating your snippet on this branch) is:

ValueError: group '/b/c' is not aligned with its parent:
Group:
    Dimensions:  (x: 1, y: 2)
    Dimensions without coordinates: x, y
    Data variables:
        d        (x, y) int64 16B 3 4
Parent:
    Dimensions:  (x: 2, y: 1)
    Coordinates:
      * x        (x) int64 16B 1 2
      * y        (y) int64 8B 1

I think is pretty descriptive, though perhaps we could more clearly indicate that it includes all parent nodes, even just by switching "parent" to "parents".

Comment on lines +535 to +546
# Note: rebuild_dims=False can create technically invalid Dataset
# objects because it may not contain all dimensions on its direct
# member variables, e.g., consider:
# tree = DataTree.from_dict(
# {
# "/": xr.Dataset({"a": (("x",), [1, 2])}), # x has size 2
# "/b/c": xr.Dataset({"d": (("x",), [3])}), # x has size1
# }
# )
# However, they are fine for internal use cases, for align() or
# building a repr().
dims = dict(self._dims)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomNicholas This is where we make the invalid Dataset objects, though they are also propagated by _inherited_dataset() (if one of the inputs is invalid).

@shoyer
Copy link
Member Author

shoyer commented Jun 30, 2024

That said, there's no reason why the ChainMap objects need to be constructed when the DataTree objects are built. Instead, it probably makes more sense to do this dynamically, to avoid a separate redundant representation of the tree structure, e.g.,

@property
def _chained_coord_variables(self):
    maps = [self._coord_variables]
    maps.extend(p._coord_variables for p in parents)
    return ChainMap(*maps)

This is now implemented -- what do you think?

I think this captures of the essence of what you were aiming towards with #9187. Conveniently, it also simplifies the implementation of DataTree.

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough reply here @shoyer ! I think my main concerns are now addressed. 🙏

I'm going to approve, though I think there are a couple of little nits that could be neater. I would also love it if @owenlittlejohns / @flamingbear / @eni-awowale could at least take a look and say if they are following the general gist of what's being done here 😁

The Variable objects in _coord_variables will still be pointed to from multiple places

I see this as a pro, rather than a con. Using the exact same Variable objects means it is possible to modify metadata like attrs in-place .

👍 I think my main concern here was really about having two different representations of the tree at once, and that's dealt with now.

Our list of internal attributes on a DataTree node is still not just those on a Dataset plus the inherited coordinates

This is indeed a bit of a con, but in my mind the right fix is probably to adjust the Dataset data model to using dictionaries of data_variables and coord_variables, rather than the current solution of a dict of variables and a set of coord_names.

Interesting! I've split out #9203 to track this idea.

The ChainMap objects are just proxies that link them together.

That said, there's no reason why the ChainMap objects need to be constructed when the DataTree objects are built. Instead, it probably makes more sense to do this dynamically, to avoid a separate redundant representation of the tree structure

This is now implemented -- what do you think?

I think this captures of the essence of what you were aiming towards with #9187. Conveniently, it also simplifies the implementation of DataTree.

Thank you - I think this makes the code a lot easier to understand, both in terms of how the coordinates that go into constructing .ds are found, and in terms of the internal attributes of a single DataTree node being simpler.

We're checking alignment against inherited dimensions and indexes, which exactly what xarray.align() checks -- it doesn't worry about data variables at all, or coordinate variables that are not also stored in indexes.

Okay great. We should make this super clear in the documentation.

Note that the repr for the parent includes inherited variables.

[this error message when grandparents don't align] I think is pretty descriptive, though perhaps we could more clearly indicate that it includes all parent nodes, even just by switching "parent" to "parents".

👍 let's try to indicate that the group represent coordinates from the entire tree above.

[having intermediate datasets that are technically invalid is] a little fragile, but these objects never get exposed to users, so the risk is quite contained.

This still makes me a little worried - I would rather have structures that always obey their own contracts by design, but I agree its a problem that we can completely isolate from the users.

xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Show resolved Hide resolved
xarray/core/datatree.py Show resolved Hide resolved
xarray/core/datatree.py Show resolved Hide resolved
xarray/core/datatree.py Show resolved Hide resolved
xarray/core/treenode.py Show resolved Hide resolved
xarray/core/treenode.py Outdated Show resolved Hide resolved
@flamingbear
Copy link
Member

flamingbear commented Jul 1, 2024

say if they are following the general gist of what's being done here 😁

I have been following along, I may be a bit behind. I do like the properties for _coord_variables, _dims and _indexes, I think that captures what Tom was going for in a really clean way.

I have at least one dumb question I will ask inline. Usually I have to actually ask the question before I figure it out.

@shoyer
Copy link
Member Author

shoyer commented Jul 1, 2024

👍 let's try to indicate that the group represent coordinates from the entire tree above.

Here's what you currently see, which I think is pretty clear:

ValueError: group '/b' is not aligned with its parents:
Group:
    Dimensions:  (x: 1)
    Dimensions without coordinates: x
    Data variables:
        c        (x) float64 8B 3.0
From parents:
    Dimensions:  (x: 2)
    Dimensions without coordinates: x

[having intermediate datasets that are technically invalid is] a little fragile, but these objects never get exposed to users, so the risk is quite contained.

This still makes me a little worried - I would rather have structures that always obey their own contracts by design, but I agree its a problem that we can completely isolate from the users.

I don't love this either but I think it's an expedient choice for now. DatasetView is also problematic for similar reasons -- maybe even worse because it's exposed to users.

Copy link
Member

@flamingbear flamingbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've followed this though the changes and agree with the decisions, and that we should capture some of the newer tweaks.

Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still somewhat uncomfortable with the idea of pulling out inherited coordinates by default when converting a node to a dataset – I would probably have kept the inheritance to DataTree objects by default. I guess my concern is that

xr.open_dataset(..., group="/a")

results in a dataset that is different from either of

xr.open_datatree(...)["/a"].ds
xr.open_datatree(...)["/a"].to_dataset()

which may be pretty confusing at first. This can probably be avoided by changing the repr to mark inherited variables as such (not sure how easy that would be, though, especially since at least the conversion to Dataset would lose that information).

However, other than that this looks good to me, and I don't think this concern is urgent enough to block this PR (and in any case I do realize I'm somewhat late in voicing this concern).

xarray/core/datatree.py Show resolved Hide resolved
@shoyer
Copy link
Member Author

shoyer commented Jul 3, 2024

I guess my concern is that

xr.open_dataset(..., group="/a")

results in a dataset that is different from either of

xr.open_datatree(...)["/a"].ds
xr.open_datatree(...)["/a"].to_dataset()

We are going to have a similar issue when we add group to open_datatree, too: open_datatree(..., group="/a") will be different from open_datatree(...)["/a"].

One way to solve this would be to add an inherited argument to open_dataset and open_datatree, but I'm not sure anybody would actually use it. It's probably better to just to document the discrepancy for now.

In practice, setting the group argument to either open_dataset or open_datatree is something I would expect users to will only do for very large files, in performance critical cases. For these power users, I'm OK with a few more surprising edge cases

@TomNicholas
Copy link
Member

TomNicholas commented Jul 3, 2024

@keewis that's a good point, though I'm not sure what we can do about it other than document the subtlety clearly.

Note that the proposed

xr.open_as_dict_of_datasets(...)["/a"]

in #9137 would behave the way you are expecting, i.e. the same as

xr.open_dataset(..., group="/a")

@shoyer shoyer merged commit a86c3ff into pydata:main Jul 3, 2024
28 checks passed
@shoyer
Copy link
Member Author

shoyer commented Jul 3, 2024

Thanks everyone for your reviews and feedback, especially @TomNicholas ! This definitely improved the final design.

I'm merging this for now so we can start testing and iterating upon it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
Development

Successfully merging this pull request may close these issues.

4 participants