-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hierarchical coordinates in DataTree #9063
Conversation
This is cool to see! And it shows that simply calling However this PR doesn't actually implement the suggestion in #9056, because it places a new strong restriction on the datatree data model. As well as restricting the range of valid # using `main`
import xarray as xr
from xarray.core.datatree import DataTree
# create the unalignable tree from your tests
dt = DataTree.from_dict(
{
"/": xr.Dataset({"a": (("x",), [1, 2])}),
"/b": xr.Dataset({"c": (("x",), [3])}),
}
)
# write a valid netCDF file...
dt.to_netcdf('test.nc')
# switch branch to `datatree-hierarchy`
from xarray.backends.api import open_datatree
dt2 = open_datatree('test.nc') # will raise an error from align ValueError: inconsistent alignment between node and parent datasets: The alternative suggestion I made in #9056 was instead to provide an API which performs inheritance like this, but doesn't apply that restriction to all trees at construction time. I was working on my own PR in #9056 which does a similar alignment check but only when the user calls a specific dt.inherit['x'] # calls `align` once
dt.inherit['x'] # calls `align` again! I think what we actually want is somewhere between the two: we want to calculate all the inheritable coordinates at tree construction time like you've done, but not actually enforce that by raising errors from |
xarray/core/datatree.py
Outdated
def _check_alignment( | ||
node_ds: Dataset, | ||
parent_ds: Dataset | None, | ||
children: Mapping[Hashable, DataTree], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _check_alignment( | |
node_ds: Dataset, | |
parent_ds: Dataset | None, | |
children: Mapping[Hashable, DataTree], | |
def _find_alignable_coordinates( | |
node_ds: Dataset, | |
parent_ds: Dataset | None, | |
) -> Coordinates: |
Then we would store those coordinates on the node at construction time but not use them unless the user calls .inherit
.
I withdraw this complaint 😁 - see #9077 for discussion as to whether or not to impose this contraint. @shoyer this suggestion might make no sense, but IIUC currently in this PR inherited variables & indexes are literally copied onto child nodes, relying on xarray's shallow copying features to not duplicate any underlying data. What I wonder is whether instead we could simply store the list of names of the inherited variables on the child node as We could hide this detail by having properties like The main tricky part I see with this approach is making sure the correct variable is altered when you try to write to an inherited variable, but I think we have to handle that carefully with either approach. |
xarray/core/datatree.py
Outdated
@@ -480,6 +489,7 @@ def to_dataset(self) -> Dataset: | |||
-------- | |||
DataTree.ds | |||
""" | |||
# TODO: copy these container objects? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: We definitely need tests to verify the to_dataset()
creates an new copy of the node's data. (I suspect that current it does not.)
@TomNicholas CI is passing now, so this should be in a reasonable state for you to take a look! |
Indeed, this is how I implemented it currently. Can you think of use-cases for distinguishing node vs inherited data? I guess this could be a nice option for |
Oh right! I think we might need to be able to distinguish internally when doing map_over_subtree |
I agree, it seems pretty clear that we need support for pulling out "node local" versions of datasets, without inheritance. For this PR, I'll implement a keyword option on I'll also add a test to verify that |
I implemented One messy thing is that it appears that assignment via
So I'm not quite sure how to test modifying coordinates yet. Maybe we'll leave that for a follow-up PR. |
Ah yes - I forgot there's a TODO deep somewhere for that 😅 I left it because it for later because it seemed like it might require changing the |
Hmm, looking again I'm not sure this is true. You have the attributes defining the class DataTree:
_name: str | None
_parent: DataTree | None
_children: dict[str, DataTree]
_cache: dict[str, Any] # used by _CachedAccessor
_local_variables: dict[Hashable, Variable]
_local_coord_names: set[Hashable]
_local_dims: dict[Hashable, int]
_local_indexes: dict[Hashable, Index]
_variables: dict[Hashable, Variable]
_coord_names: set[Hashable]
_dims: dict[Hashable, int]
_indexes: dict[Hashable, Index]
_attrs: dict[Hashable, Any] | None
_encoding: dict[Hashable, Any] | None
_close: Callable[[], None] | None i.e. every local variable (and index) are explicitly stored on every node, and every inherited variable (and index) are also stored explicitly on every node. What I was suggesting was something more like class DataTree
_name: str | None
_parent: DataTree | None
_children: dict[str, DataTree]
_cache: dict[str, Any] # used by _CachedAccessor
_local_variables: dict[Hashable, Variable] # these 4 attributes we already have, they're just renamed
_local_coord_names: set[Hashable]
_local_dims: dict[Hashable, int]
_local_indexes: dict[Hashable, Index]
_inherited_coord_names: set[Hashable] # note this is now the only new attribute needed compared to current non-inheriting model
_attrs: dict[Hashable, Any] | None
_encoding: dict[Hashable, Any] | None
_close: Callable[[], None] | None
@property
def _inherited_coords(self) -> dict[Hashable, Variable]:
# we know these exist somewhere further up the tree
return {self._get_inherited_coord_from_above(coord_name) for coord_name in self._inherited_coord_names}
@property
def variables(self) -> dict[Hashable, Variable]:
# we know these are all alignable because we checked it at construction time
all_variables = {**self._local_variables, **self._inherited_coords}
return all_variables
@property
def _dims(self) -> dict[Hashable, int]:
"""All dims present on local and inherited variables"""
return ... where the set of all variables if defined as a property that looks up the tree to fetch the correct inherited variables (so they aren't stored twice). Basically the point is that the only extra thing I need to know is the names of the variables to inherit, and because I know they definitely exist, I can just generate any combination of local/inherited/all variables dynamically. With this design there are more properties that get built on-the-fly, but there are fewer internal attributes that must be kept consistent with one another. |
This is the solution I have currently implemented. I think it's OK -- the inherited coordinate variables are listed on the repr of the parent node -- though indeed it's not as clear as it could be. Ideally we would say exactly at which level all the conflicting coordinates came from. |
We're checking alignment against inherited dimensions and indexes, which exactly what
Note that the repr for the parent includes inherited variables.
The current error message (from evaluating your snippet on this branch) is:
I think is pretty descriptive, though perhaps we could more clearly indicate that it includes all parent nodes, even just by switching "parent" to "parents". |
# Note: rebuild_dims=False can create technically invalid Dataset | ||
# objects because it may not contain all dimensions on its direct | ||
# member variables, e.g., consider: | ||
# tree = DataTree.from_dict( | ||
# { | ||
# "/": xr.Dataset({"a": (("x",), [1, 2])}), # x has size 2 | ||
# "/b/c": xr.Dataset({"d": (("x",), [3])}), # x has size1 | ||
# } | ||
# ) | ||
# However, they are fine for internal use cases, for align() or | ||
# building a repr(). | ||
dims = dict(self._dims) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomNicholas This is where we make the invalid Dataset objects, though they are also propagated by _inherited_dataset()
(if one of the inputs is invalid).
This is now implemented -- what do you think? I think this captures of the essence of what you were aiming towards with #9187. Conveniently, it also simplifies the implementation of DataTree. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thorough reply here @shoyer ! I think my main concerns are now addressed. 🙏
I'm going to approve, though I think there are a couple of little nits that could be neater. I would also love it if @owenlittlejohns / @flamingbear / @eni-awowale could at least take a look and say if they are following the general gist of what's being done here 😁
The Variable objects in _coord_variables will still be pointed to from multiple places
I see this as a pro, rather than a con. Using the exact same Variable objects means it is possible to modify metadata like attrs in-place .
👍 I think my main concern here was really about having two different representations of the tree at once, and that's dealt with now.
Our list of internal attributes on a
DataTree
node is still not just those on aDataset
plus the inherited coordinatesThis is indeed a bit of a con, but in my mind the right fix is probably to adjust the Dataset data model to using dictionaries of
data_variables
andcoord_variables
, rather than the current solution of a dict ofvariables
and a set ofcoord_names
.
Interesting! I've split out #9203 to track this idea.
The ChainMap objects are just proxies that link them together.
That said, there's no reason why the ChainMap objects need to be constructed when the DataTree objects are built. Instead, it probably makes more sense to do this dynamically, to avoid a separate redundant representation of the tree structure
This is now implemented -- what do you think?
I think this captures of the essence of what you were aiming towards with #9187. Conveniently, it also simplifies the implementation of DataTree.
Thank you - I think this makes the code a lot easier to understand, both in terms of how the coordinates that go into constructing .ds
are found, and in terms of the internal attributes of a single DataTree
node being simpler.
We're checking alignment against inherited dimensions and indexes, which exactly what
xarray.align()
checks -- it doesn't worry about data variables at all, or coordinate variables that are not also stored in indexes.
Okay great. We should make this super clear in the documentation.
Note that the repr for the parent includes inherited variables.
[this error message when grandparents don't align] I think is pretty descriptive, though perhaps we could more clearly indicate that it includes all parent nodes, even just by switching "parent" to "parents".
👍 let's try to indicate that the group represent coordinates from the entire tree above.
[having intermediate datasets that are technically invalid is] a little fragile, but these objects never get exposed to users, so the risk is quite contained.
This still makes me a little worried - I would rather have structures that always obey their own contracts by design, but I agree its a problem that we can completely isolate from the users.
I have been following along, I may be a bit behind. I do like the properties for _coord_variables, _dims and _indexes, I think that captures what Tom was going for in a really clean way.
|
Here's what you currently see, which I think is pretty clear:
I don't love this either but I think it's an expedient choice for now. DatasetView is also problematic for similar reasons -- maybe even worse because it's exposed to users. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've followed this though the changes and agree with the decisions, and that we should capture some of the newer tweaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still somewhat uncomfortable with the idea of pulling out inherited coordinates by default when converting a node to a dataset – I would probably have kept the inheritance to DataTree
objects by default. I guess my concern is that
xr.open_dataset(..., group="/a")
results in a dataset that is different from either of
xr.open_datatree(...)["/a"].ds
xr.open_datatree(...)["/a"].to_dataset()
which may be pretty confusing at first. This can probably be avoided by changing the repr
to mark inherited variables as such (not sure how easy that would be, though, especially since at least the conversion to Dataset
would lose that information).
However, other than that this looks good to me, and I don't think this concern is urgent enough to block this PR (and in any case I do realize I'm somewhat late in voicing this concern).
We are going to have a similar issue when we add One way to solve this would be to add an In practice, setting the |
Thanks everyone for your reviews and feedback, especially @TomNicholas ! This definitely improved the final design. I'm merging this for now so we can start testing and iterating upon it. |
Proof of concept implementation of #9056
Still probably needs more tests, suggestions welcome!
It's also possible that I've implemented this in a rather inelegant fashion, I am still wrapping my head around the tree-node abstraction.
CC @TomNicholas, @keewis, @owenlittlejohns, @flamingbear, @eni-awowale