-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-implement map_over_datasets using group_subtrees #9636
Conversation
This should be used for implementing DataTree arithmetic inside map_over_datasets, so the result does not depend on the order in which child nodes are defined. I have also added a minimal implementation of breadth-first-search with an explicit queue the current recursion based solution in xarray.core.iterators (which has been removed). The new implementation is also slightly faster in my microbenchmark: In [1]: import xarray as xr In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)}) In [3]: %timeit _ = list(tree.subtree) # on main 87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) # with this branch 55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
The main changes: - It is implemented using zip_subtrees, which means it should properly handle DataTrees where the nodes are defined in a different order. - For simplicity, I removed handling of `**kwargs`, in order to preserve some flexibility for adding keyword arugments. - I removed automatic skipping of empty nodes, because there are almost assuredly cases where that would make sense. This could be restored with a option keyword arugment.
Here are @TomNicholas's comments from my earlier PR: shoyer#2 (review)
|
This has gotten a bit bigger. I've also:
|
Anyways this is ready for review! |
xarray/core/treenode.py
Outdated
path, node = queue.popleft() | ||
yield path, node | ||
queue.extend( | ||
(os.path.join(path, name), child) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet this is what's causing the test failures on Windows. Paths between nodes cannot be treated the same as filesystem paths! That's why I made NodePath
, so you could just use that here to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason to hide path manipulation logic behind an abstraction is so that we can generalize it later to handle / reject Hashables
. #8836 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You got there just as I was reviewing! So this comment should be fixed by 1f07b63
A very useful pattern is to iterate over :py:class:`~xarray.DataTree.subtree_with_keys` | ||
to manipulate nodes however you wish, then rebuild a new tree using | ||
:py:meth:`xarray.DataTree.from_dict()`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we have a new property:
assert DataTree.from_dict(dt.subtree_with_keys()) == dt
Is there a good place we could document these properties?
xr.DataTree.from_dict(non_empty_nodes) | ||
|
||
You can see this tree is similar to the ``dt`` object above, except that it is missing the empty nodes ``a/c`` and ``a/c/d``. | ||
|
||
(If you want to keep the name of the root node, you will need to add the ``name`` kwarg to :py:class:`~xarray.DataTree.from_dict`, i.e. ``DataTree.from_dict(non_empty_nodes, name=dt.root.name)``.) | ||
(If you want to keep the name of the root node, you will need to add the ``name`` kwarg to :py:class:`~xarray.DataTree.from_dict`, i.e. ``DataTree.from_dict(non_empty_nodes, name=dt.name)``.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I have a large number of small comments. The most important one is that there is an unexpected error in one of the docs examples.
We should discuss if there is much more to do before releasing now.
xarray/core/datatree.py
Outdated
full_file_like_paths_to_all_nodes_in_subtree = { | ||
node.path[1:]: node for node in self.subtree | ||
paths_to_all_nodes_in_subtree = { | ||
path: node for path, node in self.subtree_with_keys if path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this change. Why would bool(path)
ever not be True
? Even the root will have a non-empty string '/'
.
Also the [1:]
was there to remove the preceding /
, so that accessing via getitem would use relative paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need [1:]
because subtree_with_keys
iterates over relative paths.
The relative path path=""
is used for the root node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need [1:] because subtree_with_keys iterates over relative paths.
👍
The relative path
path=""
is used for the root node.
But we're about to change that. (#9636 (comment))
xarray/core/treenode.py
Outdated
""" | ||
if not trees: | ||
raise TypeError("must pass at least one tree object") | ||
|
||
# https://en.wikipedia.org/wiki/Breadth-first_search#Pseudocode | ||
queue = collections.deque([trees]) | ||
queue = collections.deque([("", trees)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this answers my question about why bool(path)
would ever be False
. But why are we doing this exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could equivalently return ""
, "."
, "/"
or ./
for the top-level node. Or we could make None
a sentinel value recognized by from_dict
.
Given that the others are relative paths, I would lean towards the strings ""
or "."
. I guess "."
might make more sense since it's the canonical path returned by NodePath
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could equivalently return "", ".", "/" or ./ for the top-level node
We shouldn't return "/"
, because to me that would indicate an absolute, not relative path. The current code in this branch seems to handle that non-root situation just fine:
In [21]: dt1 = xr.DataTree.from_dict({"root/c/a": xr.Dataset({"x": 1}), "root/c/b": xr.Dataset({"x": 2})})
In [22]: dt2 = xr.DataTree.from_dict(
...: {"root/c/a": xr.Dataset({"x": 10}), "root/c/b": xr.Dataset({"x": 20})}
...: )
In [23]: result = {}
In [24]: for path, (node1, node2) in xr.group_subtrees(dt1['root'], dt2['root']):
...: print(path)
...: print(node1.dataset)
...: result[path] = node1.dataset + node2.dataset
...:
.
<xarray.DatasetView> Size: 0B
Dimensions: ()
Data variables:
*empty*
c
<xarray.DatasetView> Size: 0B
Dimensions: ()
Data variables:
*empty*
c/a
<xarray.DatasetView> Size: 8B
Dimensions: ()
Data variables:
x int64 8B 1
c/b
<xarray.DatasetView> Size: 8B
Dimensions: ()
Data variables:
x int64 8B 2
That looks good to me - the returned paths are clearly relative to dt1['root']
, not the actual root (dt
).
I guess "." might make more sense since it's the canonical path returned by NodePath?
I think it should be .
, because its generally nice and intuitive to match the behaviour of pathlib
(xref #9448):
In [1]: from pathlib import PurePath
In [2]: PurePath('a/b/').relative_to('a/b/')
Out[2]: PurePosixPath('.')
In [3]: PurePath('a/b/').relative_to('a/')
Out[3]: PurePosixPath('b')
In [4]: PurePath('a/').relative_to('a/b', walk_up=True) # requires python 3.12
Out[4]: PurePosixPath('..')
In [5]: PurePath('a/b/').relative_to('.')
Out[5]: PurePosixPath('a/b')
NodePath
just inherits that behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can also see an argument for './'
, because then direct string concatenation with a/b
gives a valid relative path.
thanks for the careful review! I think I resolved most of the issues, please take another look |
I'm happy with it now that we're returning |
Submitting this so it doesn't go stale! Let's continue iterating on these ideas :) |
* main: Fix multiple grouping with missing groups (pydata#9650) flox: Properly propagate multiindex (pydata#9649) Update Datatree html repr to indicate inheritance (pydata#9633) Re-implement map_over_datasets using group_subtrees (pydata#9636) fix zarr intersphinx (pydata#9652) Replace black and blackdoc with ruff-format (pydata#9506) Fix error and missing code cell in io.rst (pydata#9641) Support alternative names for the root node in DataTree.from_dict (pydata#9638) Updates to DataTree.equals and DataTree.identical (pydata#9627) DOC: Clarify error message in open_dataarray (pydata#9637) Add zip_subtrees for paired iteration over DataTrees (pydata#9623) Type check datatree tests (pydata#9632) Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631) Bug fixes for DataTree indexing and aggregation (pydata#9626) Add inherit=False option to DataTree.copy() (pydata#9628) docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625) Migration guide for users of old datatree repo (pydata#9598) Reimplement Datatree typed ops (pydata#9619)
* main: (63 commits) Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651) Change URL for pydap test (pydata#9655) Fix multiple grouping with missing groups (pydata#9650) flox: Properly propagate multiindex (pydata#9649) Update Datatree html repr to indicate inheritance (pydata#9633) Re-implement map_over_datasets using group_subtrees (pydata#9636) fix zarr intersphinx (pydata#9652) Replace black and blackdoc with ruff-format (pydata#9506) Fix error and missing code cell in io.rst (pydata#9641) Support alternative names for the root node in DataTree.from_dict (pydata#9638) Updates to DataTree.equals and DataTree.identical (pydata#9627) DOC: Clarify error message in open_dataarray (pydata#9637) Add zip_subtrees for paired iteration over DataTrees (pydata#9623) Type check datatree tests (pydata#9632) Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631) Bug fixes for DataTree indexing and aggregation (pydata#9626) Add inherit=False option to DataTree.copy() (pydata#9628) docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625) Migration guide for users of old datatree repo (pydata#9598) Reimplement Datatree typed ops (pydata#9619) ...
Copied from shoyer#2
Fixes #9643
To recap:
handle DataTrees where the nodes are defined in a different order. (closing Why do arithmetic operations between two datatrees depend on the order of subtrees? #9643)
**kwargs
, in order to preservesome flexibility for adding keyword arugments.
assuredly cases where that would make sense. This could be restored
with a option keyword arugment.
To do:
map_over_datasets
from being called likemap_over_datasets(func)(*args)
tomap_over_datasets(func, *args)
, which would be more consistent with apply_ufunc.group_subtrees
interface that yields tuples(path, nodes)
, which will hopefully make it harder to make bugs like Bug fixes for DataTree indexing and aggregation #9626