Re-implement map_over_datasets using group_subtrees #9636

shoyer · 2024-10-16T16:16:42Z

Copied from shoyer#2

To recap:

- It is implemented using zip_subtrees, which means it should properly handle DataTrees where the nodes are defined in a different order (closing #9643, "Why do arithmetic operations between two datatrees depend on the order of subtrees?").
- For simplicity, I removed handling of **kwargs, in order to preserve some flexibility for adding keyword arguments.
- I removed automatic skipping of empty nodes, because there are almost assuredly cases where that would make sense. This could be restored with an optional keyword argument.

To do:

- Change map_over_datasets from being called like map_over_datasets(func)(*args) to map_over_datasets(func, *args), which would be more consistent with apply_ufunc.
- Create an alternative group_subtrees interface that yields tuples (path, nodes), which will hopefully make it harder to introduce bugs like those fixed in #9626 ("Bug fixes for DataTree indexing and aggregation").

This should be used for implementing DataTree arithmetic inside map_over_datasets, so the result does not depend on the order in which child nodes are defined.

I have also added a minimal implementation of breadth-first search with an explicit queue, replacing the current recursion-based solution in xarray.core.iterators (which has been removed). The new implementation is also slightly faster in my microbenchmark:

    In [1]: import xarray as xr

    In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)})

    In [3]: %timeit _ = list(tree.subtree)
    # on main
    87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    # with this branch
    55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
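For readers unfamiliar with the technique, here is a minimal sketch of breadth-first traversal with an explicit queue. It is not the code from this PR: the `Node` class and `iter_breadth_first` function are hypothetical stand-ins that use only the standard library.

```python
from __future__ import annotations

import collections
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Hypothetical stand-in for a tree node; not xarray's TreeNode."""

    name: str
    children: dict[str, Node] = field(default_factory=dict)


def iter_breadth_first(root: Node) -> Iterator[Node]:
    """Yield nodes level by level using an explicit FIFO queue."""
    queue = collections.deque([root])
    while queue:
        node = queue.popleft()
        yield node
        # Children go behind everything already queued, so the whole
        # current level is visited before any grandchildren.
        queue.extend(node.children.values())


tree = Node("root", {
    "a": Node("a", {"c": Node("c")}),
    "b": Node("b"),
})
order = [node.name for node in iter_breadth_first(tree)]
print(order)  # ['root', 'a', 'b', 'c']
```

An explicit queue avoids nested generator frames for each level of the tree, which is one plausible reason an iterative version could be faster than a recursive one; the sketch itself makes no timing claim.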

xarray/core/datatree_mapping.py

shoyer · 2024-10-16T16:19:33Z

Here are @TomNicholas's comments from my earlier PR: shoyer#2 (review)

It is implemented using zip_subtrees, which means it should properly
handle DataTrees where the nodes are defined in a different order.

👍

For simplicity, I removed handling of **kwargs, in order to preserve
some flexibility for adding keyword arguments.

So again the idea here is that this is more similar to apply_ufunc?

I removed automatic skipping of empty nodes, because there are almost
assuredly cases where that would make sense. This could be restored
with an optional keyword argument.

I will look at this in more detail in a bit, but there are definitely issues on the original datatree repo caused by not skipping empty nodes.

The issue I was thinking of was xarray-contrib/datatree#262, but actually that would also be handled by pydata#9588.

I wonder if we should also change map_over_datasets from being called like map_over_datasets(func)(*args) to map_over_datasets(func, *args), which would be more consistent with apply_ufunc.

👍 The existing signature was motivated by me wanting to decorate inherited methods, but we don't really need that now. Consistency with apply_ufunc is a good point though. I agree that map_over_datasets(func, *args) might be better - that's also more similar to how the method version DataTree.map_over_datasets currently works, and similar to Dataset.map.
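To make the two call styles under discussion concrete, here is a toy sketch over plain dictionaries. `map_values_curried` and `map_values` are hypothetical names for illustration only; they are not xarray functions and do not reproduce map_over_datasets.

```python
from typing import Any, Callable


def map_values_curried(func: Callable[..., Any]) -> Callable[..., dict]:
    """Decorator style: wrap the function first, then call with the inputs."""
    def wrapper(*mappings: dict) -> dict:
        return map_values(func, *mappings)
    return wrapper


def map_values(func: Callable[..., Any], *mappings: dict) -> dict:
    """Direct style: the function and its inputs go in a single call."""
    first, *rest = mappings
    # Values are matched by key, so insertion order does not matter.
    return {key: func(first[key], *(m[key] for m in rest)) for key in first}


a = {"x": 1, "y": 2}
b = {"y": 20, "x": 10}  # same keys, different insertion order

curried = map_values_curried(lambda p, q: p + q)(a, b)
direct = map_values(lambda p, q: p + q, a, b)
print(curried == direct, direct)  # True {'x': 11, 'y': 22}
```

The decorator style is convenient when wrapping a function once for reuse, while the direct style reads like an ordinary function call.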

shoyer · 2024-10-19T01:58:31Z

This has gotten a bit bigger. I've also:

- reimplemented DataTree.isomorphic to use group_subtrees
- added a new iterator subtree_with_path
- reimplemented filter and match using subtree_with_path (to fix a bug when applying operations to a subtree)
- updated the documentation

shoyer · 2024-10-19T02:04:19Z

Anyways this is ready for review!

TomNicholas · 2024-10-19T18:07:05Z

xarray/core/treenode.py

+            path, node = queue.popleft()
+            yield path, node
+            queue.extend(
+                (os.path.join(path, name), child)


I bet this is what's causing the test failures on Windows. Paths between nodes cannot be treated the same as filesystem paths! That's why I made NodePath, so you could just use that here to fix it.

Another reason to hide path manipulation logic behind an abstraction is so that we can generalize it later to handle / reject Hashables. #8836 (comment)

You got there just as I was reviewing! So this comment should be fixed by 1f07b63
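The platform dependence mentioned in this thread can be reproduced on any OS with the standard library alone. In the sketch below, `PurePosixPath` stands in for NodePath; that NodePath joins the same way is an assumption based on the later remark that it inherits pathlib's behaviour.

```python
import ntpath
import posixpath
from pathlib import PurePosixPath

# os.path is ntpath on Windows and posixpath elsewhere, so the same
# os.path.join call produces different separators on each platform.
windows_style = ntpath.join("a", "b")
posix_style = posixpath.join("a", "b")

# PurePosixPath always joins with "/", regardless of the host OS.
node_style = str(PurePosixPath("a") / "b")

print(repr(windows_style))  # 'a\\b'
print(repr(posix_style))    # 'a/b'
print(repr(node_style))     # 'a/b'
```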

TomNicholas · 2024-10-19T18:15:12Z

doc/user-guide/hierarchical-data.rst

+A very useful pattern is to iterate over :py:class:`~xarray.DataTree.subtree_with_keys`
+to manipulate nodes however you wish, then rebuild a new tree using
+:py:meth:`xarray.DataTree.from_dict()`.


So we have a new property:

assert DataTree.from_dict(dt.subtree_with_keys()) == dt

Is there a good place we could document these properties?

TomNicholas · 2024-10-19T18:16:35Z

doc/user-guide/hierarchical-data.rst

    xr.DataTree.from_dict(non_empty_nodes)

 You can see this tree is similar to the ``dt`` object above, except that it is missing the empty nodes ``a/c`` and ``a/c/d``.

-(If you want to keep the name of the root node, you will need to add the ``name`` kwarg to :py:class:`~xarray.DataTree.from_dict`, i.e. ``DataTree.from_dict(non_empty_nodes, name=dt.root.name)``.)
+(If you want to keep the name of the root node, you will need to add the ``name`` kwarg to :py:class:`~xarray.DataTree.from_dict`, i.e. ``DataTree.from_dict(non_empty_nodes, name=dt.name)``.)


Good catch!

TomNicholas

Great! I have a large number of small comments. The most important one is that there is an unexpected error in one of the docs examples.

We should discuss if there is much more to do before releasing now.

TomNicholas · 2024-10-19T19:29:24Z

xarray/core/datatree.py

-        full_file_like_paths_to_all_nodes_in_subtree = {
-            node.path[1:]: node for node in self.subtree
+        paths_to_all_nodes_in_subtree = {
+            path: node for path, node in self.subtree_with_keys if path


I'm not sure I understand this change. Why would bool(path) ever not be True? Even the root will have a non-empty string '/'.

Also the [1:] was there to remove the preceding /, so that accessing via getitem would use relative paths.

We don't need [1:] because subtree_with_keys iterates over relative paths.

The relative path path="" is used for the root node.

We don't need [1:] because subtree_with_keys iterates over relative paths.

👍

The relative path path="" is used for the root node.

But we're about to change that. (#9636 (comment))

TomNicholas · 2024-10-19T20:42:49Z

xarray/core/treenode.py

    """
    if not trees:
        raise TypeError("must pass at least one tree object")

    # https://en.wikipedia.org/wiki/Breadth-first_search#Pseudocode
-    queue = collections.deque([trees])
+    queue = collections.deque([("", trees)])


I guess this answers my question about why bool(path) would ever be False. But why are we doing this exactly?

We could equivalently return "", ".", "/" or ./ for the top-level node. Or we could make None a sentinel value recognized by from_dict.

Given that the others are relative paths, I would lean towards the strings "" or ".". I guess "." might make more sense since it's the canonical path returned by NodePath?

We could equivalently return "", ".", "/" or ./ for the top-level node

We shouldn't return "/", because to me that would indicate an absolute, not relative path. The current code in this branch seems to handle that non-root situation just fine:

    In [21]: dt1 = xr.DataTree.from_dict({"root/c/a": xr.Dataset({"x": 1}), "root/c/b": xr.Dataset({"x": 2})})

    In [22]: dt2 = xr.DataTree.from_dict(
        ...:     {"root/c/a": xr.Dataset({"x": 10}), "root/c/b": xr.Dataset({"x": 20})}
        ...: )

    In [23]: result = {}

    In [24]: for path, (node1, node2) in xr.group_subtrees(dt1['root'], dt2['root']):
        ...:     print(path)
        ...:     print(node1.dataset)
        ...:     result[path] = node1.dataset + node2.dataset
        ...:
    .
    <xarray.DatasetView> Size: 0B
    Dimensions:  ()
    Data variables:
        *empty*
    c
    <xarray.DatasetView> Size: 0B
    Dimensions:  ()
    Data variables:
        *empty*
    c/a
    <xarray.DatasetView> Size: 8B
    Dimensions:  ()
    Data variables:
        x        int64 8B 1
    c/b
    <xarray.DatasetView> Size: 8B
    Dimensions:  ()
    Data variables:
        x        int64 8B 2

That looks good to me - the returned paths are clearly relative to dt1['root'], not the actual root (dt).

I guess "." might make more sense since it's the canonical path returned by NodePath?

I think it should be ".", because it's generally nice and intuitive to match the behaviour of pathlib (xref #9448):

    In [1]: from pathlib import PurePath

    In [2]: PurePath('a/b/').relative_to('a/b/')
    Out[2]: PurePosixPath('.')

    In [3]: PurePath('a/b/').relative_to('a/')
    Out[3]: PurePosixPath('b')

    In [4]: PurePath('a/').relative_to('a/b', walk_up=True)  # requires python 3.12
    Out[4]: PurePosixPath('..')

    In [5]: PurePath('a/b/').relative_to('.')
    Out[5]: PurePosixPath('a/b')

NodePath just inherits that behaviour.

I can also see an argument for './', because then direct string concatenation with a/b gives a valid relative path.
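The idea discussed in this thread (breadth-first iteration over several trees at once, keyed by a relative path with "." for the top-level node) can be sketched with toy trees. This is not xarray's group_subtrees: `group_by_relative_path` is a hypothetical name, trees are plain nested dicts, and `PurePosixPath` is used as a stand-in for NodePath.

```python
from __future__ import annotations

import collections
from pathlib import PurePosixPath
from typing import Iterator

# Toy trees: each node is a dict mapping child names to child nodes.
Tree = dict


def group_by_relative_path(*trees: Tree) -> Iterator[tuple[str, tuple[Tree, ...]]]:
    """Yield (relative_path, nodes) breadth-first across matching trees."""
    if not trees:
        raise TypeError("must pass at least one tree object")
    # The top-level nodes get the relative path ".", as pathlib would give.
    queue = collections.deque([(PurePosixPath("."), trees)])
    while queue:
        path, nodes = queue.popleft()
        yield str(path), nodes
        names = set(nodes[0])
        if any(set(node) != names for node in nodes[1:]):
            raise ValueError(f"children at {str(path)!r} do not match")
        # Children are looked up by name, so definition order is irrelevant.
        for name in nodes[0]:
            queue.append((path / name, tuple(node[name] for node in nodes)))


t1 = {"c": {"a": {}, "b": {}}}
t2 = {"c": {"b": {}, "a": {}}}  # children defined in a different order
paths = [path for path, _ in group_by_relative_path(t1, t2)]
print(paths)  # ['.', 'c', 'c/a', 'c/b']
```

Joining onto `PurePosixPath(".")` drops the leading dot, so child paths come out as `c` and `c/a` rather than `./c`, which matches the relative paths printed in the session above.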

shoyer · 2024-10-20T17:44:26Z

thanks for the careful review! I think I resolved most of the issues, please take another look

TomNicholas · 2024-10-20T20:16:51Z

I'm happy with it now that we're returning "."!

shoyer · 2024-10-21T15:51:08Z

Submitting this so it doesn't go stale! Let's continue iterating on these ideas :)

shoyer added 5 commits October 15, 2024 15:45

fix pytype error

23da8ca

Merge branch 'main' into zip_subtree

4480e11

Merge branch 'main' into zip_subtree_map

bed8cba

shoyer marked this pull request as draft October 16, 2024 16:17

TomNicholas added the topic-DataTree label (Related to the implementation of a DataTree class) Oct 16, 2024

shoyer added 2 commits October 16, 2024 15:51

fix typing of map_over_datasets

4353581

add group_subtrees

1aa7601

TomNicholas mentioned this pull request Oct 18, 2024

Why do arithmetic operations between two datatrees depend on the order of subtrees? #9643

Closed

shoyer added 7 commits October 18, 2024 16:04

wip fixes

89ea46e

Merge branch 'main' into zip_subtree_map

16ef362

update isomorphic

93ba3a1

documentation and API change for map_over_datasets

e4bc1a0

mypy fixes

3b5a41b

fix test

5cc7e8f

diff formatting

8ef0522

shoyer marked this pull request as ready for review October 19, 2024 01:29

shoyer changed the title ~~Re-implement map_over_datasets using zip_subtrees~~ Re-implement map_over_datasets using group_subtrees Oct 19, 2024

shoyer added 3 commits October 18, 2024 21:52

more mypy

1f931ff

doc fix

5a99811

more doc fix

bd976f6

add api docs

dd0280d

add utility for joining path on windows

1f07b63

TomNicholas reviewed Oct 19, 2024

docstring

ab81dcf

add an overload for two return values from map_over_datasets

74119c3

TomNicholas reviewed Oct 19, 2024

TomNicholas approved these changes Oct 19, 2024

shoyer added 3 commits October 19, 2024 16:55

partial fixes per review

b93c46e

fixes per review

fca6780

remove a couple of xfails

b681181

Merge branch 'main' into zip_subtree_map

b9b3f3e

shoyer merged commit e58edcc into pydata:main Oct 21, 2024
29 checks passed
