Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-implement map_over_datasets using group_subtrees #9636

Merged
merged 25 commits into from
Oct 21, 2024

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Oct 16, 2024

Copied from shoyer#2

Fixes #9643

To recap:

  • It is implemented using zip_subtrees, which means it should properly
    handle DataTrees where the nodes are defined in a different order. (closing Why do arithmetic operations between two datatrees depend on the order of subtrees? #9643)
  • For simplicity, I removed handling of **kwargs, in order to preserve
    some flexibility for adding keyword arugments.
  • I removed automatic skipping of empty nodes, because there are almost
    assuredly cases where that would make sense. This could be restored
    with a option keyword arugment.

To do:

  • change map_over_datasets from being called like map_over_datasets(func)(*args) to map_over_datasets(func, *args), which would be more consistent with apply_ufunc.
  • create an alternative group_subtrees interface that yields tuples (path, nodes) , which will hopefully make it harder to make bugs like Bug fixes for DataTree indexing and aggregation #9626

This should be used for implementing DataTree arithmetic inside
map_over_datasets, so the result does not depend on the order in which
child nodes are defined.

I have also added a minimal implementation of breadth-first-search with
an explicit queue the current recursion based solution in
xarray.core.iterators (which has been removed). The new implementation
is also slightly faster in my microbenchmark:

    In [1]: import xarray as xr

    In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)})

    In [3]: %timeit _ = list(tree.subtree)
    # on main
    87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    # with this branch
    55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
The main changes:

- It is implemented using zip_subtrees, which means it should properly
  handle DataTrees where the nodes are defined in a different order.
- For simplicity, I removed handling of `**kwargs`, in order to preserve
  some flexibility for adding keyword arugments.
- I removed automatic skipping of empty nodes, because there are almost
  assuredly cases where that would make sense. This could be restored
  with a option keyword arugment.
@shoyer shoyer marked this pull request as draft October 16, 2024 16:17
@shoyer
Copy link
Member Author

shoyer commented Oct 16, 2024

Here are @TomNicholas's comments from my earlier PR: shoyer#2 (review)

It is implemented using zip_subtrees, which means it should properly
handle DataTrees where the nodes are defined in a different order.

👍

For simplicity, I removed handling of **kwargs, in order to preserve
some flexibility for adding keyword arguments.

So again the idea here is that that is more similar to apply_ufunc?

I removed automatic skipping of empty nodes, because there are almost
assuredly cases where that would make sense. This could be restored
with a option keyword arugment.

I will look at this in more detail in a bit, but there are definitely issues on the original datatree repo caused by not skipping empty nodes.

The issue I was thinking of was xarray-contrib/datatree#262, but actually that would also be handled by pydata#9588.

I wonder if we should also change map_over_datasets from being called like map_over_datasets(func)(*args) to map_over_datasets(func, *args), which would be more consistent with apply_ufunc.

👍 The existing signature was motivated by me wanting to decorate inherited methods, but we don't really need that now. Consistency with apply_ufunc is a good point though. I agree that map_over_datasets(func, *args) might be better - that's also more similar to how the method version DataTree.map_over_datasets currently works, and similar to Dataset.map.

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Oct 16, 2024
@shoyer shoyer marked this pull request as ready for review October 19, 2024 01:29
@shoyer shoyer changed the title Re-implement map_over_datasets using zip_subtrees Re-implement map_over_datasets using group_subtrees Oct 19, 2024
@shoyer
Copy link
Member Author

shoyer commented Oct 19, 2024

This has gotten a bit bigger. I've also:

  1. reimplemented DataTree.isomorphic to use group_subtrees
  2. added a new iterator subtree_with_path
  3. reimplemented filter and match using subtree_with_path (to fix a bug when applying operations to a subtree)
  4. updated the documentation

@shoyer
Copy link
Member Author

shoyer commented Oct 19, 2024

Anyways this is ready for review!

path, node = queue.popleft()
yield path, node
queue.extend(
(os.path.join(path, name), child)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet this is what's causing the test failures on Windows. Paths between nodes cannot be treated the same as filesystem paths! That's why I made NodePath, so you could just use that here to fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason to hide path manipulation logic behind an abstraction is so that we can generalize it later to handle / reject Hashables. #8836 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You got there just as I was reviewing! So this comment should be fixed by 1f07b63

Comment on lines +368 to +370
A very useful pattern is to iterate over :py:class:`~xarray.DataTree.subtree_with_keys`
to manipulate nodes however you wish, then rebuild a new tree using
:py:meth:`xarray.DataTree.from_dict()`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have a new property:

assert DataTree.from_dict(dt.subtree_with_keys()) == dt

Is there a good place we could document these properties?

xr.DataTree.from_dict(non_empty_nodes)

You can see this tree is similar to the ``dt`` object above, except that it is missing the empty nodes ``a/c`` and ``a/c/d``.

(If you want to keep the name of the root node, you will need to add the ``name`` kwarg to :py:class:`~xarray.DataTree.from_dict`, i.e. ``DataTree.from_dict(non_empty_nodes, name=dt.root.name)``.)
(If you want to keep the name of the root node, you will need to add the ``name`` kwarg to :py:class:`~xarray.DataTree.from_dict`, i.e. ``DataTree.from_dict(non_empty_nodes, name=dt.name)``.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I have a large number of small comments. The most important one is that there is an unexpected error in one of the docs examples.

We should discuss if there is much more to do before releasing now.

doc/user-guide/hierarchical-data.rst Outdated Show resolved Hide resolved
doc/user-guide/hierarchical-data.rst Outdated Show resolved Hide resolved
doc/user-guide/hierarchical-data.rst Outdated Show resolved Hide resolved
full_file_like_paths_to_all_nodes_in_subtree = {
node.path[1:]: node for node in self.subtree
paths_to_all_nodes_in_subtree = {
path: node for path, node in self.subtree_with_keys if path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this change. Why would bool(path) ever not be True? Even the root will have a non-empty string '/'.

Also the [1:] was there to remove the preceding /, so that accessing via getitem would use relative paths.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need [1:] because subtree_with_keys iterates over relative paths.

The relative path path="" is used for the root node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need [1:] because subtree_with_keys iterates over relative paths.

👍

The relative path path="" is used for the root node.

But we're about to change that. (#9636 (comment))

xarray/core/datatree.py Show resolved Hide resolved
xarray/core/datatree_mapping.py Show resolved Hide resolved
xarray/core/datatree_mapping.py Show resolved Hide resolved
xarray/core/datatree_mapping.py Outdated Show resolved Hide resolved
"""
if not trees:
raise TypeError("must pass at least one tree object")

# https://en.wikipedia.org/wiki/Breadth-first_search#Pseudocode
queue = collections.deque([trees])
queue = collections.deque([("", trees)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this answers my question about why bool(path) would ever be False. But why are we doing this exactly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could equivalently return "", ".", "/" or ./ for the top-level node. Or we could make None a sentinel value recognized by from_dict.

Given that the others are relative paths, I would lean towards the strings "" or ".". I guess "." might make more sense since it's the canonical path returned by NodePath?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could equivalently return "", ".", "/" or ./ for the top-level node

We shouldn't return "/", because to me that would indicate an absolute, not relative path. The current code in this branch seems to handle that non-root situation just fine:

In [21]: dt1 = xr.DataTree.from_dict({"root/c/a": xr.Dataset({"x": 1}), "root/c/b": xr.Dataset({"x": 2})})

In [22]: dt2 = xr.DataTree.from_dict(
    ...:     {"root/c/a": xr.Dataset({"x": 10}), "root/c/b": xr.Dataset({"x": 20})}
    ...: )

In [23]: result = {}

In [24]: for path, (node1, node2) in xr.group_subtrees(dt1['root'], dt2['root']):
    ...:     print(path)
    ...:     print(node1.dataset)
    ...:     result[path] = node1.dataset + node2.dataset
    ...: 
.
<xarray.DatasetView> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
c
<xarray.DatasetView> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
c/a
<xarray.DatasetView> Size: 8B
Dimensions:  ()
Data variables:
    x        int64 8B 1
c/b
<xarray.DatasetView> Size: 8B
Dimensions:  ()
Data variables:
    x        int64 8B 2

That looks good to me - the returned paths are clearly relative to dt1['root'], not the actual root (dt).

I guess "." might make more sense since it's the canonical path returned by NodePath?

I think it should be ., because its generally nice and intuitive to match the behaviour of pathlib (xref #9448):

In [1]: from pathlib import PurePath

In [2]: PurePath('a/b/').relative_to('a/b/')
Out[2]: PurePosixPath('.')

In [3]: PurePath('a/b/').relative_to('a/')
Out[3]: PurePosixPath('b')

In [4]: PurePath('a/').relative_to('a/b', walk_up=True)  # requires python 3.12
Out[4]: PurePosixPath('..')

In [5]: PurePath('a/b/').relative_to('.')
Out[5]: PurePosixPath('a/b')

NodePath just inherits that behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also see an argument for './', because then direct string concatenation with a/b gives a valid relative path.

xarray/tests/test_datatree_mapping.py Outdated Show resolved Hide resolved
@shoyer
Copy link
Member Author

shoyer commented Oct 20, 2024

thanks for the careful review! I think I resolved most of the issues, please take another look

@TomNicholas
Copy link
Member

I'm happy with it now that we're returning "."!

@shoyer shoyer merged commit e58edcc into pydata:main Oct 21, 2024
29 checks passed
@shoyer
Copy link
Member Author

shoyer commented Oct 21, 2024

Submitting this so it doesn't go stale! Let's continue iterating on these ideas :)

dcherian added a commit to TomAugspurger/xarray that referenced this pull request Oct 21, 2024
* main:
  Fix multiple grouping with missing groups (pydata#9650)
  flox: Properly propagate multiindex (pydata#9649)
  Update Datatree html repr to indicate inheritance (pydata#9633)
  Re-implement map_over_datasets using group_subtrees (pydata#9636)
  fix zarr intersphinx (pydata#9652)
  Replace black and blackdoc with ruff-format (pydata#9506)
  Fix error and missing code cell in io.rst (pydata#9641)
  Support alternative names for the root node in DataTree.from_dict (pydata#9638)
  Updates to DataTree.equals and DataTree.identical (pydata#9627)
  DOC: Clarify error message in open_dataarray (pydata#9637)
  Add zip_subtrees for paired iteration over DataTrees (pydata#9623)
  Type check datatree tests (pydata#9632)
  Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631)
  Bug fixes for DataTree indexing and aggregation (pydata#9626)
  Add inherit=False option to DataTree.copy() (pydata#9628)
  docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625)
  Migration guide for users of old datatree repo (pydata#9598)
  Reimplement Datatree typed ops (pydata#9619)
dcherian added a commit to dcherian/xarray that referenced this pull request Oct 22, 2024
* main: (63 commits)
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
  Fix multiple grouping with missing groups (pydata#9650)
  flox: Properly propagate multiindex (pydata#9649)
  Update Datatree html repr to indicate inheritance (pydata#9633)
  Re-implement map_over_datasets using group_subtrees (pydata#9636)
  fix zarr intersphinx (pydata#9652)
  Replace black and blackdoc with ruff-format (pydata#9506)
  Fix error and missing code cell in io.rst (pydata#9641)
  Support alternative names for the root node in DataTree.from_dict (pydata#9638)
  Updates to DataTree.equals and DataTree.identical (pydata#9627)
  DOC: Clarify error message in open_dataarray (pydata#9637)
  Add zip_subtrees for paired iteration over DataTrees (pydata#9623)
  Type check datatree tests (pydata#9632)
  Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631)
  Bug fixes for DataTree indexing and aggregation (pydata#9626)
  Add inherit=False option to DataTree.copy() (pydata#9628)
  docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625)
  Migration guide for users of old datatree repo (pydata#9598)
  Reimplement Datatree typed ops (pydata#9619)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
Development

Successfully merging this pull request may close these issues.

Why do arithmetic operations between two datatrees depend on the order of subtrees?
2 participants