Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate treenode module. #8757

Merged
merged 25 commits into from
Feb 27, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
065db0e
Update the formating tests
flamingbear Feb 13, 2024
f167b99
Migrate treenode module
flamingbear Feb 14, 2024
6bd492f
Update NotFoundInTreeError description.
flamingbear Feb 15, 2024
b62a21a
Reformat some comments
flamingbear Feb 15, 2024
32053b6
Updates whats-new.rst
flamingbear Feb 15, 2024
32e7453
mypy typing. (terrible?)
flamingbear Feb 15, 2024
7f2a178
Adds __repr__ to NamedNode and updates test
flamingbear Feb 16, 2024
b4fb773
Merge remote-tracking branch 'pydata/main' into mhs/migrate_treenode
flamingbear Feb 16, 2024
548536b
Adds quotes to NamedNode __str__ representation.
flamingbear Feb 16, 2024
b9685d7
swaps " for ' in NamedNode __str__ representation.
flamingbear Feb 16, 2024
59da654
Adding Tom in so he gets blamed properly.
flamingbear Feb 16, 2024
b821fca
Merge remote-tracking branch 'pydata/main' into mhs/migrate_treenode
flamingbear Feb 19, 2024
4530973
resolve conflict whats-new.rst
flamingbear Feb 19, 2024
c03d373
Moves test_treenode.py to xarray/tests.
flamingbear Feb 19, 2024
ccd5374
Merge branch 'main' into mhs/migrate_treenode
flamingbear Feb 20, 2024
4830cd4
refactors backend tests for datatree IO
flamingbear Feb 20, 2024
e8459bc
Add explicit engine back in test_to_zarr
flamingbear Feb 20, 2024
4238ce2
Removes OrderedDict from treenode
flamingbear Feb 21, 2024
db264c0
Renames tests/test_io.py -> tests/test_backends_datatree.py
flamingbear Feb 21, 2024
b4acb0d
typo
flamingbear Feb 21, 2024
0f4b38a
Add types
flamingbear Feb 21, 2024
e7ea2b4
Merge pull request #7 from flamingbear/mhs/migrate-datatree-tests
flamingbear Feb 21, 2024
f2f327f
Pass mypy for 3.9
flamingbear Feb 21, 2024
9cbaf3b
Merge branch 'main' into mhs/migrate_treenode
flamingbear Feb 26, 2024
9db7040
Merge branch 'main' into mhs/migrate_treenode
flamingbear Feb 26, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 7 additions & 3 deletions doc/whats-new.rst
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,9 @@ Documentation

Internal Changes
~~~~~~~~~~~~~~~~

- Migrates ``treenode`` functionality into ``xarray/core`` (:pull:`8757`)
By `Matt Savoie <https://github.com/flamingbear>`_ and `Tom Nicholas
<https://github.com/TomNicholas>`_.


.. _whats-new.2024.02.0:
Expand Down Expand Up @@ -145,9 +147,11 @@ Internal Changes
``xarray/namedarray``. (:pull:`8319`)
By `Tom Nicholas <https://github.com/TomNicholas>`_ and `Anderson Banihirwe <https://github.com/andersy005>`_.
- Imports ``datatree`` repository and history into internal location. (:pull:`8688`)
By `Matt Savoie <https://github.com/flamingbear>`_ and `Justus Magin <https://github.com/keewis>`_.
By `Matt Savoie <https://github.com/flamingbear>`_, `Justus Magin <https://github.com/keewis>`_
and `Tom Nicholas <https://github.com/TomNicholas>`_.
- Adds :py:func:`open_datatree` into ``xarray/backends`` (:pull:`8697`)
By `Matt Savoie <https://github.com/flamingbear>`_.
By `Matt Savoie <https://github.com/flamingbear>`_ and `Tom Nicholas
<https://github.com/TomNicholas>`_.
- Refactor :py:meth:`xarray.core.indexing.DaskIndexingAdapter.__getitem__` to remove an unnecessary
rewrite of the indexer key (:issue: `8377`, :pull:`8758`)
By `Anderson Banihirwe <https://github.com/andersy005>`_.
Expand Down
4 changes: 2 additions & 2 deletions xarray/backends/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -137,8 +137,8 @@ def _open_datatree_netcdf(
**kwargs,
) -> DataTree:
from xarray.backends.api import open_dataset
from xarray.core.treenode import NodePath
from xarray.datatree_.datatree import DataTree
from xarray.datatree_.datatree.treenode import NodePath

ds = open_dataset(filename_or_obj, **kwargs)
tree_root = DataTree.from_dict({"/": ds})
Expand All @@ -159,7 +159,7 @@ def _open_datatree_netcdf(


def _iter_nc_groups(root, parent="/"):
from xarray.datatree_.datatree.treenode import NodePath
from xarray.core.treenode import NodePath

parent = NodePath(parent)
for path, group in root.groups.items():
Expand Down
4 changes: 2 additions & 2 deletions xarray/backends/zarr.py
Original file line number Diff line number Diff line change
Expand Up @@ -1048,8 +1048,8 @@ def open_datatree(
import zarr

from xarray.backends.api import open_dataset
from xarray.core.treenode import NodePath
from xarray.datatree_.datatree import DataTree
from xarray.datatree_.datatree.treenode import NodePath

zds = zarr.open_group(filename_or_obj, mode="r")
ds = open_dataset(filename_or_obj, engine="zarr", **kwargs)
Expand All @@ -1075,7 +1075,7 @@ def open_datatree(


def _iter_zarr_groups(root, parent="/"):
from xarray.datatree_.datatree.treenode import NodePath
from xarray.core.treenode import NodePath

parent = NodePath(parent)
for path, group in root.groups():
Expand Down
100 changes: 50 additions & 50 deletions xarray/datatree_/datatree/treenode.py → xarray/core/treenode.py
Original file line number Diff line number Diff line change
@@ -1,17 +1,12 @@
from __future__ import annotations

import sys
from collections import OrderedDict
from collections.abc import Iterator, Mapping
from pathlib import PurePosixPath
from typing import (
TYPE_CHECKING,
Generic,
Iterator,
Mapping,
Optional,
Tuple,
TypeVar,
Union,
)

from xarray.core.utils import Frozen, is_dict_like
Expand All @@ -25,7 +20,7 @@ class InvalidTreeError(Exception):


class NotFoundInTreeError(ValueError):
"""Raised when operation can't be completed because one node is part of the expected tree."""
"""Raised when operation can't be completed because one node is not part of the expected tree."""


class NodePath(PurePosixPath):
Copy link
Collaborator

@keewis keewis Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're exposing it (and keep it as a pure path, which I think we should), should we move this class to a separate module?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we exposing NodePath? I suggested it, but after talking with @etienneschalk decided maybe it wasn't particularly helpful xarray-contrib/datatree#205.

What's the logic for making it a pure path? The posix path is to ensure the slashes are in a consistent direction (remember this isn't a real filepath so shouldn't change automatically on windows systems).

Copy link
Collaborator

@keewis keewis Feb 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this, but I believe the difference is that PurePosixPath doesn't have methods that require read/write access (like touch, remove, mkdir, iterdir, ...), while PosixPath does.

The main difference between filesystems and datatree is that there's only a single filesystem instance at a time, while you can have multiple DataTree objects at the same time. This means that a NodePath inheriting from PosixPath would have to be created from the DataTree object and would require a lot of discipline to not be confusing.

On the other hand, a NodePath that inherits from PurePosixPath would help a bit with path manipulations (name, root, joinpath(), parts, parent, parents, ...) that we would otherwise have to use posixpath for. I didn't check if posixpath has the same functionality as PurePosixPath, though, and they might also not be as convenient to use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I wrote my above comment too quickly - we both agree it should stay as a pure path.

We could also consider adding extra methods to this NodePath class if it simplifies other parts of the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related issue: xarray-contrib/datatree#283 (Consistency between DataTree methods and pathlib.PurePath methods)

To me the graal would be to have datatree more and more compatible with PurePosixPath (what NodePath already does by inheriting from it, but we can imagine having an API accepting both strings and PurePosixPath, without the need to expose NodePath publicly).

As @keewis mentioned , there is a single instance of the filesystem whereas there can be multiple DataTrees. So a single DataTree could act like a "small filesystem on its own" and implement the PosixPath methods.

NodePath ~ PurePosixPath
DataTree ~ PosixPath 

The idea here is to delegate to pathlib the diffcult task of choosing the best names for methods that help manipulating trees. For an audience of developers that are already used to writing scripts with pathlib, this would make working with datatree automatic! And this also gives plenty of ideas of features. Also, with hierarchical-formats such as Zarr that rely on the filesystem, openable by datatree, the proximity is innate.

Concrete example: working with multi-resolution rasters (a usecase of datatree). By using f-strings and glob, you can write resolution independant code once, like you would have done by manipulating a data structure on the file system

The schema on the pathlib doc (arrows are inheritance):

flowchart BT
  PurePosixPath --> PurePath
  Path --> PurePath
  PureWindowsPath --> PurePath
  PosixPath --> PurePosixPath
  PosixPath --> Path
  WindowsPath --> PureWindowsPath
  WindowsPath --> Path
Loading

How I see it:

flowchart BT
subgraph Pure
  PurePosixPath --> PurePath
  NodePath --> PurePosixPath
  end
subgraph Concrete
  DataTree --> NodePath
  PosixPath --> PurePosixPath
  end
Loading

(I removed Path for simplification)

I am not 100% sure it fully makes sense... The nuance is that when you instanciate a Path, it is always implicit that it is bound to the filesystem, while each DataTree instantiation creates its own data representation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etienneschalk for all your thoughts here!

So a single DataTree could act like a "small filesystem on its own" and implement the PosixPath methods.

I am not 100% sure it fully makes sense... The nuance is that when you instanciate a Path, it is always implicit that it is bound to the filesystem, while each DataTree instantiation creates its own data representation.

I strongly feel that DataTree / NodePath objects should not be associated with concrete filesystem paths.

However this

To me the graal would be to have datatree more and more compatible with PurePosixPath (what NodePath already does by inheriting from it, but we can imagine having an API accepting both strings and PurePosixPath, without the need to expose NodePath publicly).

seems like a reasonable idea, and I think we should continue the discussion of that in xarray-contrib/datatree#283.

For now I don't think there is anything we need to do in this PR specifically - we can revisit the API choices / accepting path-like types in a future PR.

Expand Down Expand Up @@ -55,8 +50,8 @@ class TreeNode(Generic[Tree]):

This class stores no data, it has only parents and children attributes, and various methods.

Stores child nodes in an Ordered Dictionary, which is necessary to ensure that equality checks between two trees
also check that the order of child nodes is the same.
Stores child nodes in an dict, ensuring that equality checks between trees
and order of child nodes is preserved (since python 3.7).

Nodes themselves are intrinsically unnamed (do not possess a ._name attribute), but if the node has a parent you can
find the key it is stored under via the .name property.
Expand All @@ -73,15 +68,16 @@ class TreeNode(Generic[Tree]):
Also allows access to any other node in the tree via unix-like paths, including upwards referencing via '../'.

(This class is heavily inspired by the anytree library's NodeMixin class.)

"""

_parent: Optional[Tree]
_children: OrderedDict[str, Tree]
_parent: Tree | None
_children: dict[str, Tree]

def __init__(self, children: Optional[Mapping[str, Tree]] = None):
def __init__(self, children: Mapping[str, Tree] | None = None):
"""Create a parentless node."""
self._parent = None
self._children = OrderedDict()
self._children = {}
if children is not None:
self.children = children

Expand All @@ -91,7 +87,7 @@ def parent(self) -> Tree | None:
return self._parent

def _set_parent(
self, new_parent: Tree | None, child_name: Optional[str] = None
self, new_parent: Tree | None, child_name: str | None = None
) -> None:
# TODO is it possible to refactor in a way that removes this private method?

Expand Down Expand Up @@ -127,17 +123,15 @@ def _detach(self, parent: Tree | None) -> None:
if parent is not None:
self._pre_detach(parent)
parents_children = parent.children
parent._children = OrderedDict(
{
name: child
for name, child in parents_children.items()
if child is not self
}
)
parent._children = {
name: child
for name, child in parents_children.items()
if child is not self
}
self._parent = None
self._post_detach(parent)

def _attach(self, parent: Tree | None, child_name: Optional[str] = None) -> None:
def _attach(self, parent: Tree | None, child_name: str | None = None) -> None:
if parent is not None:
if child_name is None:
raise ValueError(
Expand Down Expand Up @@ -167,7 +161,7 @@ def children(self: Tree) -> Mapping[str, Tree]:
@children.setter
def children(self: Tree, children: Mapping[str, Tree]) -> None:
self._check_children(children)
children = OrderedDict(children)
children = {**children}

old_children = self.children
del self.children
Expand Down Expand Up @@ -242,7 +236,7 @@ def _iter_parents(self: Tree) -> Iterator[Tree]:
yield node
node = node.parent

def iter_lineage(self: Tree) -> Tuple[Tree, ...]:
def iter_lineage(self: Tree) -> tuple[Tree, ...]:
"""Iterate up the tree, starting from the current node."""
from warnings import warn

Expand All @@ -254,7 +248,7 @@ def iter_lineage(self: Tree) -> Tuple[Tree, ...]:
return tuple((self, *self.parents))

@property
def lineage(self: Tree) -> Tuple[Tree, ...]:
def lineage(self: Tree) -> tuple[Tree, ...]:
"""All parent nodes and their parent nodes, starting with the closest."""
from warnings import warn

Expand All @@ -266,12 +260,12 @@ def lineage(self: Tree) -> Tuple[Tree, ...]:
return self.iter_lineage()

@property
def parents(self: Tree) -> Tuple[Tree, ...]:
def parents(self: Tree) -> tuple[Tree, ...]:
"""All parent nodes and their parent nodes, starting with the closest."""
return tuple(self._iter_parents())

@property
def ancestors(self: Tree) -> Tuple[Tree, ...]:
def ancestors(self: Tree) -> tuple[Tree, ...]:
"""All parent nodes and their parent nodes, starting with the most distant."""

from warnings import warn
Expand Down Expand Up @@ -306,7 +300,7 @@ def is_leaf(self) -> bool:
return self.children == {}

@property
def leaves(self: Tree) -> Tuple[Tree, ...]:
def leaves(self: Tree) -> tuple[Tree, ...]:
"""
All leaf nodes.

Expand All @@ -315,20 +309,18 @@ def leaves(self: Tree) -> Tuple[Tree, ...]:
return tuple([node for node in self.subtree if node.is_leaf])

@property
def siblings(self: Tree) -> OrderedDict[str, Tree]:
def siblings(self: Tree) -> dict[str, Tree]:
"""
Nodes with the same parent as this node.
"""
if self.parent:
return OrderedDict(
{
name: child
for name, child in self.parent.children.items()
if child is not self
}
)
return {
name: child
for name, child in self.parent.children.items()
if child is not self
}
else:
return OrderedDict()
return {}

@property
def subtree(self: Tree) -> Iterator[Tree]:
Expand All @@ -341,12 +333,12 @@ def subtree(self: Tree) -> Iterator[Tree]:
--------
DataTree.descendants
"""
from . import iterators
from xarray.datatree_.datatree import iterators

return iterators.PreOrderIter(self)

@property
def descendants(self: Tree) -> Tuple[Tree, ...]:
def descendants(self: Tree) -> tuple[Tree, ...]:
"""
Child nodes and all their child nodes.

Expand Down Expand Up @@ -431,7 +423,7 @@ def _post_attach(self: Tree, parent: Tree) -> None:
"""Method call after attaching to `parent`."""
pass

def get(self: Tree, key: str, default: Optional[Tree] = None) -> Optional[Tree]:
def get(self: Tree, key: str, default: Tree | None = None) -> Tree | None:
"""
Return the child node with the specified key.

Expand All @@ -445,7 +437,7 @@ def get(self: Tree, key: str, default: Optional[Tree] = None) -> Optional[Tree]:

# TODO `._walk` method to be called by both `_get_item` and `_set_item`

def _get_item(self: Tree, path: str | NodePath) -> Union[Tree, T_DataArray]:
def _get_item(self: Tree, path: str | NodePath) -> Tree | T_DataArray:
"""
Returns the object lying at the given path.

Expand Down Expand Up @@ -488,24 +480,26 @@ def _set(self: Tree, key: str, val: Tree) -> None:
def _set_item(
self: Tree,
path: str | NodePath,
item: Union[Tree, T_DataArray],
item: Tree | T_DataArray,
new_nodes_along_path: bool = False,
allow_overwrite: bool = True,
) -> None:
"""
Set a new item in the tree, overwriting anything already present at that path.

The given value either forms a new node of the tree or overwrites an existing item at that location.
The given value either forms a new node of the tree or overwrites an
existing item at that location.

Parameters
----------
path
item
new_nodes_along_path : bool
If true, then if necessary new nodes will be created along the given path, until the tree can reach the
specified location.
If true, then if necessary new nodes will be created along the
given path, until the tree can reach the specified location.
allow_overwrite : bool
Whether or not to overwrite any existing node at the location given by path.
Whether or not to overwrite any existing node at the location given
by path.

Raises
------
Expand Down Expand Up @@ -580,9 +574,9 @@ class NamedNode(TreeNode, Generic[Tree]):
Implements path-like relationships to other nodes in its tree.
"""

_name: Optional[str]
_parent: Optional[Tree]
_children: OrderedDict[str, Tree]
_name: str | None
_parent: Tree | None
_children: dict[str, Tree]

def __init__(self, name=None, children=None):
super().__init__(children=children)
Expand All @@ -603,8 +597,14 @@ def name(self, name: str | None) -> None:
raise ValueError("node names cannot contain forward slashes")
self._name = name

def __repr__(self, level=0):
repr_value = "\t" * level + self.__str__() + "\n"
for child in self.children:
repr_value += self.get(child).__repr__(level + 1)
return repr_value

def __str__(self) -> str:
return f"NamedNode({self.name})" if self.name else "NamedNode()"
return f"NamedNode('{self.name}')" if self.name else "NamedNode()"

def _post_attach(self: NamedNode, parent: NamedNode) -> None:
"""Ensures child has name attribute corresponding to key under which it has been stored."""
Expand Down
2 changes: 1 addition & 1 deletion xarray/datatree_/datatree/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
from .datatree import DataTree
from .extensions import register_datatree_accessor
from .mapping import TreeIsomorphismError, map_over_subtree
from .treenode import InvalidTreeError, NotFoundInTreeError
from xarray.core.treenode import InvalidTreeError, NotFoundInTreeError


__all__ = (
Expand Down
2 changes: 1 addition & 1 deletion xarray/datatree_/datatree/datatree.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@
MappedDataWithCoords,
)
from .render import RenderTree
from .treenode import NamedNode, NodePath, Tree
from xarray.core.treenode import NamedNode, NodePath, Tree

try:
from xarray.core.variable import calculate_dimensions
Expand Down
2 changes: 1 addition & 1 deletion xarray/datatree_/datatree/iterators.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
from collections import abc
from typing import Callable, Iterator, List, Optional

from .treenode import Tree
from xarray.core.treenode import Tree

"""These iterators are copied from anytree.iterators, with minor modifications."""

Expand Down
4 changes: 2 additions & 2 deletions xarray/datatree_/datatree/mapping.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,10 +9,10 @@
from xarray import DataArray, Dataset

from .iterators import LevelOrderIter
from .treenode import NodePath, TreeNode
from xarray.core.treenode import NodePath, TreeNode

if TYPE_CHECKING:
from .datatree import DataTree
from xarray.core.datatree import DataTree


class TreeIsomorphismError(ValueError):
Expand Down
6 changes: 3 additions & 3 deletions xarray/datatree_/datatree/tests/test_formatting.py
Original file line number Diff line number Diff line change
Expand Up @@ -108,13 +108,13 @@ def test_diff_node_data(self):
Data in nodes at position '/a' do not match:

Data variables only on the left object:
v int64 1
v int64 8B 1

Data in nodes at position '/a/b' do not match:

Differing data variables:
L w int64 5
R w int64 6"""
L w int64 8B 5
R w int64 8B 6"""
)
actual = diff_tree_repr(dt_1, dt_2, "equals")
assert actual == expected
Loading
Loading