Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate treenode module. #8757

Merged
merged 25 commits into from
Feb 27, 2024
Merged

Conversation

flamingbear
Copy link
Member

@flamingbear flamingbear commented Feb 15, 2024

Migrate datatree's treenode module the base class representing node of a tree, with methods for accessing nodes and traversal.

  • completes step 2 datatree/treenode.py Track merging datatree into xarray #8572
  • Tests added or updated
  • [N/A] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • Internal Changes (including notable bug fixes) are documented in whats-new.rst
  • [N/A] New functions/methods are listed in api.rst

flamingbear and others added 6 commits February 15, 2024 14:18
PR (#8702) added nbytes representation in DataArrays and Dataset repr, this adds it to the datatree tests.
Moves treenode.py and test_treenode.py.
Updates some typing.
Updates imports from treenode.
Add test tree structure for easier understanding.
There must be a better way, but I don't know it.
particularly the list comprehension casts.
@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Feb 15, 2024
This test was broken becuase only the root node was being tested and none of
the previous nodes were represented in the __str__.
doc/whats-new.rst Outdated Show resolved Hide resolved
@flamingbear flamingbear marked this pull request as ready for review February 16, 2024 18:04
@TomNicholas
Copy link
Member

TomNicholas commented Feb 16, 2024

Thanks @flamingbear . I am just wondering if you have any overall thoughts on the design / implementation here? e.g. the TreeNode -> NamedNode -> DataTree inheritance hierarchy, or the way that the tree is constructed as connections between individual class instances (rather than e.g. representing the entire tree internally via a single nested dict, for example).

EDIT: If you think it's all fine then that's great, I'm just trying to create an explicit opportunity for you to ask "why didn't you do it this other way?"

@flamingbear
Copy link
Member Author

overall thoughts on the design / implementation here?

So my overall thoughts as I was moving it / grokking it was that the separation of concerns between the movement, linking of nodes and parenting rules vs the finding by names and paths as pretty clear. The hooks for before and after attaching children seem thoughtful. I haven't dived deep into the datatree code yet but this all makes sense so far. If you were to scrap this for a nested dictionary, wouldn't you end up with pretty complicated code to handle moving/adding nodes? My gut says this is a solid choice to represent tree data.

@@ -25,7 +21,7 @@ class InvalidTreeError(Exception):


class NotFoundInTreeError(ValueError):
"""Raised when operation can't be completed because one node is part of the expected tree."""
"""Raised when operation can't be completed because one node is not part of the expected tree."""


class NodePath(PurePosixPath):
Copy link
Collaborator

@keewis keewis Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're exposing it (and keep it as a pure path, which I think we should), should we move this class to a separate module?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we exposing NodePath? I suggested it, but after talking with @etienneschalk decided maybe it wasn't particularly helpful xarray-contrib/datatree#205.

What's the logic for making it a pure path? The posix path is to ensure the slashes are in a consistent direction (remember this isn't a real filepath so shouldn't change automatically on windows systems).

Copy link
Collaborator

@keewis keewis Feb 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this, but I believe the difference is that PurePosixPath doesn't have methods that require read/write access (like touch, remove, mkdir, iterdir, ...), while PosixPath does.

The main difference between filesystems and datatree is that there's only a single filesystem instance at a time, while you can have multiple DataTree objects at the same time. This means that a NodePath inheriting from PosixPath would have to be created from the DataTree object and would require a lot of discipline to not be confusing.

On the other hand, a NodePath that inherits from PurePosixPath would help a bit with path manipulations (name, root, joinpath(), parts, parent, parents, ...) that we would otherwise have to use posixpath for. I didn't check if posixpath has the same functionality as PurePosixPath, though, and they might also not be as convenient to use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I wrote my above comment too quickly - we both agree it should stay as a pure path.

We could also consider adding extra methods to this NodePath class if it simplifies other parts of the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related issue: xarray-contrib/datatree#283 (Consistency between DataTree methods and pathlib.PurePath methods)

To me the graal would be to have datatree more and more compatible with PurePosixPath (what NodePath already does by inheriting from it, but we can imagine having an API accepting both strings and PurePosixPath, without the need to expose NodePath publicly).

As @keewis mentioned , there is a single instance of the filesystem whereas there can be multiple DataTrees. So a single DataTree could act like a "small filesystem on its own" and implement the PosixPath methods.

NodePath ~ PurePosixPath
DataTree ~ PosixPath 

The idea here is to delegate to pathlib the diffcult task of choosing the best names for methods that help manipulating trees. For an audience of developers that are already used to writing scripts with pathlib, this would make working with datatree automatic! And this also gives plenty of ideas of features. Also, with hierarchical-formats such as Zarr that rely on the filesystem, openable by datatree, the proximity is innate.

Concrete example: working with multi-resolution rasters (a usecase of datatree). By using f-strings and glob, you can write resolution independant code once, like you would have done by manipulating a data structure on the file system

The schema on the pathlib doc (arrows are inheritance):

flowchart BT
  PurePosixPath --> PurePath
  Path --> PurePath
  PureWindowsPath --> PurePath
  PosixPath --> PurePosixPath
  PosixPath --> Path
  WindowsPath --> PureWindowsPath
  WindowsPath --> Path
Loading

How I see it:

flowchart BT
subgraph Pure
  PurePosixPath --> PurePath
  NodePath --> PurePosixPath
  end
subgraph Concrete
  DataTree --> NodePath
  PosixPath --> PurePosixPath
  end
Loading

(I removed Path for simplification)

I am not 100% sure it fully makes sense... The nuance is that when you instanciate a Path, it is always implicit that it is bound to the filesystem, while each DataTree instantiation creates its own data representation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etienneschalk for all your thoughts here!

So a single DataTree could act like a "small filesystem on its own" and implement the PosixPath methods.

I am not 100% sure it fully makes sense... The nuance is that when you instanciate a Path, it is always implicit that it is bound to the filesystem, while each DataTree instantiation creates its own data representation.

I strongly feel that DataTree / NodePath objects should not be associated with concrete filesystem paths.

However this

To me the graal would be to have datatree more and more compatible with PurePosixPath (what NodePath already does by inheriting from it, but we can imagine having an API accepting both strings and PurePosixPath, without the need to expose NodePath publicly).

seems like a reasonable idea, and I think we should continue the discussion of that in xarray-contrib/datatree#283.

For now I don't think there is anything we need to do in this PR specifically - we can revisit the API choices / accepting path-like types in a future PR.

@TomNicholas
Copy link
Member

TomNicholas commented Feb 19, 2024 via email

Question is I did update below the released line to give Tom some credit.  I
hope that's is allowable.
doc/whats-new.rst Outdated Show resolved Hide resolved
@flamingbear
Copy link
Member Author

So that failure looks like a common one as opposed to something from this PR. Is there anything specific to address in the PR before it can be merged?

@TomNicholas
Copy link
Member

But whether test_treenode.py lives in a subdirectory or not just doesn't really matter
I think.

We made an executive decision in the datatree meeting today to not bother with subdirectories (for tests or code). We can always reorganise this later if we want to.

@TomNicholas
Copy link
Member

Is there anything specific to address in the PR before it can be merged?

I don't think so - @keewis what say you?

@TomNicholas TomNicholas merged commit dfdd631 into pydata:main Feb 27, 2024
29 checks passed
@flamingbear flamingbear deleted the mhs/migrate_treenode branch February 28, 2024 22:02
dcherian added a commit to dcherian/xarray that referenced this pull request Mar 15, 2024
* main: (31 commits)
  correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713)
  Expand use of `.oindex` and `.vindex` (pydata#8790)
  Return a dataclass from Grouper.factorize (pydata#8777)
  [skip-ci] Fix upstream-dev env (pydata#8839)
  Add dask-expr for windows envs (pydata#8837)
  [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835)
  Add `dask-expr` to environment-3.12.yml (pydata#8827)
  Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736)
  Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784)
  try to get the `upstream-dev` CI to complete again (pydata#8823)
  Bump the actions group with 1 update (pydata#8818)
  Update documentation for clarity (pydata#8817)
  DOC: link to zarr.convenience.consolidate_metadata (pydata#8816)
  Refactor Grouper objects (pydata#8776)
  Grouper object design doc (pydata#8510)
  Bump the actions group with 2 updates (pydata#8804)
  tokenize() should ignore difference between None and {} attrs (pydata#8797)
  fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791)
  Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782)
  Migrate treenode module. (pydata#8757)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Mar 16, 2024
* main: (42 commits)
  correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713)
  Expand use of `.oindex` and `.vindex` (pydata#8790)
  Return a dataclass from Grouper.factorize (pydata#8777)
  [skip-ci] Fix upstream-dev env (pydata#8839)
  Add dask-expr for windows envs (pydata#8837)
  [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835)
  Add `dask-expr` to environment-3.12.yml (pydata#8827)
  Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736)
  Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784)
  try to get the `upstream-dev` CI to complete again (pydata#8823)
  Bump the actions group with 1 update (pydata#8818)
  Update documentation for clarity (pydata#8817)
  DOC: link to zarr.convenience.consolidate_metadata (pydata#8816)
  Refactor Grouper objects (pydata#8776)
  Grouper object design doc (pydata#8510)
  Bump the actions group with 2 updates (pydata#8804)
  tokenize() should ignore difference between None and {} attrs (pydata#8797)
  fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791)
  Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782)
  Migrate treenode module. (pydata#8757)
  ...
@TomNicholas TomNicholas mentioned this pull request Apr 9, 2024
27 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
Development

Successfully merging this pull request may close these issues.

5 participants