How to treat name of root node? #81

Closed
TomNicholas opened this issue Apr 29, 2022 · 1 comment
Labels: design question, IO (Representation of particular file formats as trees)

TomNicholas commented Apr 29, 2022

In #76 I refactored the tree structure to use a path-like syntax. This includes referring to the root of a tree as "/", same as in cd / in a unix-like filesystem.

This makes accessing nodes and variables of nodes quite neat, because you can reference nodes via absolute or relative paths:

In [23]: from datatree.tests.test_datatree import create_test_datatree

In [24]: dt = create_test_datatree()

In [25]: dt['set2/a']
Out[25]: 
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x

In [26]: dt['/set2/a']
Out[26]: 
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x

In [27]: dt['./set2/a']
Out[27]: 
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x
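
To illustrate why the three keys above are equivalent when indexing from the root, here is a minimal sketch using a plain nested dict. It is not the datatree implementation: `toy_tree` and `lookup` are hypothetical names, and the sketch only covers lookups that start at the root node.

```python
from pathlib import PurePosixPath

# Hypothetical nested-dict stand-in for the tree; not the DataTree class.
toy_tree = {"set1": {"a": 0, "b": 1}, "set2": {"a": [2, 3], "b": [0.1, 0.2]}}


def lookup(tree: dict, key: str):
    """Resolve a path-like key starting from the root of `tree`."""
    # PurePosixPath drops "." components; the leading "/" anchor is skipped
    # because this lookup already starts at the root.
    parts = [p for p in PurePosixPath(key).parts if p != "/"]
    node = tree
    for part in parts:
        node = node[part]
    return node


for key in ("set2/a", "/set2/a", "./set2/a"):
    print(key, "->", lookup(toy_tree, key))
```

From any node other than the root, an absolute path would first have to walk up to the root, which this sketch does not attempt.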

This refactor also made the name of a DataTree object optional, whereas before a name was required. (Nodes still have a .name attribute, it just can be None.)

In [28]: dt.name

Normally this doesn't matter, because when assigned a .parent a node's .name property will just point to the key under which it is stored as a child. This echoes the way an unnamed DataArray can be stored in a Dataset.

In [29]: import xarray as xr

In [30]: ds = xr.Dataset()

In [31]: da = xr.DataArray(0)

In [32]: ds['foo'] = da

In [33]: ds['foo'].name
Out[33]: 'foo'
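
The same idea for tree nodes can be sketched roughly as follows. This is a toy model under my own assumptions, not the code from #76: `ToyNode` and its attributes are hypothetical, and the only point is that a child's name is looked up from its parent's keys while a parentless root may stay unnamed.

```python
from __future__ import annotations


class ToyNode:
    """Hypothetical stand-in for a tree node; not the datatree implementation."""

    def __init__(self, name: str | None = None):
        self._name = name  # only consulted while the node has no parent
        self.parent: ToyNode | None = None
        self.children: dict[str, ToyNode] = {}

    @property
    def name(self) -> str | None:
        if self.parent is None:
            return self._name
        # A child is named by the key it is stored under in its parent.
        for key, child in self.parent.children.items():
            if child is self:
                return key
        return None

    def __setitem__(self, key: str, child: ToyNode) -> None:
        child.parent = self
        self.children[key] = child

    @property
    def path(self) -> str:
        # The root is always "/", whether or not it has a name.
        if self.parent is None:
            return "/"
        return self.parent.path.rstrip("/") + "/" + self.name


root = ToyNode()   # unnamed root
leaf = ToyNode()   # unnamed until attached
root["set2"] = leaf

print(root.name)  # None
print(leaf.name)  # set2
print(leaf.path)  # /set2
```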

However this means that the root node of a tree is no longer required to have a name in general.


This is good because

  • As a user you normally don't care about the name of the root when manipulating the tree, only the names of the nodes,

  • It makes the __init__ signature simpler as name is no longer a required arg,

  • It most closely echoes how filepaths work (the filesystem root "/" doesn't have another name),

  • Roundtripping from Zarr/netCDF files still seems to work (see test_io.py),

  • Roundtripping from dictionaries still works if the root node is unnamed

    In [35]: d = {node.path: node.ds for node in dt.subtree}
    
    In [36]: roundtrip = DataTree.from_dict(d)
    
    In [37]: roundtrip
    Out[37]: 
    DataTree('None', parent=None)
    │   Dimensions:  (y: 3, x: 2)
    │   Dimensions without coordinates: y, x
    │   Data variables:
    │       a        (y) int64 6 7 8
    │       set0     (x) int64 9 10
    ├── DataTree('set1')
    │   │   Dimensions:  ()
    │   │   Data variables:
    │   │       a        int64 0
    │   │       b        int64 1
    │   ├── DataTree('set1')
    │   └── DataTree('set2')
    ├── DataTree('set2')
    │   │   Dimensions:  (x: 2)
    │   │   Dimensions without coordinates: x
    │   │   Data variables:
    │   │       a        (x) int64 2 3
    │   │       b        (x) float64 0.1 0.2
    │   └── DataTree('set1')
    └── DataTree('set3')
    
    In [38]: dt.equals(roundtrip)
    Out[38]: True

But it's bad because

  • Roundtripping from dictionaries doesn't work anymore if the root node is named

    In [39]: dt2 = dt
    
    In [40]: dt2.name = "root"
    
    In [41]: d2 = {node.path: node.ds for node in dt2.subtree}
    
    In [42]: roundtrip2 = DataTree.from_dict(d2)
    
    In [43]: roundtrip2
    Out[43]: 
    DataTree('None', parent=None)
    │   Dimensions:  (y: 3, x: 2)
    │   Dimensions without coordinates: y, x
    │   Data variables:
    │       a        (y) int64 6 7 8
    │       set0     (x) int64 9 10
    ├── DataTree('set1')
    │   │   Dimensions:  ()
    │   │   Data variables:
    │   │       a        int64 0
    │   │       b        int64 1
    │   ├── DataTree('set1')
    │   └── DataTree('set2')
    ├── DataTree('set2')
    │   │   Dimensions:  (x: 2)
    │   │   Dimensions without coordinates: x
    │   │   Data variables:
    │   │       a        (x) int64 2 3
    │   │       b        (x) float64 0.1 0.2
    │   └── DataTree('set1')
    └── DataTree('set3')
    
    In [44]: dt2.equals(roundtrip2)
    Out[44]: False

  • The signature of DataTree.from_dict becomes a bit weird, because if you want to name the root node the only way to do it is to pass a separate name argument, i.e.

    In [45]: dt3 = DataTree.from_dict(d, name='root')
    
    In [46]: dt3
    Out[46]: 
    DataTree('root', parent=None)
    ├── DataTree('set1')
    │   │   Dimensions:  ()
    │   │   Data variables:
    │   │       a        int64 0
    │   │       b        int64 1
    │   ├── DataTree('set1')
    │   └── DataTree('set2')
    ├── DataTree('set2')
    │   │   Dimensions:  (x: 2)
    │   │   Dimensions without coordinates: x
    │   │   Data variables:
    │   │       a        (x) int64 2 3
    │   │       b        (x) float64 0.1 0.2
    │   └── DataTree('set1')
    └── DataTree('set3')
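
One plausible reading of the failed roundtrip is that a dictionary keyed by path has nowhere to store the root's name, since the root is always keyed by "/". The sketch below reproduces that effect with a toy dataclass. `Node`, `to_dict` and `from_dict` here are hypothetical helpers written for this illustration; they are not the DataTree methods, and the real cause in datatree may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical toy node; equality compares name, data and children."""
    name: Optional[str] = None
    data: Optional[int] = None
    children: dict = field(default_factory=dict)


def to_dict(root: Node) -> dict:
    """Flatten a tree to {path: data}; the root is always keyed by "/"."""
    out = {}

    def walk(node: Node, path: str) -> None:
        out[path] = node.data
        for key, child in node.children.items():
            walk(child, path.rstrip("/") + "/" + key)

    walk(root, "/")
    return out


def from_dict(d: dict, name: Optional[str] = None) -> Node:
    """Rebuild a tree; the root name can only come from the extra argument."""
    root = Node(name=name, data=d.get("/"))
    for path, data in d.items():
        if path == "/":
            continue
        node = root
        for part in path.strip("/").split("/"):
            node = node.children.setdefault(part, Node(name=part))
        node.data = data
    return root


named = Node(name="root", children={"set1": Node(name="set1", data=0)})
flat = to_dict(named)

print(flat)                                   # the name "root" appears nowhere
print(from_dict(flat) == named)               # False: root name was lost
print(from_dict(flat, name="root") == named)  # True: name passed separately
```

In this toy version the separate name argument is the only channel through which the root name can survive, which matches the awkwardness described above.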

What do we think about this behaviour? Does this seem like a good design, or annoyingly finicky?

@jhamman I notice that in the code you wrote for the io you put a note about not being able to specify a root group for the tree. Is that related to this question? Do you have any other thoughts on this?

jhamman commented May 3, 2022

> @jhamman I notice that in the code you wrote for the io you put a note about not being able to specify a root group for the tree. Is that related to this question? Do you have any other thoughts on this?

I believe my comment was referring to supplying the root group when writing a datatree, such that the child paths are prepended with the group id (i.e. dt.to_netcdf('foo.nc', group='/foo/bar/')). I don't think anything kept me from implementing that feature apart from my goal of an MVP at the time. I also think the changes in #76 (and your description above) will work with this feature if or when someone implements it. (tl;dr: I don't think there is a problem here.)
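
As a rough sketch of what prepending a group id to the node paths could look like (this feature is not implemented; `prepend_group` is a hypothetical helper and does not reflect how to_netcdf works):

```python
from posixpath import join, normpath


def prepend_group(paths, group):
    """Return each node path re-rooted under `group` (hypothetical helper)."""
    # Strip the leading "/" so join() treats the node path as relative,
    # then normalise away any trailing or doubled slashes.
    return [normpath(join(group, p.lstrip("/"))) for p in paths]


print(prepend_group(["/", "/set1", "/set2/a"], "/foo/bar/"))
# ['/foo/bar', '/foo/bar/set1', '/foo/bar/set2/a']
```

Because the root path is always "/", an unnamed root simply maps onto the group itself here.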
