Operations on DataTree objects should not create duplicate coordinates on sub-trees #9475

shoyer · 2024-09-10T23:27:49Z

What is your issue?

This is very obviously an issue using the repr from #9470:

In [20]: tree = DataTree(Dataset(coords={'x': [1, 2, 3]}))

In [21]: tree['/foo'] = DataTree(Dataset({'bar': ('x', [4, 5, 6])}))

In [22]: tree
Out[22]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 3)
│   Coordinates:
│     * x        (x) int64 24B 1 2 3
└── Group: /foo
        Dimensions:  (x: 3)
        Data variables:
            bar      (x) int64 24B 4 5 6

In [23]: tree * 2
Out[23]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 3)
│   Coordinates:
│     * x        (x) int64 24B 1 2 3
└── Group: /foo
        Dimensions:  (x: 3)
        Coordinates:
          * x        (x) int64 24B 1 2 3
        Data variables:
            bar      (x) int64 24B 8 10 12

Instead, the result for tree * 2 should look like:

<xarray.DataTree>
Group: /
│   Dimensions:  (x: 3)
│   Coordinates:
│     * x        (x) int64 24B 1 2 3
└── Group: /foo
        Dimensions:  (x: 3)
        Data variables:
            bar      (x) int64 24B 8 10 12

We probably also need to also revisit other Dataset methods that are ported to DataTree via map_over_subtree (xref #9472). Some of these (e.g., arithmetic, aggregations) can likely easily be corrected simply by mapping over nodes with inherited=False. Others (e.g., indexing) will need more careful consideration.

The text was updated successfully, but these errors were encountered:

shoyer · 2024-09-10T23:39:25Z

As a starting point, it might make sense to disable most automatically ported DataTree methods, at least those that can't be implemented via map_over_subtree with inherited=False.

TomNicholas · 2024-09-11T00:16:19Z

I'm kind of surprised that the existing map_over_subtree tests didn't catch this.

shoyer · 2024-09-11T00:27:15Z

I'm kind of surprised that the existing map_over_subtree tests didn't catch this.

This could be because assert_identical (prior to #9473) doesn't currently check this.

TomNicholas · 2024-09-11T16:54:29Z

In the meeting just now we decided that this is a deficiency in the data model, not just in the implementation of map_over_subtree. We should change the model so that identical coordinates in children are automatically removed, so that the above would be automatically de-duplicated at construction time.

This is important to do as otherwise almost every user will run into this problem, especially using this pattern:

dt['lat'] = ...  # coord present on root

ds = dt[path].dataset
result_ds = some_operation(ds)
dt[path].dataset = result_ds  # coord now present on both root and descendant!

Currently this would duplicate the 'lat' coordinate between dt and dt[path], which would lead to very confusing behaviour, e.g.

del dt[path]['lat']
'lat' in dt[path]  # would return true!

(because only the duplicated coordinate on the child would have been deleted, not the original on the root, which would still be inherited).

IO

Disallowing duplicated coordinates in general would subtly affect round-tripping behaviour. It would no longer be possible to write out a netCDF/Zarr file with duplicated coordinates using DataTree, because DataTree could not represent them in the first place.

Round-tripping is then broken for any file with duplicated coordinates, as reading the file would de-duplicate the coordinates, and the information wouldn't be propagated in order to re-duplicate them before writing.

Mostly this wouldn't matter often - there is not much reason to define duplicated coordinates in netCDF/Zarr in the first place.

It might lead to subtle trickiness for files containing CF metadata that refers to other variables in different groups using absolute paths, but that is not something that xarray promises to handle anyway, as interpreting metadata contents is not part of xarray's data model (beyond what happens during decoding).

However again open_groups could still open such files without removing duplicates, and one could imagine writing out group-by-group to a single output file, and that should roundtrip perfectly.

All of this is a consequence of the fact that whilst very similar, the data models of netCDF, Zarr, and DataTree are not identical.

cc @owenlittlejohns @flamingbear @eni-awowale

Not sure exactly what the internal implementation of de-deduplication would be (perhaps some kind of NonDuplicatedChainMap?), but @shoyer said he is on it 🙂 A small change to map_over_subtree might also still be required to close this issue.

shoyer · 2024-09-12T00:45:47Z

It should be straightforward enough to handle duplicated index coordinates. Non-identical coordinates with an index are already disallowed between parent and child nodes, because the parent and child nodes would fail the alignment check in _pre_attach.

However, there is one tricky case: duplicated coordinates without an associated index. Here's an example of how these currently manifest themselves:

In [33]: tree = DataTree(Dataset(coords={'x': 0}), children={'child': DataTree(Dataset({'y': 1}))})

In [34]: tree
Out[34]:
<xarray.DataTree>
Group: /
│   Dimensions:  ()
│   Coordinates:
│       x        int64 8B 0
└── Group: /child
        Dimensions:  ()
        Data variables:
            y        int64 8B 1

In [35]: tree.map_over_subtree(lambda ds: ds)
Out[35]:
<xarray.DataTree>
Group: /
│   Dimensions:  ()
│   Coordinates:
│       x        int64 8B 0
└── Group: /child
        Dimensions:  ()
        Coordinates:
            x        int64 8B 0
        Data variables:
            y        int64 8B 1

Notice the duplicated coordinates /x and /child/x.

For consistency with the rest of Xarray, we should try to handle this like duplicate coordinates that result from other Xarray operations like arithmetic. As noted in #9481, the current strategy requires evaluating coordinate values to see if they are equal or not in order to decide where they appear. This is attractive for DataTree as well, but unfortunately is not compatible with lazy evaluation.

Instead, I propose that we switch to the equivalent of compat='override, similar to the proposal for Dataset/DataArray coordinates proposed in #9481. On DataTree, if an operation results in a node with a (non-index) coordinate with the same name as a parent coordinate, the child coordinate should be silently dropped.

An exception to this rule are operations that explicitly assign a child coordinate, e.g., DataTree.coords.__setitem__. Here we should raise an error (noting the existence of the parent coordinate) rather than being a silent no-op.

eni-awowale · 2024-09-12T17:01:42Z

Thanks for this great write up! I think in the long run this will probably help to avoid confusion and is probably the best way forward. But from more of a DAAC archiving, service provider and data validation perspective (lol I wear a lot of hats) this makes me a little nervous. At GES DISC we use DataTree to do data validation checks against server hosted services and cloud hosted services. The part that makes me nervous is that using DataTree to open netCDF4 files would be modifying the orginal file by removing the duplicate coordinate variables at each node. I don't think this will actually break anything in the backend for us but there is some complexity especialy with metadata that was mentioned.

Would folks be opposed to some kind of flag that's like inherit=True or de_duplicate=True? I think we could set that to being the default behavior and give the user an option to turn it off.

TomNicholas · 2024-09-12T17:14:39Z

Would folks be opposed to some kind of flag that's like inherit=True or de_duplicate=True? I think we could set that to being the default behavior and give the user an option to turn it off.

The proposed change here would make such a flag impossible - your intended result of de_duplicate=False would now be forbidden by the DataTree data model.

What we could perhaps do though is add a flag to constructors / openers that would raise on encountering duplicated coordinates instead of silently de-duplicating them. e.g.

dt = open_datatree(unvalidated.nc, duplicate_coords='raise')

where the error message tells you which coordinates are duplicated, and refers you to open_groups. The default could still be 'ignore' or maybe 'warn'. There's precedent for treating potential errors this way, see for example

xarray/xarray/namedarray/core.py

Line 1001 in aeaa082

missing_dims: ErrorOptionsWithWarn = "raise",

shoyer · 2024-09-12T18:17:05Z

What we could perhaps do though is add a flag to constructors / openers that would raise on encountering duplicated coordinates instead of silently de-duplicating them. e.g.

Yes, this is a great idea!

I would actually suggest such a stricter mode for open_datatree by default. Users who write data from Xarray will not encounter the issue, because Xarray will not be able to represent such datasets. So it's only something that will effect loading data from an external data provider, and it seems better to force users to make an intentional choice in such cases.

The cases where we should be more lenient by default (automatically dropping conflicts) are situations where users are constructing a new DataTree from a collection of Dataset objects, which were likely created using inherited=True.

TomNicholas · 2024-09-12T18:48:56Z

If we add the flag kwarg to DataTree.from_dict() then it can just be passed down from open_datatree immediately, and we can also explicitly write DataTree.from_dict(..., duplicate_coords='ignore') inside map_over_subtree to make our intent clear internally 🙂

TomNicholas · 2024-09-17T21:55:50Z

So summarising another meeting's worth of discussion on this... (including special guest @castelao)

The de-duplication idea has some issues.

map_over_subtree vs DataTree.__setitem__

As implied above, any de-duplication should ideally occur in the data model itself, not as a special feature of map_over_subtree. That means that the de-duplication also has to be applied for DataTree.__setitem__.

De-duplication vs Overriding

Comparing two coordinate variables to decide if they are de-duplicated could just be done by comparing names, as in #9510 (comment). But removing anything of the same name (and doing so in __setitem__) basically completely forbids "overriding" inherited coordinates with a new coordinate of the same name but different values. This would be a valid data model for DataTree, but an even more restrictive one than what we currently have (the ChainMap implementation we currently have allows for overriding any coordinate).

De-duplicating lazy variables by id

If you don't compare by names you have to compare something else. You could compare the id of the underlying Variable, i.e. id(Variable), but this will not recognize copies as being duplicates, even if their values are the same. e.g.

def mean(ds: Dataset) -> Dataset:
    return ds.mean()  # this creates a shallow copy of all the variables!

dt.map_over_subtree(mean)  # this would therefore still end up with duplicated coordinates

Only inherit index variables?

You can instead identify duplicated coordinates by comparing values directly, but this implies loading the variable into memory. Mostly our coordinate variables will be backed by in-memory indexes, so we can compare those and all would be fine. So a data model of "only inherit index-backed coordinates" works quite nicely, in that you can then always cheaply do comparison of inherited coordinates to check for de-duplication, and you can distinguish overridden from duplicated coordinates.

Accessing non-indexed inherited variables?

The fly in that ointment is that it's possible to have coordinate variables that are not backed by indexes, and as these can still be multi-dimensional they can still lazily point to large amounts of data.

If the data model is now "only inherit coordinate variables backed by indexes", then it limits the usefulness of inheritance.

If we try to do inheritance of non-indexed coordinate variables, we can't use the solution from step (4).

kmuehlbauer · 2024-09-18T05:55:53Z

My only concern is what @eni-awowale already mentioned, removing duplicated coordinates from child-groups will break the original file data model. Overriding coordinates in child groups is also a CF conventions feature, it should at least be handled correctly via some switch/flag.

This is a _partial_ solution to the duplicate coordinates issue from pydata#9475. Here we remove all duplicate coordinates between parent and child nodes with an index (these are already checked for equality via the alignment check). Other repeated coordinates (which we cannot automatically check for equality) are still allowed for now. We will need an alternative solution for these, as discussed in pydata#9475, but it is less obvious what the right solution is so I'm holding off on it for now.

This is option (4) from pydata#9475 (comment)

shoyer added needs triage Issue that has not been reviewed by xarray team member bug topic-DataTree Related to the implementation of a DataTree class and removed needs triage Issue that has not been reviewed by xarray team member labels Sep 10, 2024

shoyer changed the title ~~Arithmetic on DataTree objects should not create duplicate coordinates on sub-trees~~ Operations on DataTree objects should not create duplicate coordinates on sub-trees Sep 10, 2024

shoyer mentioned this issue Sep 12, 2024

Assigning to DataTree.coords should support full paths #9485

Open

This was referenced Sep 12, 2024

DAS-2155 - Merge datatree documentation into main docs. #9033

Merged

Track merging datatree into xarray #8572

Open

TomNicholas mentioned this issue Sep 15, 2024

DataTree repr showing inherited coordinates as non-inherited #9499

Closed

shoyer mentioned this issue Sep 17, 2024

De-duplicate inherited coordinates & dimensions #9510

Draft

3 tasks

shoyer mentioned this issue Sep 22, 2024

Remove duplicate coordinates with indexes on sub-trees #9531

Merged

1 task

TomNicholas added a commit to TomNicholas/xarray that referenced this issue Sep 23, 2024

add todo about pydata#9475

d779e22

shoyer added a commit to shoyer/xarray that referenced this issue Sep 28, 2024

Stop inheriting non-indexed coordinates for DataTree

e8a3c35

This is option (4) from pydata#9475 (comment)

shoyer mentioned this issue Sep 28, 2024

Stop inheriting non-indexed coordinates for DataTree #9555

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Operations on DataTree objects should not create duplicate coordinates on sub-trees #9475

Operations on DataTree objects should not create duplicate coordinates on sub-trees #9475

shoyer commented Sep 10, 2024 •

edited

Loading

shoyer commented Sep 10, 2024

TomNicholas commented Sep 11, 2024

shoyer commented Sep 11, 2024

TomNicholas commented Sep 11, 2024 •

edited

Loading

shoyer commented Sep 12, 2024

eni-awowale commented Sep 12, 2024

TomNicholas commented Sep 12, 2024

shoyer commented Sep 12, 2024

TomNicholas commented Sep 12, 2024 •

edited

Loading

TomNicholas commented Sep 17, 2024

kmuehlbauer commented Sep 18, 2024 •

edited

Loading

Operations on DataTree objects should not create duplicate coordinates on sub-trees #9475

Operations on DataTree objects should not create duplicate coordinates on sub-trees #9475

Comments

shoyer commented Sep 10, 2024 • edited Loading

What is your issue?

shoyer commented Sep 10, 2024

TomNicholas commented Sep 11, 2024

shoyer commented Sep 11, 2024

TomNicholas commented Sep 11, 2024 • edited Loading

IO

shoyer commented Sep 12, 2024

eni-awowale commented Sep 12, 2024

TomNicholas commented Sep 12, 2024

shoyer commented Sep 12, 2024

TomNicholas commented Sep 12, 2024 • edited Loading

TomNicholas commented Sep 17, 2024

kmuehlbauer commented Sep 18, 2024 • edited Loading

shoyer commented Sep 10, 2024 •

edited

Loading

TomNicholas commented Sep 11, 2024 •

edited

Loading

TomNicholas commented Sep 12, 2024 •

edited

Loading

kmuehlbauer commented Sep 18, 2024 •

edited

Loading