Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operations on DataTree objects should not create duplicate coordinates on sub-trees #9475

Open
shoyer opened this issue Sep 10, 2024 · 11 comments
Labels
bug topic-DataTree Related to the implementation of a DataTree class

Comments

@shoyer
Copy link
Member

shoyer commented Sep 10, 2024

What is your issue?

This is very obviously an issue using the repr from #9470:

In [20]: tree = DataTree(Dataset(coords={'x': [1, 2, 3]}))

In [21]: tree['/foo'] = DataTree(Dataset({'bar': ('x', [4, 5, 6])}))

In [22]: tree
Out[22]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 3)
│   Coordinates:
│     * x        (x) int64 24B 1 2 3
└── Group: /foo
        Dimensions:  (x: 3)
        Data variables:
            bar      (x) int64 24B 4 5 6

In [23]: tree * 2
Out[23]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 3)
│   Coordinates:
│     * x        (x) int64 24B 1 2 3
└── Group: /foo
        Dimensions:  (x: 3)
        Coordinates:
          * x        (x) int64 24B 1 2 3
        Data variables:
            bar      (x) int64 24B 8 10 12

Instead, the result for tree * 2 should look like:

<xarray.DataTree>
Group: /
│   Dimensions:  (x: 3)
│   Coordinates:
│     * x        (x) int64 24B 1 2 3
└── Group: /foo
        Dimensions:  (x: 3)
        Data variables:
            bar      (x) int64 24B 8 10 12

We probably also need to also revisit other Dataset methods that are ported to DataTree via map_over_subtree (xref #9472). Some of these (e.g., arithmetic, aggregations) can likely easily be corrected simply by mapping over nodes with inherited=False. Others (e.g., indexing) will need more careful consideration.

@shoyer shoyer added needs triage Issue that has not been reviewed by xarray team member bug topic-DataTree Related to the implementation of a DataTree class and removed needs triage Issue that has not been reviewed by xarray team member labels Sep 10, 2024
@shoyer shoyer changed the title Arithmetic on DataTree objects should not create duplicate coordinates on sub-trees Operations on DataTree objects should not create duplicate coordinates on sub-trees Sep 10, 2024
@shoyer
Copy link
Member Author

shoyer commented Sep 10, 2024

As a starting point, it might make sense to disable most automatically ported DataTree methods, at least those that can't be implemented via map_over_subtree with inherited=False.

@TomNicholas
Copy link
Contributor

I'm kind of surprised that the existing map_over_subtree tests didn't catch this.

@shoyer
Copy link
Member Author

shoyer commented Sep 11, 2024

I'm kind of surprised that the existing map_over_subtree tests didn't catch this.

This could be because assert_identical (prior to #9473) doesn't currently check this.

@TomNicholas
Copy link
Contributor

TomNicholas commented Sep 11, 2024

In the meeting just now we decided that this is a deficiency in the data model, not just in the implementation of map_over_subtree. We should change the model so that identical coordinates in children are automatically removed, so that the above would be automatically de-duplicated at construction time.

This is important to do as otherwise almost every user will run into this problem, especially using this pattern:

dt['lat'] = ...  # coord present on root

ds = dt[path].dataset
result_ds = some_operation(ds)
dt[path].dataset = result_ds  # coord now present on both root and descendant!

Currently this would duplicate the 'lat' coordinate between dt and dt[path], which would lead to very confusing behaviour, e.g.

del dt[path]['lat']
'lat' in dt[path]  # would return true!

(because only the duplicated coordinate on the child would have been deleted, not the original on the root, which would still be inherited).

IO

Disallowing duplicated coordinates in general would subtly affect round-tripping behaviour. It would no longer be possible to write out a netCDF/Zarr file with duplicated coordinates using DataTree, because DataTree could not represent them in the first place.

Round-tripping is then broken for any file with duplicated coordinates, as reading the file would de-duplicate the coordinates, and the information wouldn't be propagated in order to re-duplicate them before writing.

Mostly this wouldn't matter often - there is not much reason to define duplicated coordinates in netCDF/Zarr in the first place.

It might lead to subtle trickiness for files containing CF metadata that refers to other variables in different groups using absolute paths, but that is not something that xarray promises to handle anyway, as interpreting metadata contents is not part of xarray's data model (beyond what happens during decoding).

However again open_groups could still open such files without removing duplicates, and one could imagine writing out group-by-group to a single output file, and that should roundtrip perfectly.

All of this is a consequence of the fact that whilst very similar, the data models of netCDF, Zarr, and DataTree are not identical.

cc @owenlittlejohns @flamingbear @eni-awowale


Not sure exactly what the internal implementation of de-deduplication would be (perhaps some kind of NonDuplicatedChainMap?), but @shoyer said he is on it 🙂 A small change to map_over_subtree might also still be required to close this issue.

@shoyer
Copy link
Member Author

shoyer commented Sep 12, 2024

It should be straightforward enough to handle duplicated index coordinates. Non-identical coordinates with an index are already disallowed between parent and child nodes, because the parent and child nodes would fail the alignment check in _pre_attach.

However, there is one tricky case: duplicated coordinates without an associated index. Here's an example of how these currently manifest themselves:

In [33]: tree = DataTree(Dataset(coords={'x': 0}), children={'child': DataTree(Dataset({'y': 1}))})

In [34]: tree
Out[34]:
<xarray.DataTree>
Group: /
│   Dimensions:  ()
│   Coordinates:
│       x        int64 8B 0
└── Group: /child
        Dimensions:  ()
        Data variables:
            y        int64 8B 1

In [35]: tree.map_over_subtree(lambda ds: ds)
Out[35]:
<xarray.DataTree>
Group: /
│   Dimensions:  ()
│   Coordinates:
│       x        int64 8B 0
└── Group: /child
        Dimensions:  ()
        Coordinates:
            x        int64 8B 0
        Data variables:
            y        int64 8B 1

Notice the duplicated coordinates /x and /child/x.

For consistency with the rest of Xarray, we should try to handle this like duplicate coordinates that result from other Xarray operations like arithmetic. As noted in #9481, the current strategy requires evaluating coordinate values to see if they are equal or not in order to decide where they appear. This is attractive for DataTree as well, but unfortunately is not compatible with lazy evaluation.

Instead, I propose that we switch to the equivalent of compat='override, similar to the proposal for Dataset/DataArray coordinates proposed in #9481. On DataTree, if an operation results in a node with a (non-index) coordinate with the same name as a parent coordinate, the child coordinate should be silently dropped.

An exception to this rule are operations that explicitly assign a child coordinate, e.g., DataTree.coords.__setitem__. Here we should raise an error (noting the existence of the parent coordinate) rather than being a silent no-op.

@eni-awowale
Copy link
Collaborator

Thanks for this great write up! I think in the long run this will probably help to avoid confusion and is probably the best way forward. But from more of a DAAC archiving, service provider and data validation perspective (lol I wear a lot of hats) this makes me a little nervous. At GES DISC we use DataTree to do data validation checks against server hosted services and cloud hosted services. The part that makes me nervous is that using DataTree to open netCDF4 files would be modifying the orginal file by removing the duplicate coordinate variables at each node. I don't think this will actually break anything in the backend for us but there is some complexity especialy with metadata that was mentioned.

Would folks be opposed to some kind of flag that's like inherit=True or de_duplicate=True? I think we could set that to being the default behavior and give the user an option to turn it off.

@TomNicholas
Copy link
Contributor

Would folks be opposed to some kind of flag that's like inherit=True or de_duplicate=True? I think we could set that to being the default behavior and give the user an option to turn it off.

The proposed change here would make such a flag impossible - your intended result of de_duplicate=False would now be forbidden by the DataTree data model.

What we could perhaps do though is add a flag to constructors / openers that would raise on encountering duplicated coordinates instead of silently de-duplicating them. e.g.

dt = open_datatree(unvalidated.nc, duplicate_coords='raise')

where the error message tells you which coordinates are duplicated, and refers you to open_groups. The default could still be 'ignore' or maybe 'warn'. There's precedent for treating potential errors this way, see for example

missing_dims: ErrorOptionsWithWarn = "raise",

@shoyer
Copy link
Member Author

shoyer commented Sep 12, 2024

What we could perhaps do though is add a flag to constructors / openers that would raise on encountering duplicated coordinates instead of silently de-duplicating them. e.g.

Yes, this is a great idea!

I would actually suggest such a stricter mode for open_datatree by default. Users who write data from Xarray will not encounter the issue, because Xarray will not be able to represent such datasets. So it's only something that will effect loading data from an external data provider, and it seems better to force users to make an intentional choice in such cases.

The cases where we should be more lenient by default (automatically dropping conflicts) are situations where users are constructing a new DataTree from a collection of Dataset objects, which were likely created using inherited=True.

@TomNicholas
Copy link
Contributor

TomNicholas commented Sep 12, 2024

If we add the flag kwarg to DataTree.from_dict() then it can just be passed down from open_datatree immediately, and we can also explicitly write DataTree.from_dict(..., duplicate_coords='ignore') inside map_over_subtree to make our intent clear internally 🙂

@TomNicholas
Copy link
Contributor

So summarising another meeting's worth of discussion on this... (including special guest @castelao)

The de-duplication idea has some issues.

  1. map_over_subtree vs DataTree.__setitem__

As implied above, any de-duplication should ideally occur in the data model itself, not as a special feature of map_over_subtree. That means that the de-duplication also has to be applied for DataTree.__setitem__.

  1. De-duplication vs Overriding

Comparing two coordinate variables to decide if they are de-duplicated could just be done by comparing names, as in #9510 (comment). But removing anything of the same name (and doing so in __setitem__) basically completely forbids "overriding" inherited coordinates with a new coordinate of the same name but different values. This would be a valid data model for DataTree, but an even more restrictive one than what we currently have (the ChainMap implementation we currently have allows for overriding any coordinate).

  1. De-duplicating lazy variables by id

If you don't compare by names you have to compare something else. You could compare the id of the underlying Variable, i.e. id(Variable), but this will not recognize copies as being duplicates, even if their values are the same. e.g.

def mean(ds: Dataset) -> Dataset:
    return ds.mean()  # this creates a shallow copy of all the variables!

dt.map_over_subtree(mean)  # this would therefore still end up with duplicated coordinates
  1. Only inherit index variables?

You can instead identify duplicated coordinates by comparing values directly, but this implies loading the variable into memory. Mostly our coordinate variables will be backed by in-memory indexes, so we can compare those and all would be fine. So a data model of "only inherit index-backed coordinates" works quite nicely, in that you can then always cheaply do comparison of inherited coordinates to check for de-duplication, and you can distinguish overridden from duplicated coordinates.

  1. Accessing non-indexed inherited variables?

The fly in that ointment is that it's possible to have coordinate variables that are not backed by indexes, and as these can still be multi-dimensional they can still lazily point to large amounts of data.

If the data model is now "only inherit coordinate variables backed by indexes", then it limits the usefulness of inheritance.

If we try to do inheritance of non-indexed coordinate variables, we can't use the solution from step (4).

@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Sep 18, 2024

My only concern is what @eni-awowale already mentioned, removing duplicated coordinates from child-groups will break the original file data model. Overriding coordinates in child groups is also a CF conventions feature, it should at least be handled correctly via some switch/flag.

shoyer added a commit to shoyer/xarray that referenced this issue Sep 22, 2024
This is a _partial_ solution to the duplicate coordinates issue from pydata#9475.

Here we remove all duplicate coordinates between parent and child nodes
with an index (these are already checked for equality via the alignment
check).

Other repeated coordinates (which we cannot automatically check for
equality) are still allowed for now. We will need an alternative
solution for these, as discussed in pydata#9475, but it is less obvious
what the right solution is so I'm holding off on it for now.
TomNicholas added a commit to TomNicholas/xarray that referenced this issue Sep 23, 2024
shoyer added a commit to shoyer/xarray that referenced this issue Sep 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

No branches or pull requests

4 participants