Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xarray's explicit indexes #35

Open
Huite opened this issue Jan 4, 2023 · 5 comments
Open

Xarray's explicit indexes #35

Huite opened this issue Jan 4, 2023 · 5 comments

Comments

@Huite
Copy link
Collaborator

Huite commented Jan 4, 2023

For examples, see:

https://github.com/martinfleis/xvec
https://github.com/dcherian/crsindex/blob/main/crsindex.ipynb

They can be set with:
https://docs.xarray.dev/en/stable/generated/xarray.DataArray.set_xindex.html

One issue is that in the UGRID conventions, the connectivity arrays are 2D. So the first dimension matches whatever's going on in the data (node, edge, face), but the second doesn't. This can be avoided with a trick using numpy structured arrays:

import numpy as np
import xarray as xr

ds = xr.open_dataset("Grevelingen_FM_grid_20190603_0001_net.nc")
faces2d = ds["NetElemNode"].values.astype(int) - 1  # face node_connectivity
triangle = np.dtype([("0", int), ("1", int), ("2", int)])
faces = np.empty(len(faces2d), dtype=triangle)
faces.view(int)[:] = faces2d.ravel()
ds = ds.assign_coords(faces=("nNetElem", faces))

This feels like a hack, and doesn't seem like a serious option.
An alternative approach is perhaps using a Pandas ExtensionArray. This seems like a lot of work: https://stackoverflow.com/questions/68893521/simple-example-of-pandas-extensionarray

Next, we can define a UgridIndex:

class Ugrid2dIndex(Index):
    def __init__(
        self,
        node_x,
        node_y,
        faces,
    ):
        self._ugrid = xu.Ugrid2d(node_x.values, node_y.values, -1, faces.values.view(int).reshape((-1, 3)))

    @classmethod
    def from_variables(cls, variables, options):
        return cls(
            variables["NetNode_x"],
            variables["NetNode_y"],
            variables["faces"],
        )

This seems to do the trick:

newds = ds.set_xindex(
    (
        "NetNode_x",
        "NetNode_y",
        "faces",
    ),
    Ugrid2dIndex,
)

Now we could start adding methods to the Ugrid2dIndex and make it possible to select values with sel. Or alternatively provide other operations via a .ugrid accessor.

However, it doesn't solve all problems that we encounter with UGRID topologies (?):

  • Node, edge, and face dimension are fully separated according to xarray's data model. Selecting one but not the other will generally invalidate the data <-> topology link. (So solve via a .ugrid.sel and .ugrid.isel instead.)
  • A full UGRID topology requires possibly connectivities with faces, edges, nodes, and possible interrelations. As far I can tell, the index is not carried around for every variable. (Solve by using multiple indexes internally containing full grid? One for faces, one for edges (with edge node connectivity). But seems impractical for data on nodes?)

I think the fundamental question is whether a UgridDataset is properly considered as an xarray Dataset: I don't think so, as it breaks a premise of fully separated, orthogonal dimensions. From that perspective, it may be better to accept this, and special case the UGRID topology dimensions.

@Huite
Copy link
Collaborator Author

Huite commented Jan 6, 2023

More notes:

These explicit indexes are currently insufficiently mature (e.g. tries to convert to PandasIndex now when accessing via ds.indexes), but once mature enough:

  • The UGRID topology could be fully contained within an explicit index. Building the index still requires a numpy structured hack or pandas ExtensionArray to accept the connectivity arrays as coordinates, or requires changes to set_xindex.
  • The indexes cannot be relied upon to persist through many operations, as they may generate invalid topologies. This is currently also the case if (node, face, edge) dimensions are given indexes, but there's no upfront warnings or errors with standard Datasets.

Without changes to the explicit indexes:

  • No UgridDataArray exists, since there's multiple variables required retain the topology information, which isn't split off.
  • Specific operations occur through the .ugrid accessor, generally mentioning the variable name. uds["water_depth"].ugrid.plot() would become uds.ugrid.plot("water_depth").
  • That might get a little cumbersome for operations on multiple Ugrid objects? E.g.: uds.ugrid.func(uds2, "var1", "var2")
  • All operations basically require mutation of the dataset.
  • I.e. ergonomics seem worse in general.

The largest benefit is that it's a good deal simpler. No need to juggle between xarray <-> xugrid types. No need to forward calls, which still feels a bit haphazard to me as it currently stands. Also no custom type which trip up checks in other packages that do understand xarray objects.

@Huite
Copy link
Collaborator Author

Huite commented Jan 6, 2023

Alternatively, maybe a topology and its data should be grouped together in a completely different way, see: https://xarray-datatree.readthedocs.io/en/latest/

This also has a similar drawback, which is that the API looks quite different when working with unstructured viz a viz structured data, which I'd really like to avoid.

@dcherian
Copy link

Node, edge, and face dimension are fully separated according to xarray's data model.

I think you can handle this in Index.sel which will be called by Dataset.sel

One issue is that in the UGRID conventions, the connectivity arrays are 2D. So the first dimension matches whatever's going on in the data (node, edge, face), but the second doesn't.

The same problem exists with handling CF style bounds arrays and IntervalIndex, so I think we have enough motivation to construct a proper solution in Xarray. I'm not entirely sure what that solution looks like (both structured and extension arrays have come up in the discussion).

@Huite
Copy link
Collaborator Author

Huite commented Jan 24, 2023

I think you can handle this in Index.sel which will be called by Dataset.sel

Yes, I think so too now. That would be a very comfortable solution.

The same problem exists with handling CF style bounds arrays and IntervalIndex, so I think we have enough motivation to construct a proper solution in Xarray.

Ah, right -- that's good news!

Ideally, the same index could be associated with all three dimensions somehow (... I think). I think that's the only exceptional part of the UGRID topology versus the other uses of the explicit indexes.

I'm itching to find out what this entails, hopefully I can free up some time to investigate.

@dcherian
Copy link

, the same index could be associated with all three dimensions somehow

This is already possible with Datasets today. In https://github.com/dcherian/xindexes/blob/main/crsindex.ipynb, CRSIndex is associated with the x dimension, y dimension, and spatial_ref scalar variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants