Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design for IntervalIndex #8005

Open
dcherian opened this issue Jul 19, 2023 · 5 comments
Open

Design for IntervalIndex #8005

dcherian opened this issue Jul 19, 2023 · 5 comments

Comments

@dcherian
Copy link
Contributor

dcherian commented Jul 19, 2023

Is your feature request related to a problem?

We should add a wrapper for pandas.IntervalIndex this would solve a long standing problem around propagating "bounds" variables (CF conventions, #1475)

The CF design

CF "encoding" for intervals is to use bounds variables. There is an attribute "bounds" on the dimension coordinate, that refers to a second variable (at least 2D). Example: x has an attribute bounds that refers to x_bounds.

import numpy as np

left = np.arange(0.5, 3.6, 1)
right = np.arange(1.5, 4.6, 1)
bounds = np.stack([left, right])

ds = xr.Dataset(
    {"data": ("x", [1, 2, 3, 4])},
    coords={"x": ("x", [1, 2, 3, 4], {"bounds": "x_bounds"}), "x_bounds": (("bnds", "x"), bounds)},
)
ds
image

A fundamental problem with our current data model is that we lose x_bounds when we extract ds.data because there is a dimension bnds that is not shared with ds.data. Very important metadata is now lost!

We would also like to use the "bounds" to enable interval based indexing. ds.sel(x=1.1) should give you the value from the appropriate interval.

Pandas IntervalIndex

All the indexing is easy to implement by wrapping pandas.IntervalIndex, but there is one limitation. pd.IntervalIndex saves two pieces of information for each interval (left bound, right bound). CF saves three : left bound, right bound (see x_bounds) and a "central" value (see x). This should be OK to work around in our wrapper.

Fundamental Question

To me, a core question is whether x_bounds needs to be preserved after creating an IntervalIndex.

  1. If so, we need a better rule around coordinate variable propagation. In this case, the IntervalIndex would be associated with x and x_bounds. So the rule could be

    "propagate all variables necessary to propagate an index associated with any of the dimensions on the extracted variable."

    So when extracting ds.data we propagate all variables necessary to propagate indexes associated with ds.data.dims that is x which would say "propagate x, x_bounds, and the IntervalIndex.

  2. Alternatively, we could choose to drop x_bounds entirely. I interpret this approach as "decoding" the bounds variable to an interval index object. When saving to disk, we would encode the interval index in two variables. (See below)

Describe the solution you'd like

I've prototyped (2) [approach 1 in this notebook) following @benbovy's suggestion

from xarray import Variable
from xarray.indexes import PandasIndex


class XarrayIntervalIndex(PandasIndex):
    def __init__(self, index, dim, coord_dtype):
        assert isinstance(index, pd.IntervalIndex)

        # for PandasIndex
        self.index = index
        self.dim = dim
        self.coord_dtype = coord_dtype

    @classmethod
    def from_variables(cls, variables, options):
        assert len(variables) == 1
        (dim,) = tuple(variables)
        bounds = options["bounds"]
        assert isinstance(bounds, (xr.DataArray, xr.Variable))

        (axis,) = bounds.get_axis_num(set(bounds.dims) - {dim})
        left, right = np.split(bounds.data, 2, axis=axis)
        index = pd.IntervalIndex.from_arrays(left.squeeze(), right.squeeze())
        coord_dtype = bounds.dtype

        return cls(index, dim, coord_dtype)

    def create_variables(self, variables):
        from xarray.core.indexing import PandasIndexingAdapter

        newvars = {self.dim: xr.Variable(self.dim, PandasIndexingAdapter(self.index))}
        return newvars

    def __repr__(self):
        string = f"Xarray{self.index!r}"
        return string

    def to_pandas_index(self):
        return self.index

    @property
    def mid(self):
        return PandasIndex(self.index.right, self.dim, self.coord_dtype)

    @property
    def left(self):
        return PandasIndex(self.index.right, self.dim, self.coord_dtype)

    @property
    def right(self):
        return PandasIndex(self.index.right, self.dim, self.coord_dtype)
ds1 = (
    ds.drop_indexes("x")
    .set_xindex("x", XarrayIntervalIndex, bounds=ds.x_bounds)
    .drop_vars("x_bounds")
)
ds1
image
ds1.sel(x=1.1)
image

Describe alternatives you've considered

I've tried some approaches in this notebook

@dcherian
Copy link
Contributor Author

dcherian commented Jul 19, 2023

I guess there is a more general framing of this problem.

rioxarray chooses to store CRS information on a scalar variable named spatial_ref. Because it is a scalar, there are no issues around propagating that information as a coordinate variable. However, once a CRSIndex is created, should spatial_ref still be preserved or not?

Or more generally, should all the variables needed to construct an Index be preserved after the Index is created. If so, we'll need a new rule around coordinate variable propagation to enable IntervalIndex

cc @scottyhq @snowman2

@snowman2
Copy link
Contributor

However, once a CRSIndex is created, should spatial_ref still be preserved or not?

  • One of factors in the design of rioxarray is to be able to call to_netcdf and have it create a geospatially correct netCDF file. If you remove spatial_ref or other coordinates with the CRS information, it will no longer be correct. However, if xarray adds it back in before writing, that would resolve the issue.
  • If you remove it, rioxarray will need a way to find the CRS from the index.

@benbovy
Copy link
Member

benbovy commented Jul 19, 2023

Or more generally, should all the variables needed to construct an Index be preserved after the Index is created. If so, we'll need a new rule around coordinate variable propagation to enable IntervalIndex

Maybe this could be entirely left to the custom index whether to keep (and propagate) some coordinates or discard them after the index is created and/or after some other operation?

The API of Index should be flexible enough to allow that (if it is not, should we make it more flexible?). For example:

  • Index.create_variables(variables): takes a mapping Xarray variables and returns another mapping of variables. This is where one could drop or replace coordinates after creating the index (not sure it would work currently but we could easily fix that I guess)
  • Index.sel: coordinate variables may be propagated or dropped in the returned result

@benbovy
Copy link
Member

benbovy commented Aug 24, 2023

@dcherian did you experiment further with this? I might want to take a stab at it, maybe in a draft PR.

CF saves three : left bound, right bound (see x_bounds) and a "central" value (see x). This should be OK to work around in our wrapper.

One solution for this case is to:

  • transform the "x_bounds" coordinate
    • reshape from either (N+1) or (N, 2) to (N)
    • rename its dimension "bnds" -> "x"
    • wrap a pd.IntervalArray as .data
  • bind the new IntervalIndex to both the "x" coordinate and the (transformed) "x_bounds" coordinate.

To transform the indexed "x_bounds" coordinate back to its original form (for serialization), we could add a IntervalIndex.expand_bound_dims() method. The "bnds" dimension name could be provided by the user or be retrieved from the info saved in the index object. This operation would drop the IntervalIndex for "x" and "x_bounds" and restore the PandasIndex for "x". For convenience, we could also add Dataset.interval.expand_bound_dims() (accessor) that does that for all IntervalIndex objects found in the dataset.

If that makes sense, maybe IntervalIndex could wrap both a regular pd.Index for "x" and a pd.IntervalIndex for "x_bounds". This would allow supporting different kinds of selection:

ds.sel(x=15, method="nearest")   # dispatch to pd.Index (method 'nearest' not supported for pd.IntervalIndex)
ds.sel(x_bounds=3.5)             # dispatch to pd.IntervalIndex

This solution allows to fully leverage the interval index (central values + intervals) in a way that is unambiguous and compatible with the current rules of coordinate variable propagation for DataArrays.

@dcherian
Copy link
Contributor Author

dcherian commented Sep 9, 2023

did you experiment further with this? I might want to take a stab at it, maybe in a draft PR

Of course!

We discussed this a bit in a meeting a while ago and there was broad agreement that we should consider updating our rule to propagate all indexes associated with a DataArray's dimensions. I think this would be a good experimental PR.

To transform the indexed "x_bounds" coordinate back to its original form (for serialization)

Personally, I think it would be nice to minimize these transformations to avoid having to teach users new concepts when writing to disk

ds.sel(x=15, method="nearest")   # dispatch to pd.Index (method 'nearest' not supported for pd.IntervalIndex)
ds.sel(x_bounds=3.5)             # dispatch to pd.IntervalIndex

I prefer just ds.sel(x=15) because the bounds have physical meaning. We could support exact indexing to the central value with ds.sel(x=15, method="exact") or method="center" (i do not think this should be the default)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants