Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexible indexes: how to handle possible dimension vs. coordinate name conflicts? #5738

Closed
benbovy opened this issue Aug 25, 2021 · 4 comments

Comments

@benbovy
Copy link
Member

benbovy commented Aug 25, 2021

Another thing that I've noticed while working on #5692.

Currently it is not possible to have a Dataset with a same name used for both a dimension and a multi-index level. I guess the reason is to prevent some errors like unmatched dimension sizes when eventually the multi-index is dropped with renamed dimension(s) according to the level names (e.g., with sel or unstack). See #2299.

I'm wondering how we should handle this in the context of flexible / custom indexes:

A. Keep this current behavior as a special case for (pandas) multi-indexes. This would avoid breaking changes but how to support custom indexes that could eventually be used like pandas multi-indexes in sel or stack?

B. Introduce some tag in xarray.Index so that we can identify a multi-coordinate index that behaves like a hierarchical index (i.e., levels may be dropped into a single index/coordinate with dimension renaming)

C. Do not allow any dimension name matching the name of a coordinate attached to a multi-coordinate index. This seems silly?

D. Eventually revert #2353 and let users taking care of potential conflicts.

@benbovy
Copy link
Member Author

benbovy commented Aug 26, 2021

@fujiisoup @shoyer @aseyboldt

@benbovy
Copy link
Member Author

benbovy commented Aug 26, 2021

My initial thoughts was to opt for D because it is easier to maintain (less special cases, less complexity, internally we could get rid of assert_unique_multiindex_level_names after dropping multi-index virtual variables) and also because an error is raised when there's actually an issue, e.g., we could do stack and never plan to do unstack later so in that case multi-index level vs. dimension name conflict is not an issue?

I was thinking about this kind of behavior (even though reverting previous changes is not ideal):

ds = xr.Dataset(coords={'dim0': ['a', 'b'], 'dim1': [0, 1]})
ds = ds.stack(dim_stacked=['dim0', 'dim1'])

# This works: dim0 is not a dimension in b
ds['c'] = (('dim0',), [10, 11, 12, 13, 14, 15])

# raise a nice error message here: conflicting sizes for dimension 'dim0'
ds.unstack(dim_stacked)

# raise a nice error message here: conflicting sizes for dimension 'dim0'
ds.sel(dim1=0)

However, this may be confusing in the case of integer-based vs. label based indexing:

# label-based selection along `dim_stacked` dimension (length=4)
ds.sel(dim0='a')

# integer-based selection along `dim0` dimension (length=6)
ds.isel(dim0=0)

So a more general rule like option C is not that silly after all?

If a dimension name matches the name of a coordinate:

  • Ok if this coordinate is a "dimension" coordinate (coordinate name == dimension name)
  • (maybe ok if this this coordinate is not indexed?)
  • Error otherwise

@shoyer
Copy link
Member

shoyer commented Aug 26, 2021

In the long term, I think we want to eliminate any requirements in Xarray's data model about what variables names are OK.

In particular:

  1. We want to allow a coordinate named "foo" with different dimensions than ("foo",), in order to support reading all possible netCDF files into xarray.
  2. We want to allow indexing with coordinates that are not dimensions.

With this second change in particular, it should not be a problem to have a multi-index level with the same name as a dimension.

It is true that this change will introduce an inconsistency between appropriate keys for use with sel (indexed coordinates) and isel (dimensions). But I think this is basically inevitable.

@benbovy
Copy link
Member Author

benbovy commented Aug 23, 2023

Both 1 and 2 are now supported with v2023.8.0, so I think we can close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

No branches or pull requests

2 participants