scalar_level in MultiIndex #1426
xarray/tests/test_dataset.py
Dimensions: (x: 2)
Coordinates:
* x (x) MultiIndex
- level_1 <U1 'a'
The test now fails here on Python 2.7. Python 2.7 seems to interpret the dtype of str as |S1, not <U1. This should be fixed, but it is not related to the scalar-level of MultiIndex.
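The mismatch arises because Python 2's native str is a byte string, for which NumPy infers an S dtype, while Python 3's str is unicode and gets a U dtype. A minimal sketch, run on Python 3 and using b"a" as a stand-in for a Python 2 str (the helper name `is_string_kind` is made up for illustration):

```python
import numpy as np

# Python 3 text string: NumPy infers a unicode dtype of length 1.
text_dtype = np.array("a").dtype

# Byte string (what a bare 'a' is on Python 2): NumPy infers a bytes dtype.
bytes_dtype = np.array(b"a").dtype


def is_string_kind(dtype):
    """Hypothetical helper: True for either unicode ('U') or bytes ('S') dtypes."""
    return np.dtype(dtype).kind in ("U", "S")


print(text_dtype.kind, bytes_dtype.str)
```

A test that only cares whether a level holds strings could compare dtype kinds in this way rather than matching the exact repr text.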
Sorry for the delay getting back to you here -- I'm still thinking through the implications of this change. This does make the handling of However, taking a step back, I wonder if this is the right approach. In many ways, structured dtypes are similar to xarray's existing data structures, so supporting them fully means a lot of duplicated functionality. MultiIndexes (especially with scalars) should work similarly to separate variables, but they are implemented very differently under the hood (all the data lives in one variable). (See pandas-dev/pandas#3443 for related discussion about pandas and It occurs to me that if we had full support for indexing on coordinate levels, we might not need a notion of a "MultiIndex" in the public API at all. To make this more concrete, what if this was the
If we supported Pandas has CC @benbovy |
@shoyer Thanks for the comment.
Actually I am not yet fully comfortable with my implementation, If my understanding is correct, does it mean that we will support |
This would be awesome and so much clearer for many users including me, who understand "coordinates" much better than "MultiIndex". |
I also fully agree that using multiple coordinate (index) variables instead of a A dimension with a single 'real' coordinate (i.e., an Using multiple 'real' coordinates, I don't see any reason why
I'm thinking about something like this:
It may present several advantages:
I was only thinking about @benbovy although a Right now, our user facing API in xarray exposes three related concepts:
Eliminating any of these concepts would be an improvement. To this end, I have two (vague) proposals:
@shoyer
<xarray.Dataset>
Dimensions: (yx: 6)
Coordinates:
y (yx) object 'a' 'a' 'a' 'b' 'b' 'b'
Data variables:
foo (yx) int64 1 2 3 4 5 6
(which may be generated by indexing from
<xarray.Dataset>
Dimensions: (y: 6)
Coordinates:
* y (y) object 'a' 'a' 'a' 'b' 'b' 'b'
Data variables:
foo (y) int64 1 2 3 4 5 6
What is the possible confusion if we adopt 2?
@fujiisoup I agree that, given your example, proposal 2 might be more intuitive; however, IMHO implicit indexes do seem a bit too magical. Although I don't have a concrete example in mind, I guess that it would sometimes be hard to really understand what's going on. Exposing fewer concepts to users would indeed be an improvement, unless it makes things too implicit or magical. Let me try to give a more detailed proposal than in my previous comment, one that generalizes to potential features like multi-dimensional indexers (see @shoyer's comment, which I'd be happy to start working on soon). It is actually very much like proposal 1, with only one additional concept (called "super index" below).
Examples of super indexes:
"Super index" is an additional concept that has to be understood by users, which is in principle bad, but here I think it's worth it, as it potentially gives a good generic model for explicit handling of various advanced indexes that involve multiple coordinates.
@benbovy I think I like your proposal, which bundles multiple concepts in xarray such as Currently, 'rasm' example is like
In [1]: import xarray as xr
In [2]: xr.tutorial.load_dataset('rasm', decode_times=False)
Out[2]:
<xarray.Dataset>
Dimensions: (time: 36, x: 275, y: 205)
Coordinates:
* time (time) float64 7.226e+05 7.226e+05 7.227e+05 7.227e+05 ...
xc (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
yc (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
Dimensions without coordinates: x, y
Data variables:
Tair (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes:
...
Does your proposal (automatically) change this like
<xarray.Dataset>
Dimensions: (time: 36, xy: 56375)
Coordinates:
* time (time) float64 7.226e+05 7.226e+05 7.227e+05 7.227e+05 ...
xc (xy) float64 189.2 189.0 188.7 188.5 188.2 187.9 187.7 187.4 ...
yc (xy) float64 16.53 16.69 16.85 17.01 17.17 17.32 17.48 17.63 ...
* xy (xy) SuperIndex
- x (xy) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
- y (xy) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
Tair (time, xy) float64 nan nan nan nan nan nan nan nan nan nan nan ...
Attributes:
...
?
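To make the stacked layout in the question concrete, here is a rough sketch using plain pandas and NumPy rather than xarray. It assumes the ordering implied by the repr above (x varies slowest, y fastest) and uses a placeholder array in place of real Tair values; none of this is taken from the PR itself.

```python
import numpy as np
import pandas as pd

nx, ny = 275, 205  # sizes of the x and y dimensions in the 'rasm' repr

# x varies slowest, matching "x: 0 0 0 ..." and "y: 0 1 2 ..." in the listing.
xy = pd.MultiIndex.from_product(
    [np.arange(nx), np.arange(ny)], names=["x", "y"]
)

# Placeholder (y, x) field standing in for one time step of Tair.
field = np.arange(ny * nx, dtype=float).reshape(ny, nx)

# Transpose to (x, y) before flattening so the values line up with `xy`.
stacked = pd.Series(field.T.reshape(-1), index=xy)

print(len(xy))  # 275 * 205 = 56375, the size of the xy dimension
```

Looking up `stacked.loc[(3, 7)]` returns the same value as `field[7, 3]`, which shows that the stacked index only relabels positions that the two integer dimensions already identify.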
Although I haven't thought about all the details regarding this, I think that in the case of multi-dimensional coordinates a "super index" would rather allow directly using these coordinates for indexing, which is currently not possible. In your 'rasm' example, it would rather look like
<xarray.Dataset>
Dimensions: (time: 36, x: 275, y: 205)
Dimensions without coordinates: y, x
Coordinates:
* time (time) float64 7.226e+05 7.226e+05 7.227e+05 7.227e+05 ...
* spatial_index (y, x) KDTree
- xc (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
- yc (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
Dimensions without coordinates: x, y
Data variables:
Tair (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes:
...
and it would allow writing
In [1]: ds.sel(xc=<...>, yc=<...>, method='nearest')
Note that That's actually what @shoyer suggested here. The proposal above is more about having the same API for groups of coordinates that can be indexed using a "wrapped" index object (maybe "wrapped index" is a better name than "super index"?), but the logic can be very different from one index object to another.
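As an illustration of what a nearest-neighbour lookup on two-dimensional xc and yc coordinates would have to compute, here is a small sketch. It swaps in a brute-force distance search for the KDTree mentioned above, uses a synthetic 3 x 4 grid instead of the 'rasm' data, and treats the coordinates as planar; the function name `nearest_cell` is hypothetical and not part of xarray.

```python
import numpy as np


def nearest_cell(xc, yc, x_target, y_target):
    """Return the (y, x) integer position of the cell closest to the target.

    Brute-force stand-in for a KDTree query; uses squared Euclidean
    distance in coordinate space (no spherical geometry).
    """
    dist2 = (xc - x_target) ** 2 + (yc - y_target) ** 2
    return np.unravel_index(np.argmin(dist2), xc.shape)


# Synthetic 2-D coordinates with dimensions (y, x) = (3, 4).
yy, xx = np.meshgrid(np.arange(3.0), np.arange(4.0), indexing="ij")
xc = 189.0 + 0.2 * xx + 0.05 * yy
yc = 16.5 + 0.25 * yy

iy, ix = nearest_cell(xc, yc, x_target=189.42, y_target=16.77)
print(iy, ix)
```

The returned integer positions could then be passed to positional indexing along y and x, which is roughly the translation a wrapped index object would perform behind a label-based call.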
I'll close this for the recent discussion about MultiIndex
git diff upstream/master | flake8 --diff
whats-new.rst for all changes and api.rst for new API
[Edit for more clarity]
I started a new branch to fix #1408 (I closed the older one, #1412).
Because the changes I made are relatively large, I summarize this PR here.
Summary
In this PR, I added two kinds of levels to MultiIndex, index-level and scalar-level.
index-level is an ordinary level in MultiIndex (as in the current implementation), while scalar-level indicates a dropped level (which is newly added in this PR).
Changes in behaviors.
scalar-level instead of dropping that level (changed from MultiIndex and data selection #767).
MultiIndex-scalar rather than a scalar of tuple.
index-level if the MultiIndex has only a single index-level.
Examples of the output are shown below.
Any suggestions for these behaviors are welcome.
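For context on the behaviors listed above, the sketch below first shows what plain pandas does (the selected level is dropped, and a single element is a tuple), then a deliberately tiny way of remembering the selected value instead of discarding it. The class `ScalarLevelIndex` and its `sel_level` method are hypothetical names invented for this illustration; they are not the PR's PandasMultiIndexAdapter and do not reflect its actual design.

```python
import pandas as pd

midx = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1]], names=["level_1", "level_2"]
)
series = pd.Series(range(4), index=midx)

# Plain pandas: selecting one value of level_1 drops that level entirely.
dropped = series.xs("a", level="level_1")

# Plain pandas: a single element of a MultiIndex is a plain tuple.
first = midx[0]


class ScalarLevelIndex:
    """Hypothetical illustration: an index plus the levels reduced to scalars."""

    def __init__(self, index, scalar_levels=None):
        self.index = index
        self.scalar_levels = dict(scalar_levels or {})

    def sel_level(self, name, value):
        # Keep rows matching `value`, drop the level from the index,
        # but remember the selected value as a scalar level.
        # (Only valid while the index still has more than one level.)
        mask = self.index.get_level_values(name) == value
        remaining = self.index[mask].droplevel(name)
        return ScalarLevelIndex(remaining, {**self.scalar_levels, name: value})


kept = ScalarLevelIndex(midx).sel_level("level_1", "a")
print(kept.index.name, kept.scalar_levels)
```

The point of the sketch is only that the information about level_1 survives the selection, which is the difference between a scalar-level and a dropped level.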
Changes in the public APIs
Some changes were necessary to the public APIs, though I tried to minimize them.
level_names, get_level_values methods were moved from IndexVariable to Variable. This is because IndexVariable cannot handle 0-d arrays, which I want to support in 2.
scalar_level_names and all_level_names properties were added to Variable.
reset_levels method was added to the Variable class to control scalar-level and index-level.
Implementation summary
The main change in the implementation is the addition of our own wrapper of pd.MultiIndex, PandasMultiIndexAdapter.
This performs most MultiIndex-related operations, such as indexing, concatenation, and conversion between scalar-level and index-level.
What we can do now
The main merit of this proposal is that it enables us to handle MultiIndex in a way more consistent with the normal Variable.
Now we can
What we cannot do now
With the current implementation, we can do
but with this PR we cannot, because x is not yet an ordinary coordinate, but a MultiIndex with a single index-level.
I think it would be better if we could handle such a MultiIndex with a single index-level in much the same way as an ordinary coordinate.
Similarly, we cannot do ds.sel(y='a').mean(dim='x') either.
Also, ds.sel(y='a').to_netcdf('file') (#719)
What are to be decided
repr these new levels (Current formatting is shown in Out[2] and Out[3] above.)
index-level, scalar-level, MultiIndex-scalar are clear enough?
index-level MultiIndex?
Do we support ds.sel(y='a').rolling(x=2) and ds.sel(y='a').mean(dim='x')?
TODOs
ds.sel(x=ds.x[0])
stack, unstack, set_index, reset_index methods with scalar-level MultiIndex.