Explicit indexes in xarray's data-model (Future of MultiIndex) #1603
I'm using Consider the following example:
In [1]: import numpy as np
...: import xarray as xr
...: da = xr.DataArray(np.arange(5), dims=['x'],
...: coords={'experiment': ('x', [0, 0, 0, 1, 1]),
...: 'time': ('x', [0.0, 0.1, 0.2, 0.0, 0.15])})
...:
In [2]: da
Out[2]:
<xarray.DataArray (x: 5)>
array([0, 1, 2, 3, 4])
Coordinates:
experiment (x) int64 0 0 0 1 1
time (x) float64 0.0 0.1 0.2 0.0 0.15
Dimensions without coordinates: x
I want to do something like da.sel(experiment=0).sel(time=0.1), but that is not possible.
In [2]: da = da.set_index(exp_time=['experiment', 'time'])
...: da
...:
Out[2]:
<xarray.DataArray (x: 5)>
array([0, 1, 2, 3, 4])
Coordinates:
* exp_time (exp_time) MultiIndex
- experiment (exp_time) int64 0 0 0 1 1
- time (exp_time) float64 0.0 0.1 0.2 0.0 0.15
Dimensions without coordinates: x
If we could make a selection from a non-index coordinate, I think there should be other important use cases of |
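For reference, here is a minimal sketch of how the desired selection can be written once the two coordinates are combined into a MultiIndex. It assumes the index is attached to the existing dimension `x` (not to a new name such as `exp_time`) and that the installed xarray version builds a `pandas.MultiIndex` from `set_index`; the variable names `indexed`, `direct` and `chained` are illustrative only.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(5),
    dims=["x"],
    coords={
        "experiment": ("x", [0, 0, 0, 1, 1]),
        "time": ("x", [0.0, 0.1, 0.2, 0.0, 0.15]),
    },
)

# Assumption: the MultiIndex is attached to the existing dimension "x".
indexed = da.set_index(x=["experiment", "time"])

# Select on both levels at once, or on one level after the other.
direct = indexed.sel(experiment=0, time=0.1)
chained = indexed.sel(experiment=0).sel(time=0.1)

print(direct.item(), chained.item())
```

This only works because both coordinates were turned into index levels first, which is the workaround this comment is questioning.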
One API design challenge here is that I think we still want an explicit notation of "indexed" variables. We could possibly allow operations like |
I sometimes find it helpful to think about what the right For example, we might imagine that "Indexes" are no longer coordinates, but instead their own entry in the repr:
"Indexes" might not even need to be part of the main
In this model:
|
I think we currently assume It sounds like a much cleaner data model. |
Yes, exactly. We actually already have an attribute that works like this, but it's currently computed lazily, from either |
I think that promoting "Indexes" to a first-class concept is indeed a very good idea, at both internal and public levels, even if at the latter level it would be another concept for users (it should be already familiar for pandas users, though). IMHO the "coordinate" and "index" concepts are different enough to consider them separately. I like the proposed repr for I have to think a bit more about the details but I like the idea. |
@shoyer, could you add more details of this idea?
|
We would still assign default indexes (using a normal Another aspect to consider is how to handle alignment when you have indexes along non-dimension coordinates. Probably the most elegant rule would again be to check all indexed variables for exact matches. Directly assigning indexes rather than using this default or For performance reasons, we probably do not want to actually check the values of manually assigned indexes, although we should verify that the shape matches. (We would have a clear disclaimer that if you manually assign an index with mismatched values, the behavior is not well defined.) In principle, this data model would allow for two mostly equivalent indexing schemes:
Yes, this is a little unfortunate. We could potentially make a custom wrapper for use in
Every entry in |
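The rules described in this comment can be sketched roughly as follows. This is not xarray's implementation: `default_indexes` and `assign_index` are hypothetical helpers, the `(dims, values)` tuples stand in for real variables, and the default-index rule (a 1-D coordinate named after its dimension gets a plain `pandas.Index`) is an assumption about the part of the comment that was lost. The only point taken directly from the text is that a manually assigned index is checked for shape but not for values.

```python
import numpy as np
import pandas as pd


def default_indexes(coords):
    """Build a default pandas.Index for each 1-D coordinate named after its dimension.

    `coords` maps a coordinate name to a (dims, values) pair.
    """
    return {
        name: pd.Index(values, name=name)
        for name, (dims, values) in coords.items()
        if dims == (name,)
    }


def assign_index(indexes, coords, name, index):
    """Attach a user-supplied index to coordinate `name`, checking only its length."""
    dims, values = coords[name]
    if len(index) != len(values):
        raise ValueError(
            f"index for {name!r} has length {len(index)}, expected {len(values)}"
        )
    # The values are deliberately not compared (for performance); a mismatch
    # is the caller's responsibility.
    return {**indexes, name: index}


coords = {
    "x": (("x",), np.array([10, 20, 30])),
    "station": (("x",), np.array(["a", "b", "c"])),
}
indexes = default_indexes(coords)  # only "x" gets a default index
indexes = assign_index(
    indexes, coords, "station", pd.Index(["a", "b", "c"], name="station")
)
print(sorted(indexes))
```

Keeping the indexes in their own mapping, separate from the coordinates, is what would let a non-dimension coordinate such as `station` be indexed explicitly.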
Thanks for the details. I am wondering what the advantageous cases which are realized with this
Are they correct?
That sounds reasonable.
I like the latter one, as it is easier to understand even for non-pandas users. What does the actual implementation look like? |
The other advantage is that it solves many of the issues with the current
I agree, but there are probably some advantages to using a MultiIndex internally. For example, it allows for looking up on multiple levels at the same time.
I think we could get away with making For KDTree, this means we'll have to write our own wrapper |
Just to say I'm interested in how MultiIndexes are handled also. In our use case, we have two variables conventionally named CHROM (chromosome) and POS (position) which together describe a location in a genome. I want to combine both variables into a multi-index so I can, e.g., select all data from some data variable for chromosome X between positions 100,000-200,000. For all our data variables, this genome location multi-index would be used to index the first dimension. |
Will the new API preserve the order of the levels? One of the features that's necessary for |
@jjpr-mit can you explain your use case a little more? What sort of order-dependent queries do you want to do? What comes to mind for me are range-based queries, e.g., I think it is still relatively easy to ensure a unique ordering between levels, based on the order of coordinate variables in the xarray dataset. A bigger challenge is that for efficiency, these sorts of queries depend critically on having an actual MultiIndex. This means that if indexes for each of the levels arise from different arguments that were merged together, we might need to "merge" the separate indexes into a joint MultiIndex. This could potentially be slightly expensive. |
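To illustrate why range-based queries depend on level order and on a real MultiIndex, here is a small pandas-only sketch of the CHROM/POS use case mentioned earlier in the thread. The data values and the name `calls` are made up for illustration; the example shows pandas behavior, not a proposed xarray API.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [
        ["1", "1", "X", "X", "X", "X"],
        [50_000, 150_000, 90_000, 120_000, 180_000, 250_000],
    ],
    names=["CHROM", "POS"],
)
calls = pd.Series(range(6), index=index)

# Tuple-bounded slices rely on the index being lexicographically sorted,
# with CHROM as the outer level and POS as the inner level.
assert calls.index.is_monotonic_increasing

# Everything on chromosome X between positions 100,000 and 200,000.
region = calls.loc[("X", 100_000):("X", 200_000)]
print(region.tolist())
```

If the two levels were swapped, the same tuple bounds would describe a different range, which is why preserving the level order matters for this kind of query.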
Many array types do have implicit indices. Going one step further, one could have continuous dimensions where positional indexing ( => Having explicit and implicit indices on arrays would be awesome, even if they don't support all xarray features! |
Well, maybe we can consider coordinates in a more generic way. Let us define a coordinate as an array in a dataset that gets co-indexed when we index its dataset. This means that:
Use dims to determine how the other arrays of the dataset will be co-indexed.
Some compatibility issues:
|
Hi @weipeng1999, I'm not sure I fully understand your suggestion. Would you mind sharing some illustrative examples? It is useful to have two distinct It also helps to have a clear separation between the Currently in Xarray the It looks like what you suggest is some kind of implicit (co-)indexes hidden behind any dataset variable(s)? We actually took the opposite direction, trying to make everything explicit. |
To try to explain my idea, I made a PPT. |
Thanks for the detailed description @weipeng1999. For the first 4 slides I don't see how this is different from how does |
Thank you for pointing out what I got wrong. It is hard to explain the idea because it is a bit complicated. The last two pictures were wrong and caused a misunderstanding; here are two images that explain what I actually mean: |
Sorry, but this is confusing. To me it still looks like you want implicit broadcasting of the |
For such a case you could already do After the explicit index refactor, we could imagine a custom index that supports multi-dimensional coordinates such that you would only need to do something like
>>> S_res = S4.sel(C2=("z", ["a", "e", "h"]))
>>> S_res
<xarray.Dataset>
Dimensions: (z: 3)
Coordinates:
* C2 (z) <U1 'a' 'e' 'h'
Data variables:
A1 (z) float64 4 3 3
or without explicitly providing the name of the packed dimension:
>>> S_res = S4.sel(C2=["a", "e", "h"])
>>> S_res
<xarray.Dataset>
Dimensions: (C2: 3)
Coordinates:
* C2 (C2) <U1 'a' 'e' 'h'
Data variables:
A1 (C2) float64 4 3 3 |
Well, both "keep the original dims" and "generate a new one" have their benefits.
|
So I think keeping the original dims may break less existing code. |
Agreed, and both are supported by xarray actually. In case we want to keep the original dimensions like ("x", "y") in the example above, it's better to use masking. This discussion is broader than the topic covered in this issue so I'd suggest you start a new discussion if you want to further discuss this with the xarray community. Thanks. |
Should we close this issue and continue the discussion in #6293? For anyone who wants to track the progress on this topic: https://github.com/pydata/xarray/projects/1 |
I think we can continue the discussion we have in #1426 about MultiIndex here.
In a comment, @shoyer recommended removing MultiIndex from the public API.
I agree with this, as long as my code still works with this improvement.
I think if we could have a list of possible MultiIndex use cases here, it would be easier to discuss them in depth and arrive at a consensus on the future API.
Current limitations of MultiIndex are