Explicit indexes in xarray's data-model (Future of MultiIndex) #1603

Closed
fujiisoup opened this issue Oct 4, 2017 · 68 comments

@fujiisoup
Member

fujiisoup commented Oct 4, 2017

I think we can continue here the discussion about MultiIndex that we started in #1426.

In a comment there, @shoyer recommended removing MultiIndex from the public API.

I agree with this, as long as my code keeps working after the change.

I think if we had a list of possible MultiIndex use cases here,
it would be easier to discuss in depth and arrive at a consensus on the future API.

Current limitations of MultiIndex are

@fujiisoup
Member Author

I'm using MultiIndex a lot,
but I noticed that it is just a workaround for indexing along multiple kinds of coordinates.

Consider the following example,

In [1]: import numpy as np
   ...: import xarray as xr
   ...: da = xr.DataArray(np.arange(5), dims=['x'],
   ...:                   coords={'experiment': ('x', [0, 0, 0, 1, 1]),
   ...:                           'time': ('x', [0.0, 0.1, 0.2, 0.0, 0.15])})
   ...: 

In [2]: da
Out[2]: 
<xarray.DataArray (x: 5)>
array([0, 1, 2, 3, 4])
Coordinates:
    experiment  (x) int64 0 0 0 1 1 
    time        (x) float64 0.0 0.1 0.2 0.0 0.15
Dimensions without coordinates: x

I want to do something like this

da.sel(experiment=0).sel(time=0.1)

but this is not possible.
A MultiIndex enables it:

In [2]: da = da.set_index(exp_time=['experiment', 'time'])
   ...: da
   ...: 
Out[2]: 
<xarray.DataArray (x: 5)>
array([0, 1, 2, 3, 4])
Coordinates:
  * exp_time    (exp_time) MultiIndex
  - experiment  (exp_time) int64 0 0 0 1 1 
  - time        (exp_time) float64 0.0 0.1 0.2 0.0 0.15
Dimensions without coordinates: x

If we could make a selection from a non-index coordinate,
a MultiIndex would not be necessary for this case.

I think there are other important use cases for MultiIndex.
I would be happy if anyone could list them in this issue.
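
For reference, here is a minimal sketch of both routes on the same data. It assumes an xarray version in which set_index is given the existing dimension name x (rather than the new name exp_time used above), and the variable names by_mask and by_index are only illustrative.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(5),
    dims=["x"],
    coords={
        "experiment": ("x", [0, 0, 0, 1, 1]),
        "time": ("x", [0.0, 0.1, 0.2, 0.0, 0.15]),
    },
)

# Route 1: no index at all, just a boolean mask over the non-index coordinates.
mask = (da["experiment"] == 0) & (da["time"] == 0.1)
by_mask = da.isel(x=np.flatnonzero(mask.values))

# Route 2: build a MultiIndex along x and select on both levels at once.
indexed = da.set_index(x=["experiment", "time"])
by_index = indexed.sel(experiment=0, time=0.1)
```

Route 1 scans every element on each query; route 2 pays once for building the index.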

@shoyer
Member

shoyer commented Oct 4, 2017

One API design challenge here is that I think we still want an explicit notion of "indexed" variables. We could possibly allow operations like .sel() on non-indexed variables, but they would be slower, because we would not want to create expensive hash tables (i.e., pandas.Index) in a non-transparent fashion.
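
The trade-off can be illustrated outside xarray. This is only a rough sketch with made-up labels: a lookup without an index compares against every element, while a pandas.Index builds its lookup structure once and reuses it.

```python
import numpy as np
import pandas as pd

labels = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

# Non-indexed lookup: compare against every element, O(n) per query.
pos_scan = int(np.flatnonzero(labels == 0.2)[0])

# Indexed lookup: pandas builds its lookup structure lazily on first use
# and reuses it for later queries.
index = pd.Index(labels)
pos_hash = index.get_loc(0.2)
```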

@shoyer
Member

shoyer commented Oct 4, 2017

I sometimes find it helpful to think about what the right repr() looks like, and then work backwards from there to the right data model.

For example, we might imagine that "Indexes" are no longer coordinates, but instead their own entry in the repr:

<xarray.Dataset (exp_time: 5)>
Coordinates:
  * experiment  (exp_time) int64 0 0 0 1 1 
  * time        (exp_time) float64 0.0 0.1 0.2 0.0 0.15
Indexes:
    exp_time: pandas.MultiIndex[experiment, time]

"Indexes" might not even need to be part of the main Dataset.__repr__, but it would certainly be the repr for Dataset.indexes. Other entries could include:

    time: pandas.DatetimeIndex[time]
    space: scipy.spatial.KDTree[latitude, longitude]

In this model:

  1. We would promote "Indexes" to a first-class concept in the xarray data model:
    (a) The levels of a MultiIndex would have corresponding Variable objects and be found in coords.
    (b) In contrast, the MultiIndex would not have a corresponding Variable object or be part of coords, though it could still be returned upon __getitem__ access (computed on demand from .indexes).
    (c) Dataset and DataArray would gain an indexes argument in their constructors, which could be used for passing indexes on to new xarray objects.
  2. Coordinates marked with * are part of an index. They can't be modified unless all corresponding indexes are removed.
  3. Indexes would still be propagated, like coordinates.
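
To make the proposed model concrete, here is a toy sketch. ToyDataset and its methods are hypothetical and are not xarray's implementation; the sketch only shows level variables living in coords while the index object lives in a separate indexes mapping.

```python
from dataclasses import dataclass, field

import numpy as np
import pandas as pd


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a Dataset; not xarray's actual implementation."""

    coords: dict  # coordinate name -> 1-D numpy array (the "Variable" data)
    indexes: dict = field(default_factory=dict)  # index name -> pandas.Index

    def set_index(self, name, levels):
        # Levels stay in coords; only the index object is stored in indexes.
        arrays = [self.coords[level] for level in levels]
        self.indexes[name] = pd.MultiIndex.from_arrays(arrays, names=levels)

    def indexed_coords(self):
        # Coordinates that would be marked with * in the repr.
        return sorted({n for idx in self.indexes.values() for n in idx.names})


ds = ToyDataset(
    coords={
        "experiment": np.array([0, 0, 0, 1, 1]),
        "time": np.array([0.0, 0.1, 0.2, 0.0, 0.15]),
    }
)
ds.set_index("exp_time", ["experiment", "time"])
```

Here exp_time appears only in ds.indexes, never in ds.coords, which mirrors point (b) above.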

@fujiisoup
Member Author

I think we currently assume variables[dim] is an Index.
Does your proposal mean that Dataset will keep an additional attribute indexes, and indexes[dim] gives a pd.Index (or pd.MultiIndex, KDTree)?

It sounds like a much cleaner data model.

@shoyer
Member

shoyer commented Oct 4, 2017

Does your proposal mean that Dataset will keep an additional attribute indexes, and indexes[dim] gives a pd.Index (or pd.MultiIndex, KDTree)?

Yes, exactly. We actually already have an attribute that works like this, but it's currently computed lazily, from either Dataset._variables or DataArray._coords.
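
As a small illustration of the existing attribute (a sketch with made-up coordinate values), the public indexes property already maps a dimension name to a pandas.Index:

```python
import pandas as pd
import xarray as xr

ds = xr.Dataset(coords={"time": [0.0, 0.1, 0.2]})

# A 1-D coordinate whose name matches its dimension gets a default index.
time_index = ds.indexes["time"]
```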

@shoyer
Member

shoyer commented Oct 4, 2017

CC @benbovy @fmaussion

@benbovy
Member

benbovy commented Oct 4, 2017

I think that promoting "Indexes" to a first-class concept is indeed a very good idea, at both internal and public levels, even if at the latter level it would be another concept for users (it should be already familiar for pandas users, though). IMHO the "coordinate" and "index" concepts are different enough to consider them separately.

I like the proposed repr for Dataset.indexes. I wouldn't mind if it is not included in Dataset.__repr__, considering that multi-indexes, kdtree, etc. only represent a few use cases. In too many cases it could result in a long, uninformative list of simple pandas.Index.

I have to think a bit more about the details but I like the idea.

@fujiisoup
Member Author

fujiisoup commented Oct 4, 2017

@shoyer, could you add more details of this idea?
I think I do not yet fully understand the practical difference between dim and index.

  1. Use cases of independent Index and dims
    Would it be the general case that dimension and index are independent?
    (Or is that the case only for MultiIndex and KDTree?)

  2. MultiIndex implementation
    In the MultiIndex case, will an xarray object store a MultiIndex object and also the level variables as Variable objects (there will be some duplication)?
    If indexes[dim] returns multiple Variables, which realizes a MultiIndex-like structure without pd.MultiIndex, indexes would be very different from dim,
    because a single dimension can have multiple indexes.

@shoyer
Member

shoyer commented Oct 4, 2017

  1. Use cases of independent Index and dims
    Would it be the general case that dimension and index are independent?
    (Or is that the case only for MultiIndex and KDTree?)

We would still assign default indexes (using a normal pandas.Index) when you assign a 1D coordinate with matching name and dimension. But in general, yes, it seems like you should be able to make an index even for variables that aren't dimensions, including for a 1D variable whose name does not match a dimension. The rule would be that any coordinates can be part of an index.

Another aspect to consider is how to handle alignment when you have indexes along non-dimension coordinates. Probably the most elegant rule would again be to check all indexed variables for exact matches.

Directly assigning indexes rather than using this default or set_index() would be an advanced feature, not recommended for everyday use. The main use case is routines which create a new xarray object based on an existing one, and want to re-use old indexes.

For performance reasons, we probably do not want to actually check the values of manually assigned indexes, although we should verify that the shape matches. (We would have a clear disclaimer that if you manually assign an index with mismatched values the behavior is not well defined.)

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space]. We would need to figure out how to propagate and compare indexes like this. (I suppose if the coordinate values match, the result could have the union of all indexes from input arguments.)

  2. MultiIndex implementation
    In the MultiIndex case, will an xarray object store a MultiIndex object and also the level variables as Variable objects (there will be some duplication)?

Yes, this is a little unfortunate. We could potentially make a custom wrapper for use in IndexVariable._data on the level variables that lazily computes values from the MultiIndex (similar to our LazilyIndexedArray class), but I'm not certain yet that this is necessary.

If indexes[dim] returns multiple Variables, which realizes a MultiIndex-like structure without pd.MultiIndex, indexes would be very different from dim,
because a single dimension can have multiple indexes.

Every entry in indexes should be a single pandas.Index or subclass, including MultiIndex (possibly eventually allowing for index-like objects such as something based on a KDTree).

@fujiisoup
Member Author

Thanks for the details.
(Sorry for my late response. It took me a while to understand what this would look like.)

I am wondering which use cases benefit from this Index concept.
If my understanding is correct,

  1. It will enable more flexible indexing, e.g. more than one Index can be associated with one dimension, and we can select from these coordinate values very flexibly.
  2. It will naturally integrate more advanced indexes such as KDTree.

Is that correct?

Probably the most elegant rule would again be to check all indexed variables for exact matches.

That sounds reasonable.

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space].

I like the latter one, as it is easier to understand even for non-pandas users.

What does the actual implementation look like?
Will xr.Dataset.indexes be an OrderedDict that maps from a variable's name to its associated dimension?
Will the actual Index instance be one of xr.Dataset.variables?

@shoyer
Member

shoyer commented Oct 13, 2017

I am wondering which use cases benefit from this Index concept.

The other advantage is that it solves many of the issues with the current MultiIndex implementation. Making MultiIndex levels their own variables considerably simplifies the data model, and means that many features (including serialization) should "just work".

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space].
I like the latter one, as it is easier to understand even for non-pandas users.

I agree, but there are probably some advantages to using a MultiIndex internally. For example, it allows for looking up on multiple levels at the same time.
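
A rough sketch of that difference in plain pandas, reusing the experiment/time values from earlier in the thread: with a joint MultiIndex one call resolves both levels, whereas separate indexes need one lookup per coordinate plus an intersection.

```python
import numpy as np
import pandas as pd

experiment = np.array([0, 0, 0, 1, 1])
time = np.array([0.0, 0.1, 0.2, 0.0, 0.15])

# Joint index: one lookup resolves both levels together.
joint = pd.MultiIndex.from_arrays([experiment, time], names=["experiment", "time"])
pos_joint = joint.get_loc((1, 0.15))

# Separate indexes: look up each coordinate, then intersect the positions.
pos_exp = pd.Index(experiment).get_indexer_for([1])
pos_time = pd.Index(time).get_indexer_for([0.15])
pos_separate = np.intersect1d(pos_exp, pos_time)
```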

What does the actual implementation look like?
Will xr.Dataset.indexes be an OrderedDict that maps from a variable's name to its associated dimension?
Will the actual Index instance be one of xr.Dataset.variables?

I think we could get away with making xr.Dataset.indexes simply a dict, with keys given by index names and values given by a pandas.Index instance. We should enforce that Index.name or MultiIndex.names corresponds to coordinate variables.

For KDTree, this means we'll have to write our own wrapper KDTreeIndex that adds a names property, but we would probably need to add special methods like get_indexer anyways.
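
What such a wrapper might look like, as a sketch only: NearestPointIndex is a hypothetical class, and a brute-force distance search stands in for a real KD-tree (such as scipy.spatial.KDTree) to keep the example dependency-light. The point is the interface: a names property plus a get_indexer method.

```python
import numpy as np


class NearestPointIndex:
    """Hypothetical index-like wrapper; not an existing xarray or pandas class.

    A brute-force distance search stands in for a real KD-tree here.
    """

    def __init__(self, points, names):
        self._points = np.asarray(points, dtype=float)  # shape (n, k)
        self.names = list(names)  # one coordinate name per column

    def get_indexer(self, targets):
        # Position of the nearest stored point for each target row.
        targets = np.atleast_2d(np.asarray(targets, dtype=float))
        diff = self._points[None, :, :] - targets[:, None, :]
        return np.argmin((diff ** 2).sum(axis=-1), axis=1)


space = NearestPointIndex(
    [[10.0, 20.0], [11.0, 21.0], [12.0, 22.0]],
    names=["latitude", "longitude"],
)
nearest = space.get_indexer([[10.9, 21.2], [12.4, 21.8]])
```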

@alimanfoo
Contributor

Just to say I'm interested in how MultiIndexes are handled also. In our use case, we have two variables conventionally named CHROM (chromosome) and POS (position) which together describe a location in a genome. I want to combine both variables into a multi-index so I can, e.g., select all data from some data variable for chromosome X between positions 100,000-200,000. For all our data variables, this genome location multi-index would be used to index the first dimension.
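
A minimal sketch of this kind of query in plain pandas, with made-up positions and values (the name calls is only illustrative):

```python
import pandas as pd

chrom = ["1", "1", "X", "X", "X", "X"]
pos = [50_000, 150_000, 50_000, 100_000, 150_000, 250_000]
calls = pd.Series(
    range(6),
    index=pd.MultiIndex.from_arrays([chrom, pos], names=["CHROM", "POS"]),
)

# Pick the chromosome level first, then slice the remaining POS level by label.
# Label slices are inclusive at both ends.
region = calls.loc["X"].loc[100_000:200_000]
```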

@jjpr-mit

Will the new API preserve the order of the levels? One of the features that's necessary for MultiIndex to be truly hierarchical is that there is a defined order to the levels.

@shoyer
Member

shoyer commented Oct 27, 2017

@jjpr-mit can you explain your use case a little more? What sort of order-dependent queries do you want to do? The ones that come to mind for me are range-based queries, e.g., [('bar', 1) : ('foo', 9)].

I think it is still relatively easy to ensure a unique ordering between levels, based on the order of coordinate variables in the xarray dataset.

A bigger challenge is that for efficiency, these sorts of queries depend critically on having an actual MultiIndex. This means that if indexes for each of the levels arise from different arguments that were merged together, we might need to "merge" the separate indexes into a joint MultiIndex. This could potentially be slightly expensive.
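
A sketch of that "merge" step in plain pandas, with made-up labels: two per-level indexes are combined into a joint MultiIndex, after which a range query over tuples is possible.

```python
import pandas as pd

# Two per-level indexes, as if they arrived from separately merged arguments.
first = pd.Index(["bar", "bar", "baz", "foo", "foo"], name="first")
second = pd.Index([1, 2, 5, 3, 9], name="second")

# Building the joint index is the extra work; level order follows the order
# in which the arrays are passed.
joint = pd.MultiIndex.from_arrays([first, second])
values = pd.Series(range(5), index=joint)

# Range query over the lexicographically sorted tuples (both ends inclusive).
subset = values.loc[("bar", 2):("foo", 3)]
```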

@shoyer shoyer changed the title Future of MultiIndex Indexes as an explicit part of xarray's data-model (Future of MultiIndex) Jan 5, 2018
@shoyer shoyer changed the title Indexes as an explicit part of xarray's data-model (Future of MultiIndex) Explicit indexes in xarray's data-model (Future of MultiIndex) Jan 5, 2018
@shoyer shoyer modified the milestones: 0.10.1, 1.0 Jan 31, 2018
@Hoeze

Hoeze commented Apr 19, 2021

Many array types do have implicit indices.
For example, sparse arrays do have their coordinates / CSR representation as primary index (.sel()) while dense array's primary index is the position (.isel()).
Every labeled dimension is therefore just a separate mapping of a string to the index position in the array.

Going one step further, one could have continuous dimensions where positional indexing (.isel()) does not really make sense.
Looking at TileDB's dimensions provides an example for this.

=> Having explicit and implicit indices on arrays would be awesome, even if they don't support all xarray features!

@weipeng1999

weipeng1999 commented Oct 19, 2021

Well, maybe we can consider coordinates in a more generic way.

Let us define a coordinate as an array in a data set that causes the other arrays to be co-indexed when we index the data set. That is:

  • If A1, A2, A3 are in the same data set S, indexing S[ {'A1':I} ] returns a new data set in which not only A1 is indexed, but A2 and A3 are also indexed along the dims they share with A1. I call this behavior co-indexing.

Dims determine how the other arrays of the data set are co-indexed.

  • If all dims of A1 (as the coordinate) are also in A2 (as a regular, co-indexed array), the behavior can simply follow the existing behavior: index along the shared dims and keep the others.
  • If A1 has a dim that is not in A2, we should broadcast A2 along that dim, because other operations already treat a missing dim as broadcastable, so co-indexing should follow that.

Some compatibility issues:

  • We may need a new type like DataArray that only has dims instead of both dims and coordinates.
  • This only defines how Dataset deals with indexing; DataArray is probably similar.

@benbovy
Member

benbovy commented Oct 19, 2021

Hi @weipeng1999,

I'm not sure I fully understand your suggestion. Would you mind sharing some illustrative examples?

It is useful to have two distinct coordinate variable vs data variable concepts. Although both are data arrays, the former is used to locate data in the dimensional space(s) defined by all dimensions in the dataset while the latter is used to store field data.

It also helps to have a clear separation between the coordinate variable and index concepts. An index is a specific data structure or object that allows efficient data extraction or alignment based on one or more coordinate labels. Sometimes an index object may be handled like a data array (like pandas indexes) but this is not always the case (e.g., a KD-Tree).

Currently in Xarray the index concept is hidden behind "dimension" coordinate variables. The goal of the explicit index refactor is to bring it to the light and make it available to any coordinate (and also open it to custom index structures, not only pandas indexes).

It looks like what you suggest is some kind of implicit (co-)indexes hidden behind any dataset variable(s)? We actually took the opposite direction, trying to make everything explicit.

@weipeng1999

To explain my idea, I made some slides.

[six slide images]

@benbovy
Member

benbovy commented Oct 22, 2021

Thanks for the detailed description @weipeng1999. For the first 4 slides I don't see how this differs from what S_res = S1.sel(C1=['a', 'b']) and S_res = S2.sel(C1=['a', 'b']) currently do. And for the last 2 slides, I don't think that we always want such implicit broadcasting for dimensions that are not involved in the indexed coordinates.

@weipeng1999

Thank you for pointing out my mistakes. The idea is hard to explain because it is a bit complicated. The last two pictures were wrong and caused a misunderstanding; here are two images that explain what I actually mean:
[two images]

@benbovy
Member

benbovy commented Oct 22, 2021

Sorry, but this is confusing. To me it still looks like you want implicit broadcasting of the A3 variable along the x dimension. In your last comment you depict A3 inconsistently with a 2-d shape but with only the t dimension. I'm also not sure how your suggestion relates to the issue here.

@weipeng1999

weipeng1999 commented Oct 22, 2021

Well, here are my ideas on how to define coordinates with multiple dims. (Because of a GitHub bug, the characters in the first image are white; I cannot fix it.)
[four images]

@benbovy
Member

benbovy commented Oct 22, 2021

For such a case you could already do ds.stack(z=("t", "x")).set_index(z="C2").sel(z=["a", "e", "h"]).

After the explicit index refactor, we could imagine a custom index that supports multi-dimension coordinates such that you would only need to do something like

>>> S_res = S4.sel(C2=("z", ["a", "e", "h"]))
>>> S_res
<xarray.Dataset>
Dimensions:  (z: 3)
Coordinates:
  * C2        (z) <U1 'a' 'e' 'h'
Data variables:
    A1        (z) float64 4 3 3

or without explicitly providing the name of the packed dimension:

>>> S_res = S4.sel(C2=["a", "e", "h"])
>>> S_res
<xarray.Dataset>
Dimensions:  (C2: 3)
Coordinates:
  * C2        (C2) <U1 'a' 'e' 'h'
Data variables:
    A1        (C2) float64 4 3 3

@weipeng1999

Well, both "keep the original dims" and "generate a new one" have their benefits.
If we keep the original dims, we can ensure that:

  • there is less difference between 1-D coordinates and multi-dimensional ones: both S1.sel(C1=["a", "e", "h"]) and S4.sel(C2=["a", "e", "h"]) work and return a new data set with the original dims (that's why I strongly recommend against the implicit one)
  • the returned data set has the original dims, which means that if you change C1 to C2, later code such as S_res.sel(x=[1,2,3]) still works.

@weipeng1999

So I think keeping the original dims would break less existing code.

@benbovy
Member

benbovy commented Oct 22, 2021

Well, both "keep the original dims" and "generate a new one" have their benefits.

Agreed, and both are supported by xarray actually. In case we want to keep the original dimensions like ("x", "y") in the example above, it's better to use masking.
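
A minimal sketch of the masking route, assuming a small made-up dataset with dims t and x and a 2-D coordinate C2 (the values are illustrative, not the ones from the slides):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"A1": (("t", "x"), np.arange(6.0).reshape(2, 3))},
    coords={"C2": (("t", "x"), [["a", "b", "c"], ["d", "e", "f"]])},
)

# Keep dims t and x; entries whose C2 label is not selected become NaN.
masked = ds.where(ds["C2"].isin(["a", "e"]))
```

The result has the same shape as the input, so later code that selects along t or x keeps working.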

This discussion is broader than the topic covered in this issue so I'd suggest you start a new discussion if you want to further discuss this with the xarray community. Thanks.

@benbovy
Member

benbovy commented Sep 27, 2022

Should we close this issue and continue the discussion in #6293?

For anyone who wants to track the progress on this topic: https://github.com/pydata/xarray/projects/1

@benbovy benbovy closed this as completed Sep 28, 2022