Many methods are broken (e.g., concat/stack/sortby) when using repeated dimensions #1378

fsteinmetz · 2017-04-19T09:32:54Z

Concatenating DataArrays with repeated dimensions does not work.

import xarray as xr  #xarray 0.8.2
from numpy import eye
A = xr.DataArray(eye(3), dims=['dim0', 'dim0'])
xr.concat([A, A], 'newdim')

fails with

[...]
ValueError: axes don't match array

The text was updated successfully, but these errors were encountered:

fmaussion · 2017-04-19T09:46:37Z

Yes, also happening on latest master.

I suspect there are several other things which won't work properly (or at least unexpectedly) when having repeated dims...

shoyer · 2017-04-19T20:39:33Z

Indeed, we don't have very good test coverage for operations with repeated dimensions. Fixes would certainly be appreciated, though they might be somewhat tricky. Even failing loudly with ValueError: repeated dimensions not yet supported would be an improvement over the current state.

fsteinmetz · 2017-04-20T06:11:02Z

Right, also positional indexing works unexpectedly in this case, though I understand it's tricky and should probably be discouraged:

A[0,:]  # returns A
A[:,0]  # returns A.isel(dim0=0)

fmaussion · 2017-04-20T07:47:18Z

I guess it would be good to document the expected behaviour with repeated dims somewhere? I.e. what should happen when doing:

a = xr.DataArray(eye(3), dims=['dim0', 'dim0'])
a.mean(dim='dim0')

?

fsteinmetz · 2017-04-20T09:50:19Z

I cannot see a use case in which repeated dims actually make sense.

In my case this situation originates from h5 files which indeed contains repeated dimensions (variables(dimensions): uint16 B0(phony_dim_0,phony_dim_0), ..., uint8 VAA(phony_dim_1,phony_dim_1)), thus xarray is not to blame here.
These are "dummy" dimensions, not associated with physical values. What we do to circumvent this problem is "re-dimension" all variables.
Maybe a safe approach would be for open_dataset to raise a warning by default when encountering such variables, with possibly an option to perform automatic or custom dimension naming to avoid repeated dims.
I also agree with @shoyer that failing loudly when operating on such DataArrays instead of providing confusing results would be an improvement.

fmaussion · 2017-04-20T12:11:18Z

In my case this situation originates from h5 files which indeed contains repeated dimensions

Yes this happened to me too. First thing I did is converting the files to proper netcdf datasets...

shoyer · 2017-04-20T15:59:40Z

I cannot see a use case in which repeated dims actually make sense.

Agreed. I would have disallowed them entirely, but sometimes it's useful to allow loading variables with duplicate dimensions, even if the only valid operation you can do is de-duplicate them.

Every routine that looks up dimensions by name should go through the get_axis_num method. That would be a good place to add a check for uniqueness.

gerritholl · 2018-02-20T22:46:27Z

I cannot see a use case in which repeated dims actually make sense.

I use repeated dimensions to store a covariance matrix. The data variable containing the covariance matrix has 4 dimensions, of which the last 2 are repeated. For example, I have a data variable with dimensions (channel, scanline, element, element), storing an element-element covariance matrix for every scanline in satellite data.

This is valid NetCDF and should be valid in xarray. It would be a significant problem for me if they became disallowed.

jhamman · 2018-02-20T23:03:54Z

@gerritholl - rereading this issue, I don't think we're particularly opposed to supporting duplicate dimensions. We do know there are things that don't work right now and that we don't have test coverage for operations that use duplicate dimensions.

This is marked as a help wanted issue and I suspect that if someone like yourself, who has a use case for this functionality, were to want to work on this issue, we'd be happy to see it move forward.

gerritholl · 2018-02-20T23:10:13Z

@jhamman Ok, good to hear it's not slated to be removed. I would love to work on this, I wish I had the time! I'll keep it in mind if I do find some spare time.

gerritholl · 2018-03-26T09:38:00Z

This also affects the stack method.

Hoeze · 2018-07-18T14:39:04Z

Annotating distance matrices with xarray is not possible as well due to the duplicate dimension.

gimperiale · 2019-09-06T16:22:12Z

I'm not too fond of having multiple dimensions with the same name because, whenever you need to operate on one but not the other, you have little to no choice but revert to positional indexing.

Consider also how many methods expect either **kwargs or a dict-like parameter with the dimension or variable names as the keys. I would not be surprised to find that many API design choices fall apart in the face of this use case.

Also, having two non positional (as it should always be in xarray!) dimensions with the same name only makes sense when modelling symmetric N:N relationships. Two good examples are covariance matrices and the weights for a Dijkstra algorithm.

The problems start when the object represents an asymmetric relationship, e.g:

Cost (for the purpose of graph resolution, so time/money/other) of transportation via river, where going from A->B (downstream) is cheaper than going back from B->A (upstream)
Currency conversion, where EUR->USD is not identical to 1/(USD->EUR) because of arbitrage and illiquidity
In financial Monte Carlo simulations, I had to deal with credit rating transition matrices which define the probability of a company to change its credit rating. In unfavourable market conditions, the chances of being downgraded from AAA to AA are higher than being promoted from AA to AAA.

I could easily come up with many other cases.
In case of asymmetric N:N relationships, it is highly desirable to share the same index across multiple dimensions with different names (that would typically convey the direction of the relationship, e.g. "from" and "to").

What if, instead of allowing for duplicate dimensions, we allowed sharing an index across different dimensions?

Something like

river_transport = Dataset(
    coords={
        'station': ['Kingston', 'Montreal'],
        'station_from': ('station', )
        'station_to': ('station', )
    },
    data_vars={
        cost=(('station_from', 'station_to'), [[0, 20], [15, 0]]),
    }
}

or, for DataArrays:

river_transport = DataArray(
    [[0, 20], [15, 0]],
    dims=('station_from', 'station_to'),
    coords={
        'station': ['Kingston', 'Montreal'],
        'station_from': ('station', )
        'station_to': ('station', )
    },
}

Note how this syntax doesn't exist as of today:

        'station_from': ('station', )
        'station_to': ('station', )

From an implementation point of view, I think it could be easily implemented by keeping track of a map of aliases and with some __geitem__ magic. More effort would be needed to convince DataArrays to accept (and not accidentally drop) a coordinate whose dims don't match any of the data variable's.

This design would not resolve the issue of compatibility with NetCDF though.
I'd be surprised if the NetCDF designers never came across this - maybe it's a good idea to have a chat with them?

According to xarray issues: pydata/xarray#3286 pydata/xarray#1378 The open_mfdataset function has problems in creating a merged dataset from multiple files in which variables have repeated dimension names. The easiest thing to do in this case is to prevent such variables from being read in. We now have added the drop_variables keyword to avoid reading in the "anchor" variable in all calls to open_dataset and open_mfdataset in both benchmark.py and core.py. This variable is only present in GCHP-created netCDF files using MAPL v1.0.0, which is in GCHP 12.5.0 and later. This commit should resolve GCPy issue #26: #26 Signed-off-by: Bob Yantosca <[email protected]>

mangecoeur · 2020-03-23T19:02:26Z

Just wondering what the status of this is. I've been running into bugs trying to model symmetric distance matrices using the same dimension. Interestingly, it does work very well for selecting, e.g. if use .sel(nodes=node_list) on a square matrix i correctly get a square matrix subset 👍 But unfortunately a lot of other things seems to break, e.g. concatenating fails with
ValueError: axes don't match array :( What would need to happen to make this work?

benbovy · 2023-08-29T10:49:40Z

Alternatively to @gimperiale's suggestion of coordinate aliases, I wonder if a solution could be extending xarray.Variable and xarray.DataArray such that:

da = DataArray(
    [[0, 20], [15, 0]],
    dims=(('station', 'from'), ('station', 'to')),
    coords={'station': ['Kingston', 'Montreal']},
)

da.dims
# ('station', 'station')

da.dims_full
# (('station', 'from'), ('station', 'to'))

da2 = DataArray([[1, 2], [3, 4]], dims=('x', 'y'))

da2.dims
# ('x', 'y')

da2.dims_full
# ('x', 'y')

we keep the .dims property for backwards compatibility, serialization, label-selection,... and also because semantically it makes sense in the example above that ('station', 'from') and ('station', 'to') are the same dimension
we can easily and gradually switch to using the .dims_full property within problematic methods like concat, stack, etc.

Repr could look like:

da
# <xarray.DataArray (station[from]: 2, station[to]: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station  (station) <U8 'Kingston' 'Montreal'

da.to_dataset(name='river_transport')
# <xarray.Dataset>
# Dimensions:          (station: 2)
# Coordinates:
#   * station          (station) <U8 'Kingston' 'Montreal'
# Data variables:
#     river_transport  (station[from], station[to]) int64 0 20 15 0

benbovy · 2023-08-29T13:36:07Z

Some more food for thoughts.

Instead of extending the Xarray data model we could now leverage Xarray flexible indexes and provide some utility methods like this:

da = DataArray(
    [[0, 20], [15, 0]],
    dims=('station', 'station'),
    coords={'station': ['Kingston', 'Montreal']},
)

da
# <xarray.DataArray (station: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station  (station) <U8 'Kingston' 'Montreal'

da_split = da.split_repeated_dims(station=('station_from', 'station_to'))
da_split
# <xarray.DataArray (station_from: 2, station_to: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station_from    (station_from) <U8 'Kingston' 'Montreal'
#   * station_to      (station_to) <U8 'Kingston' 'Montreal'
#   * station         (station_from, station_to) object ...
# Indexes:
#   ┌ station_from    RepeatedIndex
#   │ station_to
#   └ station

da_merged = da_split.merge_repeated_index('station')
# <xarray.DataArray (station: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station  (station) <U8 'Kingston' 'Montreal'

Where RepeatedIndex is a lightweight Xarray index wrapper that would encapsulate the index assigned to the "station" coordinate in the original DataArray such that it is reused for all coordinates "station", "station_from" and "station_to" (all sharing the same underlying data).

The coordinate da_split.station would be some sort of n-dimensional lazy coordinate holding tuple values, not really useful per se but that would still allow label-based selection for all repeated dimensions at once (a bit similar although not exactly the same as how we represent the dimension coordinate of a PandasMultiIindex):

da_split.station.data
# some custom lazy duck-array object

da_split.station.values
# [[('Kingston', 'Kingston') ('Kingston', 'Montreal')]
#  [('Montreal', 'Kingston') ('Montreal', 'Montreal')]]

So RepeatedIndex could support three kinds of selection (maybe more):

da_split.sel(station="Kingston")   # shorthand for station=('Kingston', 'Kingston')
# <xarray.DataArray ()>
# array(0)
# Coordinates:
#     station  <U8 'Kingston'

da_split.sel(station_from="Kingston")
# array([ 0, 20])
# <xarray.DataArray (station_to: 2)>
# Coordinates:
#   * station_to     (station_to) <U8 'Kingston' 'Montreal'
#     station_from   <U8 'Kingston'

da_split.sel(station_to="Kingston")
# array([ 0, 15])
# <xarray.DataArray (station_from: 2)>
# Coordinates:
#   * station_from    (station_from) <U8 'Kingston' 'Montreal'
#     station_to      <U8 'Kingston'

Now regarding the concat example mentioned in the top comment of this issue, there's an extra step required:

da_split = da.split_repeated_dims(station=('station_from', 'station_to'))
da_concat = xr.concat([da_split, da_split], 'newdim')
da_result = da_concat.merge_repeated_index('station')

Overall this would be a pretty nice and explicit way to solve the issue of repeated dimensions, IMHO. It is 100% compatible with the current Xarray data model, which might be preferred for such a special case.

EDIT: sorry for the multiple edits!

benbovy · 2023-08-30T07:36:37Z

Regarding the suggestion in my previous comment, perhaps it is not even worth having a RepeatedIndex and it might just be enough providing two utility methods DataArray.split_dims and DataArray.merge_dims:

da_split = da.split_dims(station=('station_from', 'station_to'))
# <xarray.DataArray (station_from: 2, station_to: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station_from    (station_from) <U8 'Kingston' 'Montreal'
#   * station_to      (station_to) <U8 'Kingston' 'Montreal'

da_merged = da_split.merge_dims(station=('station_from', 'station_to'))
# <xarray.DataArray (station: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station  (station) <U8 'Kingston' 'Montreal'

Converting between the two forms is very cheap so it can be done as many times as desired. It involves just a few operations on metadata and shallow copies of Xarray variable and index objects (+ maybe some index identity and/or equality checks for merge_dims). The same underlying index (e.g., pd.Index) is already shared by all the indexed dimension coordinates in both forms.

shoyer added bug contrib-help-wanted labels Apr 19, 2017

shoyer mentioned this issue Mar 8, 2018

einsum for xarray #1968

Merged

4 tasks

shoyer mentioned this issue Jul 19, 2018

Some simple broadcast_dim method? #2267

Closed

floogit mentioned this issue Nov 28, 2018

sortby(dim) removes dimension from an array if dim is a repeated dimension #2579

Closed

shoyer changed the title ~~concat fails when using repeated dimensions~~ Many methods are broken (e.g., concat/stack/sortby) when using repeated dimensions Nov 28, 2018

dcherian mentioned this issue Sep 6, 2019

open_mfdataset cannot open multiple netcdf files written by NASA/GEOS MAPL v1.0.0 that contain data on a cubed-sphere grid #3286

Closed

yantosca mentioned this issue Sep 6, 2019

[BUG/ISSUE] xarray open_mfdataset cannot open multiple GCHP diagnostic/restart files created with MAPL v1.0.0 geoschem/gcpy#26

Closed

jvail mentioned this issue Apr 23, 2021

Batch runs - merging output datasets fredboudon/vmango-lab#21

Closed

ivirshup mentioned this issue Nov 22, 2022

Repeated coordinates leads to unintuitive (broken?) indexing behaviour #3731

Open

dcherian mentioned this issue Nov 30, 2023

Warn on repeated dimension names during construction #8491

Merged

4 tasks

fwfichtner mentioned this issue Jan 2, 2024

Allow .chunk for datasets with duplicated dimension names, e.g. Sentinel-3 OLCI files #8579

Closed

dcherian mentioned this issue Jun 13, 2024

Support duplicate dimensions in .chunk #9099

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Many methods are broken (e.g., concat/stack/sortby) when using repeated dimensions #1378

Many methods are broken (e.g., concat/stack/sortby) when using repeated dimensions #1378

fsteinmetz commented Apr 19, 2017 •

edited

Loading

fmaussion commented Apr 19, 2017

shoyer commented Apr 19, 2017 •

edited

Loading

fsteinmetz commented Apr 20, 2017

fmaussion commented Apr 20, 2017

fsteinmetz commented Apr 20, 2017 •

edited

Loading

fmaussion commented Apr 20, 2017

shoyer commented Apr 20, 2017

gerritholl commented Feb 20, 2018

jhamman commented Feb 20, 2018

gerritholl commented Feb 20, 2018

gerritholl commented Mar 26, 2018

Hoeze commented Jul 18, 2018

gimperiale commented Sep 6, 2019

mangecoeur commented Mar 23, 2020

benbovy commented Aug 29, 2023 •

edited

Loading

benbovy commented Aug 29, 2023 •

edited

Loading

benbovy commented Aug 30, 2023

Many methods are broken (e.g., concat/stack/sortby) when using repeated dimensions #1378

Many methods are broken (e.g., concat/stack/sortby) when using repeated dimensions #1378

Comments

fsteinmetz commented Apr 19, 2017 • edited Loading

fmaussion commented Apr 19, 2017

shoyer commented Apr 19, 2017 • edited Loading

fsteinmetz commented Apr 20, 2017

fmaussion commented Apr 20, 2017

fsteinmetz commented Apr 20, 2017 • edited Loading

fmaussion commented Apr 20, 2017

shoyer commented Apr 20, 2017

gerritholl commented Feb 20, 2018

jhamman commented Feb 20, 2018

gerritholl commented Feb 20, 2018

gerritholl commented Mar 26, 2018

Hoeze commented Jul 18, 2018

gimperiale commented Sep 6, 2019

mangecoeur commented Mar 23, 2020

benbovy commented Aug 29, 2023 • edited Loading

benbovy commented Aug 29, 2023 • edited Loading

benbovy commented Aug 30, 2023

fsteinmetz commented Apr 19, 2017 •

edited

Loading

shoyer commented Apr 19, 2017 •

edited

Loading

fsteinmetz commented Apr 20, 2017 •

edited

Loading

benbovy commented Aug 29, 2023 •

edited

Loading

benbovy commented Aug 29, 2023 •

edited

Loading