Inferring concatenation order from coordinate data values #18

TomNicholas · 2024-03-09T15:47:18Z

If xr.combine_by_coords is used, then indexes must be created for dimension coordinates in order to use them to infer the correct order of concatenation. This requires loading of the data for that coordinate in order to create the index. We can do this, but I think we will then end up with a Dataset which contains some ManifestArrays but also some numpy arrays for the dimension coordinates. This isn't itself a problem - we have that information, we can load it from the files.

The tricky part is that we presumably don't want to write those actual data values back out during the serialization step, we want to only write the chunk manifest for each dimension coordinate. How can we regain access to this information, or carry it around ready for serialization at the end?

The text was updated successfully, but these errors were encountered:

TomNicholas · 2024-03-11T16:29:22Z

My understanding (/ my recollection 😅 ) of the code in xr.combine_by_coords is that it infers concatenation order from the pandas indexes backing any 1D coordinate variables. So as long as those indexes are present and correct, the datasets can be ordered. So perhaps we could just manually create those indexes when we create the dataset (which requires loading data from the files), so the indexes are eager (i.e. how they normally work) but the corresponding array data is virtualized.

This would mean we would need to both read the kerchunk references and actually read data values from the file at dataset construction time. We would want to have a function flag to opt in to this behaviour.

TomNicholas · 2024-03-27T20:46:38Z

Okay I prematurely closed this, as it was not actually as simple as it seemed.

perhaps we could just manually create those indexes when we create the dataset (which requires loading data from the files), so the indexes are eager (i.e. how they normally work) but the corresponding array data is virtualized.

This would mean we would need to both read the kerchunk references and actually read data values from the file at dataset construction time.

This is what I implemented in #52, but I didn't remember to check this part:

The tricky part is that we presumably don't want to write those actual data values back out during the serialization step, we want to only write the chunk manifest for each dimension coordinate.

Turns out that doesn't work yet - at some point within xr.combine_by_coords the coordinate variable is being recreated from the index as a numpy array, even though the original coordinate variable was a virtual ManifestArray. Hopefully this can be fixed with another change to xarray upstream similar to pydata/xarray#8872.

In the meantime I should remove that section from the docs so as not to commit false advertising! EDIT: Removed in 2c5be3f

cc @norlandrhagen

benbovy · 2024-03-28T14:52:49Z

So perhaps we could just manually create those indexes when we create the dataset (which requires loading data from the files), so the indexes are eager (i.e. how they normally work) but the corresponding array data is virtualized.

In general I wonder if it would be useful to have some custom Xarray index here. Basically a simple subclass or a very lightweight wrapper around xarray.indexes.PandasIndex where the .create_variables() method is re-implemented such that the indexed coordinate variable doesn't wrap the pandas.Index object but returns a variable still wrapping the original ManifestArray (IIUC).

we presumably don't want to write those actual data values back out during the serialization step

Would such decoupling of the pandas.Index from the indexed coordinate data solve this?

The tricky part is that we still need to refactor Xarray IndexVariable, which is still too tied to pandas.Index and currently prevents implementing lazy custom indexes (see pydata/xarray#8124). Perhaps as a temporary workaround we could use a regular Variable for the indexed coordinate instead of an IndexVariable...

dcherian · 2024-03-28T15:00:05Z

we presumably don't want to write those actual data values back out during the serialization step,

TWriting would speed things up reading from the consolidated files. Kerchunk calls this "inlining". For e.g. you wouldn't want read ~90k time steps from 90k files to construct a 90K long time coordinate

TomNicholas · 2024-03-28T15:00:46Z

a variable still wrapping the original ManifestArray (IIUC).

Would such decoupling of the pandas.Index from the indexed coordinate data solve this?

Yeah that sounds like what we need! I basically need the ManifestArray inside the coordinate variable to still exist after xarray concat etc. operations, so that I can write it to disk. So I need the coordinate data to still wrap the original duck array, without having coerced it or dropped it in favour of a numpy array created from the pd.Index.

Perhaps as a temporary workaround we could use a regular Variable for the indexed coordinate instead of an IndexVariable...

I think this is what I was hoping I could achieve with pydata/xarray#8872, but clearly I was missing some subtleties with IndexVariable vs Variable.

In general I wonder if it would be useful to have some custom Xarray index here.

Naively, I don't really understand why this use case requires a custom Xarray index. I am still happy to use PandasIndex for the index class (i.e. not custom), I just want my wrapped coordinate variable data to not be messed with.

TomNicholas · 2024-03-28T15:04:05Z

TWriting would speed things up reading from the consolidated files. Kerchunk calls this "inlining". For e.g. you wouldn't want read ~90k time steps from 90k files to construct a 90K long time coordinate

@dcherian if we have retained both the index and the ManifestArray, wouldn't the choice of whether or not to "inline" be something that can be deferred to the write step?

dcherian · 2024-03-28T15:08:49Z

IIUC yes.

benbovy · 2024-03-28T15:12:29Z

Naively, I don't really understand why this use case requires a custom Xarray index. I am still happy to use PandasIndex for the index class (i.e. not custom), I just want my wrapped coordinate variable data to not be messed with.

Xarray indexes have full control on the data wrapped by their corresponding indexed coordinates, via Index.create_variables().

Currently PandasIndex.create_variables() creates and returns a new IndexVariable backed by the pandas.Index to prevent duplicating data in memory. In most cases this is probably what we want but not here IIUC, so I'm afraid you'll need some custom index if you want to override that current behavior.

Actually, it might be enough to have a custom index as a lightweight wrapper around any xarray index (PandasIndex or other) where create_variables doesn't replace the original coordinate variable data.

TomNicholas · 2024-03-28T15:16:45Z

Okay thanks. So implementing such an index would be essentially involve subclassing PandasIndex but overriding .create_variables()?

Actually, it might be enough to have a custom index as a lightweight wrapper around any xarray index (PandasIndex or other) where create_variables doesn't replace the original coordinate variable data.

This seems like something that might be useful in general, a sort of InaccessibleDuckArrayIndex.

benbovy · 2024-03-28T15:21:22Z

Alternatively, we could add a build option to PandasIndex to opt out from the default behavior. That would be perfectly reasonable IMO. No need for any subclass or wrapper then, unless you'd like to also support custom indexes in VirtualiZarr (if that's relevant).

TomNicholas · 2024-03-28T15:24:16Z

Alternatively, we could add a build option to PandasIndex to opt out from the default behavior.

A build option as in a new kwarg to .create_variables() or something? How would I opt in to that?

unless you'd like to also support custom indexes in VirtualiZarr (if that's relevant).

I don't think it is? Currently the only reason I want these indexes is so that xr.combine_by_coords can look at them to infer the correct order of concatenation of the datasets. But there might be some future feature that would require a custom index...

benbovy · 2024-03-28T15:35:34Z

A build option as in a new kwarg to .create_variables() or something?

Rather a new argument to PandasIndex.__init__(..., wrap_index_data=True) such that we could opt out with something like ds.set_xindex("x", PandasIndex, wrap_index_data=False).

Currently the only reason I want these indexes is so that xr.combine_by_coords can look at them to infer the correct order of concatenation of the datasets. But there might be some future feature that would require a custom index...

Yeah I don't know. Maybe to eventually allow users manually setting custom indexes and doing some domain-specific operations before "saving" the resulting virtual dataset?

TomNicholas · 2024-03-28T15:41:25Z

Rather a new argument to PandasIndex.__init__(..., wrap_index_data=True) such that we could opt out with something like ds.set_xindex("x", PandasIndex, wrap_index_data=False).

Both this and the subclassing-PandasIndex approach seem like good general solutions.

Yeah I don't know. Maybe to eventually allow users manually setting custom indexes and doing some domain-specific operations before "saving" the resulting virtual dataset?

Yeah quite possibly. One thing at a time though! I appreciate your input here.

benbovy · 2024-03-28T15:42:46Z

One thing at a time though!

Yes sure! :-)

TomNicholas · 2024-03-29T14:26:37Z

@benbovy do we actually need a subclass / build option at all? Can't we just change the implementation of PandasIndex to say "If the wrapped variable contains a numpy array, avoid duplicating data (i.e. wrap_index_data=True). But if it contains any other duck array, return that instead unaltered (i.e. wrap_index_data=False)." So basically make the currently implementation a memory optimization that only gets applied if the PandasIndex sees that the variable is wrapping a numpy array. Are there any cases where that approach would fail?

benbovy · 2024-03-29T14:53:56Z

Are there any cases where that approach would fail?

I guess in theory such memory optimization might make sense for other, in-memory duck arrays? In practice I don't know much.

Exposing some control to advanced users for this behavior a-priori doesn't seem like a bad idea to me.

TomNicholas · 2024-07-01T15:45:42Z

I realized that xr.combine_by_coords should actually already work fine - if you are willing to load the relevant coordinates into memory (and therefore also have those values saved into the resultant references on-disk).

In [1]: from virtualizarr import open_virtual_dataset

In [2]: import xarray as xr

In [3]: ds1 = open_virtual_dataset('air1.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'])

In [4]: ds2 = open_virtual_dataset('air2.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'])

In [5]: xr.combine_by_coords([ds2, ds1], coords='minimal', compat='override')
Out[5]: 
<xarray.Dataset> Size: 8MB
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) float32 12kB 1.867e+06 1.867e+06 ... 1.885e+06 1.885e+06
Data variables:
    air      (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), d...
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

the only issue with this result is that somehow the dtype of time has been changed from datetime64[ns] to float32.

This approach doesn't solve the original issue, but it might also be fine in a lot of cases. xr.combine_by_coords can only auto-order along dimensions that have a 1D coordinate, and 1D variables are small, so if they are split across many files its likely that you wanted to include them in loadable_variables anyway.

* add warning when user passes indexes=None * expect warnings in tests * new test to show how #18 doesn't fully work yet * release notes * lint

TomNicholas mentioned this issue Mar 10, 2024

Initial release checklist #2

Closed

15 tasks

TomNicholas changed the title ~~Virtualizing coordinates that were used to infer concat order~~ Inferring concatenation order from coordinate data values Mar 11, 2024

TomNicholas mentioned this issue Mar 27, 2024

open_virtual_dataset with and without indexes #52

Merged

TomNicholas closed this as completed in #52 Mar 27, 2024

TomNicholas reopened this Mar 27, 2024

TomNicholas added a commit that referenced this issue Mar 27, 2024

remove incorrect part of docs, see issue #18

2c5be3f

This was referenced Mar 27, 2024

Avoid auto creation of indexes in concat pydata/xarray#8872

Merged

Trying to write combined virtual dataset (for MUR SST) results in TypeError: Can only serialize wrapped arrays... #60

Closed

TomNicholas mentioned this issue Mar 28, 2024

"Inlining" data when writing references to disk #62

Open

TomNicholas mentioned this issue Apr 18, 2024

Regular (linspace) Coordinates/Index pydata/xarray#8473

Open

TomNicholas mentioned this issue May 10, 2024

Required changes in xarray to avoid creating indexes #14

Closed

TomNicholas added the xarray Requires changes to xarray upstream label May 16, 2024

TomNicholas mentioned this issue Jun 8, 2024

Aspirational use case: [C]Worthy mCDR OAE Atlas dataset #132

Open

19 tasks

TomNicholas mentioned this issue Jul 1, 2024

combine_by_coords with loaded cftime variable demotes dtype #168

Closed

TomNicholas added a commit that referenced this issue Jul 1, 2024

new test to show how #18 doesn't fully work yet

d5b12b6

TomNicholas mentioned this issue Jul 1, 2024

Warn on index creation #170

Merged

7 tasks

TomNicholas added a commit that referenced this issue Jul 2, 2024

Warn on index creation (#170)

2822961

* add warning when user passes indexes=None * expect warnings in tests * new test to show how #18 doesn't fully work yet * release notes * lint

This was referenced Aug 7, 2024

Loading data from ManifestArrays without saving references to disk first #124

Open

Use case: UCLA-ROMS fluid HPC model output #217

Open

TomNicholas mentioned this issue Aug 20, 2024

open_virtual_dataset with dmr++ #113

Merged

6 tasks

TomNicholas mentioned this issue Oct 2, 2024

Nested HDF5 Data / HEC-RAS fsspec/kerchunk#490

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inferring concatenation order from coordinate data values #18

Inferring concatenation order from coordinate data values #18

TomNicholas commented Mar 9, 2024 •

edited

Loading

TomNicholas commented Mar 11, 2024 •

edited

Loading

TomNicholas commented Mar 27, 2024 •

edited

Loading

benbovy commented Mar 28, 2024

dcherian commented Mar 28, 2024 •

edited

Loading

TomNicholas commented Mar 28, 2024

TomNicholas commented Mar 28, 2024

dcherian commented Mar 28, 2024

benbovy commented Mar 28, 2024 •

edited

Loading

TomNicholas commented Mar 28, 2024

benbovy commented Mar 28, 2024

TomNicholas commented Mar 28, 2024

benbovy commented Mar 28, 2024 •

edited

Loading

TomNicholas commented Mar 28, 2024

benbovy commented Mar 28, 2024 •

edited

Loading

TomNicholas commented Mar 29, 2024

benbovy commented Mar 29, 2024

TomNicholas commented Jul 1, 2024

Inferring concatenation order from coordinate data values #18

Inferring concatenation order from coordinate data values #18

Comments

TomNicholas commented Mar 9, 2024 • edited Loading

TomNicholas commented Mar 11, 2024 • edited Loading

TomNicholas commented Mar 27, 2024 • edited Loading

benbovy commented Mar 28, 2024

dcherian commented Mar 28, 2024 • edited Loading

TomNicholas commented Mar 28, 2024

TomNicholas commented Mar 28, 2024

dcherian commented Mar 28, 2024

benbovy commented Mar 28, 2024 • edited Loading

TomNicholas commented Mar 28, 2024

benbovy commented Mar 28, 2024

TomNicholas commented Mar 28, 2024

benbovy commented Mar 28, 2024 • edited Loading

TomNicholas commented Mar 28, 2024

benbovy commented Mar 28, 2024 • edited Loading

TomNicholas commented Mar 29, 2024

benbovy commented Mar 29, 2024

TomNicholas commented Jul 1, 2024

TomNicholas commented Mar 9, 2024 •

edited

Loading

TomNicholas commented Mar 11, 2024 •

edited

Loading

TomNicholas commented Mar 27, 2024 •

edited

Loading

dcherian commented Mar 28, 2024 •

edited

Loading

benbovy commented Mar 28, 2024 •

edited

Loading

benbovy commented Mar 28, 2024 •

edited

Loading

benbovy commented Mar 28, 2024 •

edited

Loading