Low memory/out-of-core index? #1650

Open
alimanfoo opened this issue Oct 23, 2017 · 17 comments

Comments

@alimanfoo
Contributor

Has anyone considered implementing an index for monotonic data that does not require loading all values into main memory?

Motivation: We have data where the first dimension can be ~100,000,000 in length, and the coordinates for this dimension are stored as 32-bit integers. Currently, if we used a pandas Index these would be cast to 64-bit integers, and the index would require ~1 GB of RAM. This isn't enormous, but it isn't negligible for people working on modest computers. Our use cases are simple: typically we only ever need to locate a slice of this dimension from a pair of coordinate values, i.e., we only need to do binary search (bisect) on the coordinates. In fact, binary search does not require loading the coordinate values into memory at all; they could be left on disk (e.g., in an HDF5 or Zarr dataset) and still give perfectly adequate performance for our needs.
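For illustration, a rough sketch of the kind of thing I mean (the file name, query values, and helper names are made up): bisect works on anything supporting __len__ and __getitem__, so each search only needs O(log n) single-value reads from disk.

import bisect
import zarr

class OnDiskCoord:
    """Sequence-like view of a sorted 1-d on-disk array; one value read per probe."""
    def __init__(self, arr):
        self.arr = arr
    def __len__(self):
        return self.arr.shape[0]
    def __getitem__(self, i):
        return self.arr[i]  # a single small read from disk

def locate_slice(arr, lo_val, hi_val):
    """Positional slice covering coordinate values in [lo_val, hi_val]."""
    view = OnDiskCoord(arr)
    start = bisect.bisect_left(view, lo_val)
    stop = bisect.bisect_right(view, hi_val)
    return slice(start, stop)

coord = zarr.open("coords.zarr", mode="r")  # hypothetical sorted coordinate
sel = locate_slice(coord, 1_000_000, 2_000_000)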

This is of course also relevant to pandas, but I thought I'd post here as I know there have been some discussions about how to handle indexes when working with larger datasets via dask.

@alimanfoo
Contributor Author

Just to add a further thought: the upper levels of the binary search tree could be cached to get faster performance for repeated searches.
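Something like this, say (just a sketch building on the idea above, using lru_cache from the standard library; names are made up). Bisect always probes the same positions near the top of the tree (midpoint, quartiles, ...), so a small cache of recently read values keeps the upper levels in memory across queries:

from functools import lru_cache

class CachedOnDiskCoord:
    """Like the on-disk view above, but memoize values at recently probed positions."""
    def __init__(self, arr, maxsize=1024):
        self.arr = arr
        self._read = lru_cache(maxsize=maxsize)(lambda i: arr[i])
    def __len__(self):
        return self.arr.shape[0]
    def __getitem__(self, i):
        return self._read(i)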

@fmaussion
Member

fmaussion commented Oct 23, 2017

Has anyone considered implementing an index for monotonic data that does not require loading all values into main memory?

But this is already the case? See #1017

With on-file datasets I think it is sufficient to use drop_variables when opening the dataset in order not to parse the coordinates:

ds = xr.open_dataset(f, drop_variables=['lon', 'lat'])

@rabernat
Contributor

This is related to the performance issue documented in #1385.

@alimanfoo
Contributor Author

alimanfoo commented Oct 23, 2017 via email

@shoyer
Member

shoyer commented Oct 23, 2017

This should be easier after the index/coordinates separation envisioned in #1603. We could potentially define a basic index API (based on what we currently use from pandas) and allow alternative index implementations. There are certainly other use cases where going beyond pandas makes sense -- a KDTree for indexing geospatial data is one obvious example.

@alimanfoo
Contributor Author

Index API sounds good.

Also I was just looking at dask.dataframe indexing, there .loc is implemented using information about index values at the boundaries of each partition (chunk). Not sure xarray should use same strategy for chunked datasets, but is another approach to avoid loading indexes into memory.
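Roughly, the idea (just an illustration, not dask's actual code): keep only the index value at each partition boundary in memory, and bisect over those boundaries to decide which partitions a label query needs to touch.

import bisect

# index value at the start of each partition, plus the final value
# (dask calls these "divisions"); the numbers here are made up
divisions = [0, 1000, 2000, 3000, 4000]

def partitions_for(lo, hi):
    """Partition numbers that may contain labels in [lo, hi]."""
    first = max(bisect.bisect_right(divisions, lo) - 1, 0)
    last = min(bisect.bisect_right(divisions, hi) - 1, len(divisions) - 2)
    return range(first, last + 1)

# e.g. partitions_for(1500, 2500) -> partitions 1 and 2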

@stale

stale bot commented Oct 22, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@d70-t
Contributor

d70-t commented Apr 21, 2021

I'd be interested in this kind of thing as well. 👍

We have long time series data, which we would like to access via OPeNDAP or Zarr over HTTP. Currently, the time coordinate variable alone is more than 1 GB in size, which makes loading the dataset very slow or even impossible given the limitations of the OPeNDAP server and my home internet connection. Nonetheless, we know that the timestamps are in order and reasonably close to equidistant, so binary search or even interpolation search should be a quick way to find the right indices.
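For what it's worth, a sketch of interpolation search over a sorted, roughly equidistant coordinate (the function and names are just illustrative; arr could be e.g. a zarr array, so each probe is one small remote read):

def interp_searchsorted(arr, target):
    """First index i with arr[i] >= target, assuming arr is sorted and ~equidistant."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        a, b = arr[lo], arr[hi]  # two small reads per iteration
        if target <= a:
            return lo
        if target > b:
            return hi + 1
        # probe where the target would sit if the values were exactly equidistant
        pos = lo + int((hi - lo) * (target - a) / (b - a))
        pos = min(max(pos, lo), hi)
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return lo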

@dcherian
Contributor

@lsetiawan has a cool use case of opening a time dimension with a trillion elements stored in Zarr: https://github.com/lsetiawan/ooi-indexing/blob/main/demo.ipynb

cc @scottyhq

@lsetiawan
Contributor

Thanks @dcherian. I've modified the notebook after discovering some issues in the data; however, I was still able to try some things with the half of the data that works (~283.7 GB of total data, with ~39.13 GB of time-dimension data). I started putting the small-chunk-lookups idea into the notebook, but I'm struggling to figure out exactly what's going on under the hood with the flexible indexing (when the data should actually get fetched, etc.). It's getting there, though. Hopefully it's a good starting point for developing a nice DaskIndex 😄!

@ilan-gold
Contributor

Perhaps a dumb question - why do the coords (i.e., the "index") need to be loaded into memory to create the Dataset/DataArray objects?

@benbovy
Member

benbovy commented Aug 28, 2023

Here is a prototype of a DaskIndex: notebook

I also added it to the list in #7041.

It has basic support for label-based selection, where query labels may be slices or scalar values. It assumes that coordinate data is monotonic, and of course queries are not nearly as efficient as with, e.g., a pandas index. But data selection is fully lazy! I guess it would still be possible to implement some sort of basic data structure and/or cache to speed up the process?

It doesn't work well for alignment, but I doubt that we want automatic-alignment support for such an index anyway, as this could trigger costly re-indexing operations and load huge amounts of data.

The implementation is mostly taken from another index prototype based on numpy. I guess we could merge the two and provide a generic ArrayIndex that would work with a 1-d duck array (lazy or not)? If you think this would be a good fit in Xarray as an alternative to PandasIndex, I'm happy to submit a PR.
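As an aside, for the monotonic case the positional lookup itself can also be expressed with dask (just a sketch, not the prototype's code; the coordinate here is a stand-in):

import dask.array as da

coord = da.arange(100_000_000, chunks=1_000_000)  # stand-in for a lazy, monotonic coordinate

def label_slice(coord, lo, hi):
    """Positional slice for labels in [lo, hi]; dask decides which chunks to read at compute time."""
    start = da.searchsorted(coord, da.asarray([lo]), side="left")
    stop = da.searchsorted(coord, da.asarray([hi]), side="right")
    return slice(int(start.compute()[0]), int(stop.compute()[0]))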

@benbovy
Member

benbovy commented Aug 28, 2023

I guess it would still be possible to implement some sort of basic data structure and/or cache to speed up the process?

I missed @lsetiawan and @dcherian's prototype (#1650 (comment)), which implements some lookup structure. Interesting!

Perhaps a dumb question - why does the coords (i.e., "index") need to be loaded into memory to create the Dataset/DataArray objects?

Only the coordinates with a PandasIndex, since a pd.Index is stored fully in memory.

@ilan-gold
Contributor

Hi @benbovy, I set up a little example. Perhaps I have made a mistake:

import fsspec, zarr
import xarray as xr
import dask.array as da
import numpy as np

g = zarr.open_group('./')
g['coords'] = zarr.array(np.arange(0,10000), chunks=(1000,))
g['data'] = zarr.array(np.arange(0,10000), chunks=(1000,))

class AccessTrackingStore(zarr.LRUStoreCache):  # tracks first access to elements
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def __getitem__(self, key):
        if key not in self._values_cache:
            print(key)
        return super().__getitem__(key)

mapper = fsspec.get_mapper('./')
store = AccessTrackingStore(mapper, max_size=2**28)
g = zarr.open_group(store)
data = da.from_zarr(g['data'])
coords = da.from_zarr(g['coords'])

Up until this point, only metadata should have been accessed. But then:

data_array = xr.DataArray(data, coords=[coords], dims=['dim'])

The above code will print out all of the chunks backing the coords data because it is reading the whole object into memory, it seems. Perhaps I am missing something, though.

@benbovy
Member

benbovy commented Aug 29, 2023

@ilan-gold data_array = xr.DataArray(data, coords=[coords], dims=['dim']) by default creates pandas indexes for each dimension coordinate found in coords, which loads the whole coordinate data into memory.

Something like this would skip the creation of those default indexes:

coords_obj = xr.Coordinates({"dim": coords}, indexes={})

data_array = xr.DataArray(data, coords=coords_obj, dims=['dim'])

Then you can set a dask (lazy) index on data_array for the dim coordinate.

EDIT: this only works with the latest Xarray release, v2023.08.0.

@ilan-gold
Contributor

@benbovy The above errors out for me when appended directly to the end of my first code block:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 coords_obj = xr.Coordinates({"dim": coords}, indexes={})

File ~/Projects/Theis/anndata/venv/lib/python3.10/site-packages/xarray/core/coordinates.py:235, in Coordinates.__init__(self, coords, indexes)
    233     variables = {k: v.copy() for k, v in coords.variables.items()}
    234 else:
--> 235     variables = {k: as_variable(v) for k, v in coords.items()}
    237 if indexes is None:
    238     indexes = {}

File ~/Projects/Theis/anndata/venv/lib/python3.10/site-packages/xarray/core/coordinates.py:235, in <dictcomp>(.0)
    233     variables = {k: v.copy() for k, v in coords.variables.items()}
    234 else:
--> 235     variables = {k: as_variable(v) for k, v in coords.items()}
    237 if indexes is None:
    238     indexes = {}

File ~/Projects/Theis/anndata/venv/lib/python3.10/site-packages/xarray/core/variable.py:154, in as_variable(obj, name)
    152     obj = Variable(name, data, fastpath=True)
    153 else:
--> 154     raise TypeError(
    155         f"Variable {name!r}: unable to convert object into a variable without an "
    156         f"explicit list of dimensions: {obj!r}"
    157     )
    159 if name is not None and name in obj.dims and obj.ndim == 1:
    160     # automatically convert the Variable into an Index
    161     obj = obj.to_index_variable()

TypeError: Variable None: unable to convert object into a variable without an explicit list of dimensions: dask.array<from-zarr, shape=(10000,), dtype=int64, chunksize=(1000,), chunktype=numpy.ndarray>

I looked into this a bit. The error seems to come from the lack of a name being passed to as_variable, but it seems that even if I do pass one in, the data is still read into memory:

In [4]: xr.as_variable(coords, name='dim')
coords/0
coords/1
coords/2
coords/3
coords/4
coords/5
coords/6
coords/7
coords/8
coords/9
Out[4]:
<xarray.IndexVariable 'dim' (dim: 10000)>
array([   0,    1,    2, ..., 9997, 9998, 9999])

@benbovy
Member

benbovy commented Aug 29, 2023

Ah yes, sorry, it is actually a bit messy as there are multiple open pull requests fixing those issues.

So it should work with #8094 and

coords_obj = xr.Coordinates({"dim": ("dim", coords)}, indexes={})

data_array = xr.DataArray(data, coords=coords_obj, dims=['dim'])
