Low memory/out-of-core index? #1650

Open
alimanfoo opened this issue Oct 23, 2017 · 17 comments

Comments

@alimanfoo
Contributor

Has anyone considered implementing an index for monotonic data that does not require loading all values into main memory?

Motivation: We have data where the first dimension can be ~100,000,000 in length, and the coordinates for this dimension are stored as 32-bit integers. Currently, if we used a pandas Index these would be cast to 64-bit integers, and the index would require ~1 GB of RAM. This isn't enormous, but it isn't negligible for people working on modest computers. Our use cases are simple: typically we only ever need to locate a slice of this dimension from a pair of coordinate values, i.e., we only need to do binary search (bisect) on the coordinates. In fact, binary search does not require loading the coordinate values into memory at all; they could be left on disk (e.g., in an HDF5 or Zarr dataset) and still give perfectly adequate performance for our needs.
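For illustration, a rough sketch of the kind of thing I mean (the file name, query values, and helper names are made up): bisect works on anything supporting __len__ and __getitem__, so each search only needs O(log n) single-value reads from disk.

import bisect
import zarr

class OnDiskCoord:
    """Sequence-like view of a sorted 1-d on-disk array; one value read per probe."""
    def __init__(self, arr):
        self.arr = arr
    def __len__(self):
        return self.arr.shape[0]
    def __getitem__(self, i):
        return self.arr[i]  # a single small read from disk

def locate_slice(arr, lo_val, hi_val):
    """Positional slice covering coordinate values in [lo_val, hi_val]."""
    view = OnDiskCoord(arr)
    start = bisect.bisect_left(view, lo_val)
    stop = bisect.bisect_right(view, hi_val)
    return slice(start, stop)

coord = zarr.open("coords.zarr", mode="r")  # hypothetical sorted coordinate
sel = locate_slice(coord, 1_000_000, 2_000_000)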

This is of course also relevant to pandas, but I thought I'd post here as I know there have been some discussions about how to handle indexes when working with larger datasets via dask.

@alimanfoo
Contributor Author

Just to add a further thought: the upper levels of the binary search tree could be cached to get faster performance for repeated searches.
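Something like this, say (just a sketch building on the idea above, using lru_cache from the standard library; names are made up). Bisect always probes the same positions near the top of the tree (midpoint, quartiles, ...), so a small cache of recently read values keeps the upper levels in memory across queries:

from functools import lru_cache

class CachedOnDiskCoord:
    """Like the on-disk view above, but memoize values at recently probed positions."""
    def __init__(self, arr, maxsize=1024):
        self.arr = arr
        self._read = lru_cache(maxsize=maxsize)(lambda i: arr[i])
    def __len__(self):
        return self.arr.shape[0]
    def __getitem__(self, i):
        return self._read(i)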

@fmaussion
Member

fmaussion commented Oct 23, 2017

Has anyone considered implementing an index for monotonic data that does not require loading all values into main memory?

But this is already the case? See #1017

With on-file datasets I think it is sufficient to use drop_variables when opening the dataset in order not to parse the coordinates:

ds = xr.open_dataset(f, drop_variables=['lon', 'lat'])

@rabernat
Contributor

This is related to the performance issue documented in #1385.

@alimanfoo
Contributor Author

alimanfoo commented Oct 23, 2017 via email

@shoyer
Member

shoyer commented Oct 23, 2017

This should be easier after the index/coordinates separation envisioned in #1603. We could potentially define a basic index API (based on what we currently use from pandas) and allow alternative index implementations. There are certainly other use cases where going beyond pandas makes sense -- a KDTree for indexing geospatial data is one obvious example.

@alimanfoo
Contributor Author

Index API sounds good.

Also I was just looking at dask.dataframe indexing, there .loc is implemented using information about index values at the boundaries of each partition (chunk). Not sure xarray should use same strategy for chunked datasets, but is another approach to avoid loading indexes into memory.
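Roughly, the idea (just an illustration, not dask's actual code): keep only the index value at each partition boundary in memory, and bisect over those boundaries to decide which partitions a label query needs to touch.

import bisect

# index value at the start of each partition, plus the final value
# (dask calls these "divisions"); the numbers here are made up
divisions = [0, 1000, 2000, 3000, 4000]

def partitions_for(lo, hi):
    """Partition numbers that may contain labels in [lo, hi]."""
    first = max(bisect.bisect_right(divisions, lo) - 1, 0)
    last = min(bisect.bisect_right(divisions, hi) - 1, len(divisions) - 2)
    return range(first, last + 1)

# e.g. partitions_for(1500, 2500) -> partitions 1 and 2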

@stale

stale bot commented Oct 22, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@d70-t
Contributor

d70-t commented Apr 21, 2021

I'd be interested in this kind of thing as well. 👍

We have long time series data, which we would like to access via OPeNDAP or Zarr over HTTP. Currently, the time coordinate variable alone is more than 1 GB in size, which makes loading the dataset very slow or even impossible given the limitations of the OPeNDAP server and my home internet connection. Nonetheless, we know that the timestamps are in order and reasonably close to equidistant, so binary search or even interpolation search should be a quick way to find the right indices.
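For what it's worth, a sketch of interpolation search over a sorted, roughly equidistant coordinate (the function and names are just illustrative; arr could be e.g. a zarr array, so each probe is one small remote read):

def interp_searchsorted(arr, target):
    """First index i with arr[i] >= target, assuming arr is sorted and ~equidistant."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        a, b = arr[lo], arr[hi]  # two small reads per iteration
        if target <= a:
            return lo
        if target > b:
            return hi + 1
        # probe where the target would sit if the values were exactly equidistant
        pos = lo + int((hi - lo) * (target - a) / (b - a))
        pos = min(max(pos, lo), hi)
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return lo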

@dcherian
Contributor

@lsetiawan has a cool use case of opening a time dimension with a trillion elements stored in Zarr: https://github.com/lsetiawan/ooi-indexing/blob/main/demo.ipynb

cc @scottyhq

@lsetiawan
Contributor

Thanks @dcherian. I've modified the notebook after discovering some issues in the data; however, I was still able to try some things with the half of the data that works (~283.7 GB of total data, with ~39.13 GB of time-dimension data). I started putting the small-chunk-lookups idea into the notebook, but I'm struggling to figure out exactly what's going on under the hood with the flexible indexing (when the data should actually get fetched, etc.). It's getting there, though. Hopefully it's a good starting point for developing a nice DaskIndex 😄!

@ilan-gold
Contributor

Perhaps a dumb question - why do the coords (i.e., the "index") need to be loaded into memory to create the Dataset/DataArray objects?

@benbovy
Member

benbovy commented Aug 28, 2023

Here is a prototype of a DaskIndex: notebook

I also added it to the list in #7041.

It has basic support for label-based selection, where query labels may be slices or scalar values. It assumes that coordinate data is monotonic, and of course queries are not nearly as efficient as with, e.g., a pandas index. But data selection is fully lazy! I guess it would still be possible to implement some sort of basic data structure and/or cache to speed up the process?

It doesn't work well for alignment, but I doubt that we want automatic-alignment support for such an index anyway, as this could trigger costly re-indexing operations and load huge amounts of data.

The implementation is mostly taken from another index prototype based on numpy. I guess we could merge the two and provide a generic ArrayIndex that would work with a 1-d duck array (lazy or not)? If you think this would be a good fit in Xarray as an alternative to PandasIndex, I'm happy to submit a PR.
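As an aside, for the monotonic case the positional lookup itself can also be expressed with dask (just a sketch, not the prototype's code; the coordinate here is a stand-in):

import dask.array as da

coord = da.arange(100_000_000, chunks=1_000_000)  # stand-in for a lazy, monotonic coordinate

def label_slice(coord, lo, hi):
    """Positional slice for labels in [lo, hi]; dask decides which chunks to read at compute time."""
    start = da.searchsorted(coord, da.asarray([lo]), side="left")
    stop = da.searchsorted(coord, da.asarray([hi]), side="right")
    return slice(int(start.compute()[0]), int(stop.compute()[0]))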

@benbovy
Member

benbovy commented Aug 28, 2023

I guess it would still be possible to implement some sort of basic data structure and/or cache to speed up the process?

I missed @lsetiawan and @dcherian's prototype (#1650 (comment)), which implements some lookup structure. Interesting!

Perhaps a dumb question - why does the coords (i.e., "index") need to be loaded into memory to create the Dataset/DataArray objects?

Only the coordinates with a PandasIndex, since a pd.Index is stored fully in memory.

@ilan-gold
Contributor

Hi @benbovy, I set up a little example. Perhaps I have made a mistake:

import fsspec, zarr
import xarray as xr
import dask.array as da
import numpy as np

g = zarr.open_group('./')
g['coords'] = zarr.array(np.arange(0,10000), chunks=(1000,))
g['data'] = zarr.array(np.arange(0,10000), chunks=(1000,))

class AccessTrackingStore(zarr.LRUStoreCache):  # tracks first access to elements
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def __getitem__(self, key):
        if key not in self._values_cache:
            print(key)
        return super().__getitem__(key)

mapper = fsspec.get_mapper('./')
store = AccessTrackingStore(mapper, max_size=2**28)
g = zarr.open_group(store)
data = da.from_zarr(g['data'])
coords = da.from_zarr(g['coords'])

Up until this point, only metadata should have been accessed. But then:

data_array = xr.DataArray(data, coords=[coords], dims=['dim'])

The above code will print out all of the chunks backing the coords data because it is reading the whole object into memory, it seems. Perhaps I am missing something, though.

@benbovy
Member

benbovy commented Aug 29, 2023

@ilan-gold data_array = xr.DataArray(data, coords=[coords], dims=['dim']) by default creates pandas indexes for each dimension coordinate found in coords, which loads the whole coordinate data into memory.

Something like this would skip the creation of those default indexes:

coords_obj = xr.Coordinates({"dim": coords}, indexes={})

data_array = xr.DataArray(data, coords=coords_obj, dims=['dim'])

Then you can set a dask (lazy) index on data_array for the dim coordinate.

EDIT: this only works with the latest Xarray release, v2023.08.0.

@ilan-gold
Contributor

@benbovy The above errors out for me when appended directly to the end of my first code block:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 coords_obj = xr.Coordinates({"dim": coords}, indexes={})

File ~/Projects/Theis/anndata/venv/lib/python3.10/site-packages/xarray/core/coordinates.py:235, in Coordinates.__init__(self, coords, indexes)
    233     variables = {k: v.copy() for k, v in coords.variables.items()}
    234 else:
--> 235     variables = {k: as_variable(v) for k, v in coords.items()}
    237 if indexes is None:
    238     indexes = {}

File ~/Projects/Theis/anndata/venv/lib/python3.10/site-packages/xarray/core/coordinates.py:235, in <dictcomp>(.0)
    233     variables = {k: v.copy() for k, v in coords.variables.items()}
    234 else:
--> 235     variables = {k: as_variable(v) for k, v in coords.items()}
    237 if indexes is None:
    238     indexes = {}

File ~/Projects/Theis/anndata/venv/lib/python3.10/site-packages/xarray/core/variable.py:154, in as_variable(obj, name)
    152     obj = Variable(name, data, fastpath=True)
    153 else:
--> 154     raise TypeError(
    155         f"Variable {name!r}: unable to convert object into a variable without an "
    156         f"explicit list of dimensions: {obj!r}"
    157     )
    159 if name is not None and name in obj.dims and obj.ndim == 1:
    160     # automatically convert the Variable into an Index
    161     obj = obj.to_index_variable()

TypeError: Variable None: unable to convert object into a variable without an explicit list of dimensions: dask.array<from-zarr, shape=(10000,), dtype=int64, chunksize=(1000,), chunktype=numpy.ndarray>

I looked into this a bit. The error seems to come from the lack of a name being passed to as_variable, but it seems that even if I do pass one in, the data is still read into memory:

In [4]: xr.as_variable(coords, name='dim')
coords/0
coords/1
coords/2
coords/3
coords/4
coords/5
coords/6
coords/7
coords/8
coords/9
Out[4]:
<xarray.IndexVariable 'dim' (dim: 10000)>
array([   0,    1,    2, ..., 9997, 9998, 9999])

@benbovy
Member

benbovy commented Aug 29, 2023

Ah yes, sorry, it is actually a bit messy as there are multiple open pull requests fixing those issues.

So it should work with #8094 and

coords_obj = xr.Coordinates({"dim": ("dim", coords)}, indexes={})

data_array = xr.DataArray(data, coords=coords_obj, dims=['dim'])
