Supporting out-of-core computation/indexing for very large indexes #1094

Open
benbovy opened this issue Nov 8, 2016 · 5 comments

@benbovy
Member

benbovy commented Nov 8, 2016

(Follow-up of the discussion in #1024 (comment).)

xarray + dask.array successfully enable out-of-core computation for very large variables that don't fit in memory. One current limitation is that the indexes of a Dataset or DataArray, which rely on pandas.Index, are still fully loaded into memory (and after #1024 they will be loaded eagerly). In many cases this is not a problem, as the sizes of 1-dimensional indexes are usually much smaller than the sizes of n-dimensional variables or coordinates.

However, this may be problematic in some specific cases where we have to deal with very large indexes. As an example, big unstructured meshes often have coordinates (x, y, z) arranged as 1-d arrays whose length equals the number of nodes, which can be very large! (See, e.g., the UGRID conventions.)

It would be very nice if xarray could also help with these use cases. I'm therefore wondering if (and how) out-of-core support could be extended to indexes and indexing.

I've briefly looked at the documentation of dask.dataframe, and a first naive approach I have in mind would be to allow partitioning an index into multiple contiguous indexes. For label-based indexing, we could for example map indexing.convert_label_indexer over each partition and combine the returned indexers, as sketched below.
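To make the idea concrete, here is a minimal, hypothetical sketch of the partitioning scheme. It does not use xarray internals (indexing.convert_label_indexer is internal API); pandas.Index.get_indexer stands in for the per-partition lookup, and the partitions and labels are made up:

import numpy as np
import pandas as pd

# The index is split into contiguous in-memory pandas.Index partitions.
partitions = [
    pd.Index(np.arange(0, 1000)),
    pd.Index(np.arange(1000, 2000)),
    pd.Index(np.arange(2000, 3000)),
]

def sel_labels(parts, labels):
    """Map a label query over each partition and combine the results
    into global integer positions."""
    positions = []
    offset = 0
    for part in parts:
        local = part.get_indexer(labels)        # -1 where a label is absent
        positions.append(local[local >= 0] + offset)
        offset += len(part)
    return np.concatenate(positions)

print(sel_labels(partitions, [5, 1500, 2999]))  # -> [   5 1500 2999]

With dask, each per-partition lookup could presumably become a delayed task, so only the partitions that actually contain the requested labels need to be materialized.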

My knowledge of dask is very limited, though, so I've no doubt that this suggestion is simplistic and not very efficient, or that there are better approaches. I'm also certainly missing other issues not directly related to indexing.

Any thoughts?

cc @shoyer @mrocklin

@shoyer
Member

shoyer commented Nov 8, 2016

For unstructured meshes of points, pandas.MultiIndex is not the right abstraction.

Suppose you have a (very long) list of sorted points (x, y, z) in a multi-index. You can efficiently query within fixed bounds along x by doing a binary search, but for queries in y and z you cannot do any better than scanning the entire list. Moreover, pandas.MultiIndex factorizes each level into unique values, which is a complete waste on an unstructured grid where few coordinates overlap.
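The asymmetry can be illustrated with plain NumPy (a small sketch with made-up data, sorted along x only):

import numpy as np

n = 1_000_000
points = np.random.rand(n, 3)
points = points[np.argsort(points[:, 0])]   # sort the points by x

# Bounds on x: O(log n) binary search on the sorted column.
lo, hi = np.searchsorted(points[:, 0], [0.25, 0.75])
in_x_bounds = points[lo:hi]

# Bounds on y: nothing better than an O(n) scan over every point.
in_y_bounds = points[(points[:, 1] >= 0.25) & (points[:, 1] <= 0.75)]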

For unstructured meshes, you need something like a KDTree (see the discussion in #475), ideally with nearby points in space stored in contiguous array chunks.

I would start by trying to get an in-memory KDTree working, and switch to something out-of-core only when/if necessary. For example, SciPy's cKDTree can load 1e7 points in 3 dimensions in only a few seconds:

import numpy as np
import scipy.spatial

x = np.random.rand(int(1e7), 3)
%time tree = scipy.spatial.cKDTree(x, leafsize=100)
# CPU times: user 2.58 s, sys: 0 ns, total: 2.58 s
# Wall time: 2.55 s

That might be good enough.
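For what it's worth, querying such a tree is roughly what a nearest-point selection on spatial coordinates would boil down to; a small sketch continuing the snippet above (the query points are made up, and the sample is smaller so it runs quickly):

import numpy as np
import scipy.spatial

x = np.random.rand(int(1e5), 3)
tree = scipy.spatial.cKDTree(x, leafsize=100)

queries = np.array([[0.1, 0.2, 0.3],
                    [0.9, 0.9, 0.9]])
# Nearest-neighbour lookup: what a KDTree-backed index would need in order
# to support something like .sel(..., method='nearest') on spatial coords.
distances, positions = tree.query(queries, k=1)
print(positions)  # integer positions into x, usable as a positional indexer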

@benbovy
Member Author

benbovy commented Nov 8, 2016

Yes, I understand that using a pandas.MultiIndex in such a case is not efficient at all, both for indexing and in terms of memory usage and index building.

My example was actually not complete, since I also have categorical indexes, such as a few regions defined in space (with complex geometries) and node types (e.g., boundary, active, inactive). Sorry for not mentioning that.

A KDTree is indeed good for indexing on spatial coordinates. Looking at the API you suggest in #475, my (2-d) mesh might look like this:

>>> ds
<xray.Dataset>
Dimensions:         (node: 10000000)
Coordinates:
  * node
  - region          (node) int32 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...
  - node_type       (node) object 'b' 'b' 'b' 'b' 'a' 'a' 'a' 'a' 'a' 'a' 'a' ...
  * spatial_index   (node) KDTree
  - x               (node) float64 49.754 56.823 65.765 93.058 96.691 105.832 ...
  - y               (node) float64 37.582 45.769 58.672 77.029 82.983 99.672 ...
Data variables:
    topo_elevation  (node) float64 57.352 48.710 47.080 49.072 33.184 54.833 ...
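As a side note, something close to that selection can already be emulated today by building the tree by hand outside of xarray; a rough sketch with made-up data (the spatial_index coordinate above is what would turn this into a first-class .sel call):

import numpy as np
import scipy.spatial
import xarray as xr

n = 1000  # small stand-in for the real mesh
ds = xr.Dataset(
    {"topo_elevation": ("node", np.random.rand(n))},
    coords={
        "x": ("node", np.random.rand(n) * 100),
        "y": ("node", np.random.rand(n) * 100),
        "region": ("node", np.random.randint(1, 4, n)),
        "node_type": ("node", np.random.choice(["a", "b"], n)),
    },
)

# Build the spatial index by hand and pick the node nearest to a point;
# a KDTree-backed spatial_index would make this a plain .sel().
tree = scipy.spatial.cKDTree(np.column_stack([ds.x.values, ds.y.values]))
_, nearest = tree.query([50.0, 50.0])
print(ds.isel(node=nearest))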

Anyway, maybe I've opened this issue a bit too early, since my data still fits into memory, though it is likely that I'll have to deal with meshes of 1e8 to 1e9 nodes in the near future.

Side note: I don't know why I get much worse performance on my machine when building the KDTree (Intel(R) Xeon(R) CPU 5160 x4 @ 3.00GHz, 16 GB RAM, scipy 0.18.1, numpy 1.11.2):

In [1]: import numpy as np

In [2]: import scipy.spatial

In [3]: x = np.random.rand(int(1e7), 3)

In [4]: %time tree = scipy.spatial.cKDTree(x, leafsize=100)
CPU times: user 38 s, sys: 64 ms, total: 38.1 s
Wall time: 38.1 s

@rabernat
Contributor

rabernat commented Nov 8, 2016

My cKDTree times were:

- 19.2 s on a 32-core Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz, 512 GB RAM
- 23 s on my MacBook (1.7 GHz Intel Core i7)

@stale

stale bot commented Jan 25, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Jan 25, 2019
@stale stale bot closed this as completed Feb 24, 2019
@shoyer shoyer reopened this Feb 24, 2019
@stale stale bot removed the stale label Feb 24, 2019
@TomAugspurger
Contributor

Should this and #1650 be consolidated into a single issue? I think they're duplicates of each other.
