Supporting out-of-core computation/indexing for very large indexes #1094

Open
benbovy opened this issue Nov 8, 2016 · 5 comments

@benbovy
Member

benbovy commented Nov 8, 2016

(Follow-up of the discussion in #1024 (comment).)

xarray + dask.array successfully enable out-of-core computation for very large variables that don't fit in memory. One current limitation is that the indexes of a Dataset or DataArray, which rely on pandas.Index, are still fully loaded into memory (and after #1024 they will be loaded eagerly). In many cases this is not a problem, as the sizes of 1-dimensional indexes are usually much smaller than the sizes of n-dimensional variables or coordinates.

However, this may be problematic in some specific cases where we have to deal with very large indexes. As an example, big unstructured meshes often have coordinates (x, y, z) arranged as 1-d arrays whose length equals the number of nodes, which can be very large! (See, e.g., the UGRID conventions.)

It would be very nice if xarray could also help with these use cases. I'm therefore wondering if (and how) out-of-core support could be extended to indexes and indexing.

I've briefly looked at the documentation of dask.dataframe, and a first naive approach I have in mind would be to allow partitioning an index into multiple contiguous indexes. For label-based indexing, we could for example map indexing.convert_label_indexer over each partition and combine the returned indexers, as sketched below.
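To make the idea concrete, here is a minimal, hypothetical sketch of the partitioning scheme. It does not use xarray internals (indexing.convert_label_indexer is internal API); pandas.Index.get_indexer stands in for the per-partition lookup, and the partitions and labels are made up:

import numpy as np
import pandas as pd

# The index is split into contiguous in-memory pandas.Index partitions.
partitions = [
    pd.Index(np.arange(0, 1000)),
    pd.Index(np.arange(1000, 2000)),
    pd.Index(np.arange(2000, 3000)),
]

def sel_labels(parts, labels):
    """Map a label query over each partition and combine the results
    into global integer positions."""
    positions = []
    offset = 0
    for part in parts:
        local = part.get_indexer(labels)        # -1 where a label is absent
        positions.append(local[local >= 0] + offset)
        offset += len(part)
    return np.concatenate(positions)

print(sel_labels(partitions, [5, 1500, 2999]))  # -> [   5 1500 2999]

With dask, each per-partition lookup could presumably become a delayed task, so only the partitions that actually contain the requested labels need to be materialized.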

My knowledge of dask is very limited, though, so I've no doubt that this suggestion is simplistic and not very efficient, or that there are better approaches. I'm also certainly missing other issues not directly related to indexing.

Any thoughts?

cc @shoyer @mrocklin

@shoyer
Member

shoyer commented Nov 8, 2016

For unstructured meshes of points, pandas.MultiIndex is not the right abstraction.

Suppose you have a (very long) list of sorted points (x, y, z) in a multi-index. You can efficiently query within fixed bounds along x by doing a binary search, but for queries in y and z you cannot do any better than scanning the entire list. Moreover, pandas.MultiIndex factorizes each level into unique values, which is a complete waste on an unstructured grid where few coordinates overlap.
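The asymmetry can be illustrated with plain NumPy (a small sketch with made-up data, sorted along x only):

import numpy as np

n = 1_000_000
points = np.random.rand(n, 3)
points = points[np.argsort(points[:, 0])]   # sort the points by x

# Bounds on x: O(log n) binary search on the sorted column.
lo, hi = np.searchsorted(points[:, 0], [0.25, 0.75])
in_x_bounds = points[lo:hi]

# Bounds on y: nothing better than an O(n) scan over every point.
in_y_bounds = points[(points[:, 1] >= 0.25) & (points[:, 1] <= 0.75)]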

For unstructured meshes, you need something like a KDTree (see the discussion in #475), ideally with nearby points in space stored in contiguous array chunks.

I would start by trying to get an in-memory KDTree working, and switch to something out-of-core only when/if necessary. For example, SciPy's cKDTree can load 1e7 points in 3 dimensions in only a few seconds:

import numpy as np
import scipy.spatial

x = np.random.rand(int(1e7), 3)
%time tree = scipy.spatial.cKDTree(x, leafsize=100)
# CPU times: user 2.58 s, sys: 0 ns, total: 2.58 s
# Wall time: 2.55 s

That might be good enough.
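For what it's worth, querying such a tree is roughly what a nearest-point selection on spatial coordinates would boil down to; a small sketch continuing the snippet above (the query points are made up, and the sample is smaller so it runs quickly):

import numpy as np
import scipy.spatial

x = np.random.rand(int(1e5), 3)
tree = scipy.spatial.cKDTree(x, leafsize=100)

queries = np.array([[0.1, 0.2, 0.3],
                    [0.9, 0.9, 0.9]])
# Nearest-neighbour lookup: what a KDTree-backed index would need in order
# to support something like .sel(..., method='nearest') on spatial coords.
distances, positions = tree.query(queries, k=1)
print(positions)  # integer positions into x, usable as a positional indexer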

@benbovy
Member Author

benbovy commented Nov 8, 2016

Yes, I understand that using a pandas.MultiIndex in such a case is not efficient at all, both for indexing and in terms of memory usage and index building.

My example was actually not complete, since I also have categorical indexes, such as a few regions defined in space (with complex geometries) and node types (e.g., boundary, active, inactive). Sorry for not mentioning that.

A KDTree is indeed good for indexing on spatial coordinates. Looking at the API you suggest in #475, my (2-d) mesh might look like this:

>>> ds
<xray.Dataset>
Dimensions:         (node: 10000000)
Coordinates:
  * node
  - region          (node) int32 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...
  - node_type       (node) object 'b' 'b' 'b' 'b' 'a' 'a' 'a' 'a' 'a' 'a' 'a' ...
  * spatial_index   (node) KDTree
  - x               (node) float64 49.754 56.823 65.765 93.058 96.691 105.832 ...
  - y               (node) float64 37.582 45.769 58.672 77.029 82.983 99.672 ...
Data variables:
    topo_elevation  (node) float64 57.352 48.710 47.080 49.072 33.184 54.833 ...
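As a side note, something close to that selection can already be emulated today by building the tree by hand outside of xarray; a rough sketch with made-up data (the spatial_index coordinate above is what would turn this into a first-class .sel call):

import numpy as np
import scipy.spatial
import xarray as xr

n = 1000  # small stand-in for the real mesh
ds = xr.Dataset(
    {"topo_elevation": ("node", np.random.rand(n))},
    coords={
        "x": ("node", np.random.rand(n) * 100),
        "y": ("node", np.random.rand(n) * 100),
        "region": ("node", np.random.randint(1, 4, n)),
        "node_type": ("node", np.random.choice(["a", "b"], n)),
    },
)

# Build the spatial index by hand and pick the node nearest to a point;
# a KDTree-backed spatial_index would make this a plain .sel().
tree = scipy.spatial.cKDTree(np.column_stack([ds.x.values, ds.y.values]))
_, nearest = tree.query([50.0, 50.0])
print(ds.isel(node=nearest))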

Anyway, maybe I've opened this issue a bit too early, since my data still fits into memory, though it is likely that I'll have to deal with meshes of 1e8 to 1e9 nodes in the near future.

Side note: I don't know why I get much worse performance on my machine when building the KDTree (Intel(R) Xeon(R) CPU 5160 x4 @ 3.00GHz, 16 GB RAM, scipy 0.18.1, numpy 1.11.2):

In [1]: import numpy as np

In [2]: import scipy.spatial

In [3]: x = np.random.rand(int(1e7), 3)

In [4]: %time tree = scipy.spatial.cKDTree(x, leafsize=100)
CPU times: user 38 s, sys: 64 ms, total: 38.1 s
Wall time: 38.1 s

@rabernat
Contributor

rabernat commented Nov 8, 2016

My cKDTree times were:

- 19.2 s on a 32-core Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz, 512 GB RAM
- 23 s on my MacBook (1.7 GHz Intel Core i7)

@stale

stale bot commented Jan 25, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Jan 25, 2019
@stale stale bot closed this as completed Feb 24, 2019
@shoyer shoyer reopened this Feb 24, 2019
@stale stale bot removed the stale label Feb 24, 2019
@TomAugspurger
Contributor

Should this and #1650 be consolidated into a single issue? I think they're duplicates of each other.
