Supporting out-of-core computation/indexing for very large indexes #1094
For unstructured meshes of points, `pandas.MultiIndex` is not the right abstraction. Suppose you have a (very long) list of points on an unstructured mesh: you need something like a KDTree (see discussion in #475), ideally with nearby points in space stored in contiguous array chunks. I would start by trying to get an in-memory KDTree working, and switch to something out-of-core only when/if necessary. For example, SciPy's cKDTree can load 1e7 points in 3 dimensions in only a few seconds:
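(The timing snippet that followed was lost in extraction; below is a minimal sketch of what such a benchmark might look like, with the array size taken from the 1e7-points figure above.)

```python
import numpy as np
from scipy.spatial import cKDTree

# 1e7 random points in 3 dimensions, as in the figure quoted above.
points = np.random.rand(int(1e7), 3)

# Building the tree in memory takes on the order of a few seconds.
tree = cKDTree(points)

# Nearest-neighbor queries are then cheap.
distance, index = tree.query([0.5, 0.5, 0.5])
```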
That might be good enough.
Yes, I understand that a `pandas.MultiIndex` is not the right abstraction here. My example was actually not complete, since I also have categorical indexes, such as a few regions defined in space (with complex geometries) and node types (e.g., boundary, active, inactive); sorry not to have mentioned that. A KDTree is indeed good for indexing on space coordinates. Looking at the API you suggest in #475, my (2-d) mesh might look like this:
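(The example that followed was also lost; here is a minimal sketch of such a mesh as an xarray Dataset. The coordinate names `x`, `y`, `region`, and `node_type` are invented for illustration.)

```python
import numpy as np
import xarray as xr

n_nodes = 100  # a real mesh would have 1e7+ nodes

ds = xr.Dataset(
    coords={
        # spatial coordinates, candidates for a KDTree-backed index
        "x": ("node", np.random.rand(n_nodes)),
        "y": ("node", np.random.rand(n_nodes)),
        # categorical coordinates along the same dimension
        "region": ("node", np.random.choice(["basin", "coast"], n_nodes)),
        "node_type": ("node", np.random.choice(
            ["boundary", "active", "inactive"], n_nodes)),
    }
)
```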
Anyway, maybe I've opened this issue a bit too early, since my data still fits into memory, though it is likely that I'll have to deal with meshes of 1e8 to 1e9 nodes in the near future. Side note: I don't know why I get much worse performance on my machine when building the KDTree (Intel(R) Xeon(R) CPU x4 5160 @ 3.00GHz, 16 GB RAM, scipy 0.18.1, numpy 1.11.2).
My cKDTree time was:
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.
Should this and #1650 be consolidated into a single issue? I think that they're duplicates of each other.
(Follow-up of the discussion in #1024 (comment)).
xarray + dask.array successfully enable out-of-core computation for very large variables that don't fit in memory. One current limitation is that the indexes of a `Dataset` or `DataArray`, which rely on `pandas.Index`, are still fully loaded into memory (and they will soon be loaded eagerly, after #1024). In many cases this is not a problem, as the sizes of 1-dimensional indexes are usually much smaller than the sizes of n-dimensional variables or coordinates.

However, this may be problematic in some specific cases where we have to deal with very large indexes. As an example, big unstructured meshes often have coordinates (x, y, z) arranged as 1-d arrays whose length equals the number of nodes, which can be very large! (See, e.g., the ugrid conventions.)
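For a quick illustration of this limitation (a toy sketch with made-up sizes, not a real mesh):

```python
import numpy as np
import xarray as xr

# The variable's data is chunked with dask and computed lazily...
da = xr.DataArray(
    np.random.rand(1_000_000),
    dims="node",
    coords={"node": np.arange(1_000_000)},
).chunk({"node": 100_000})

# ...but the index along "node" is still a plain, fully in-memory
# pandas.Index.
print(type(da.indexes["node"]))
```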
It would be very nice if xarray could also help with these use cases. I'm therefore wondering if (and how) out-of-core support can be extended to indexes and indexing.
I've briefly looked at the documentation on `dask.dataframe`, and a first naive approach I have in mind would be to allow partitioning an index into multiple, contiguous indexes. For label-based indexing, we might, for example, map `indexing.convert_label_indexer` to each partition and combine the returned indexers.

My knowledge of dask is very limited, though, so I have no doubt that this suggestion is very simplistic and not very efficient, or that there are better approaches. I'm also certainly missing other issues not directly related to indexing.
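To make that naive idea concrete, here is a rough, self-contained sketch of partitioned label lookup. The partitioning and lookup logic is invented for illustration and does not use xarray internals; `indexing.convert_label_indexer` is the real internal the toy helper stands in for.

```python
import numpy as np
import pandas as pd

def partitioned_get_loc(partitions, label):
    """Locate `label` across a list of contiguous, sorted pandas.Index
    partitions, returning (partition_number, position_in_partition).

    A toy stand-in for mapping something like
    indexing.convert_label_indexer over each partition and combining
    the returned indexers; real partitions would live out of core.
    """
    for i, index in enumerate(partitions):
        # Only probe the partition whose value range can contain the label.
        if index[0] <= label <= index[-1]:
            return i, index.get_loc(label)
    raise KeyError(label)

# One big sorted index, split into four contiguous partitions.
full = np.arange(1000)
partitions = [pd.Index(full[i:i + 250]) for i in range(0, 1000, 250)]

print(partitioned_get_loc(partitions, 640))  # -> (2, 140)
```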
Any thoughts?
cc @shoyer @mrocklin