Added dask data interface #974
Conversation
Just have to add a docstring and update the tests:
Cool!!!!!!
Force-pushed from dc0e11f to d0c84d6
Buildbot still failing to update reference data for some reason:
Not sure what that buildbot message could be about. I don't quite understand the task execution plot; is there some reason the core numbers keep going up? Seems like it's only ever using 8 cores, but then for some reason which 8 it is changes over time? Confusing!
Hopefully @jlstevens can figure it out ;-)
Tbh I don't quite understand that bit either; it's nice to watch while it's executing, though.
Force-pushed from 8937d8b to 9c80879
empty.loc[0, :] = (np.NaN,) * empty.shape[1]
paths = [elem for path in paths for elem in (path, empty)][:-1]
datasets = [Dataset(p) for p in paths]
if isinstance(paths[0], dd.DataFrame):
`isinstance` checks over data formats are just the sort of thing interfaces are supposed to handle for you. I am hoping we can get rid of these `isinstance` checks, perhaps by using the appropriate utility to select the right interface based on the data type (i.e. whatever dataframe type it happens to be)?
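To illustrate the suggestion, here is a minimal sketch of what such a dispatch utility could look like; `get_interface`, the registry layout, and the interface class names are all hypothetical, not HoloViews' actual internals:

```python
import pandas as pd
import dask.dataframe as dd


class Interface(object):
    """Base class: concrete interfaces register themselves here."""
    registry = []

    @classmethod
    def register(cls, interface):
        cls.registry.append(interface)


class PandasInterface(Interface):
    @classmethod
    def applies(cls, data):
        return isinstance(data, pd.DataFrame)


class DaskInterface(Interface):
    @classmethod
    def applies(cls, data):
        return isinstance(data, dd.DataFrame)


def get_interface(data):
    """Return the first registered interface that can handle this data."""
    for interface in Interface.registry:
        if interface.applies(data):
            return interface
    raise TypeError("No interface for %s" % type(data).__name__)


Interface.register(PandasInterface)
Interface.register(DaskInterface)

# callers dispatch through the utility instead of testing types inline
iface = get_interface(pd.DataFrame({'a': [1, 2]}))
```

With a registry like this, the type checks live in one place inside each interface, and calling code never needs to know which dataframe type it was handed.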
I've reviewed this PR and for the most part I am happy with it as a step towards proper dask support. For this PR, I feel the new code in …
Looks good! Merging.
This PR adds an interface for dask DataFrames, making it possible to work with very large out-of-core dataframes. The interface is almost complete, with some notable exceptions:

- The `sort` method simply warns and continues.
- `aggregate` and `reduce` will fail.
- `sort=False` is not supported on aggregations, meaning that the aggregation groups are sorted and do not preserve the same order as other interfaces.
- `add_dimension` will error when supplied a non-scalar value.

Otherwise the full dataset test suite is being run against the interface, so it all seems to be working.
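To make the group-ordering caveat concrete, here is a minimal runnable sketch (toy data, invented column names) contrasting pandas' `sort=False` ordering with the sorted groups a dask aggregation produces:

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'key': ['b', 'a', 'b', 'a'], 'val': [1, 2, 3, 4]})
ddf = dd.from_pandas(df, npartitions=2)

# pandas can preserve first-appearance order: 'b' before 'a'
print(df.groupby('key', sort=False)['val'].mean())

# the dask aggregation sorts the group keys ('a' before 'b'), so the
# dask interface cannot preserve the same ordering as other interfaces
print(ddf.groupby('key')['val'].mean().compute())
```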
Here is an example loading a 1.1GB CSV file and generating a DynamicMap of datashaded images grouped by the origin of the flights. In this example only the flight origins have to be loaded to apply the groupby. The aggregated data is not loaded until the first datashaded plot is displayed:
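The original notebook code isn't reproduced here, but a pipeline along these lines could look roughly like the sketch below; the file name, the column names, and the exact `.to` signature are assumptions and may differ between HoloViews versions:

```python
import dask.dataframe as dd
import holoviews as hv
from holoviews.operation.datashader import datashade  # module path in recent HoloViews

hv.extension('bokeh')

# lazily open the CSV: nothing is read into memory yet
ddf = dd.read_csv('flights.csv')  # hypothetical 1.1GB file

dataset = hv.Dataset(ddf)

# convert to Points grouped by flight origin; with dynamic=True this
# returns a DynamicMap, so only the 'origin' column is scanned up front
points = dataset.to(hv.Points, ['lon', 'lat'], [], 'origin', dynamic=True)

# datashade rasterizes each group lazily, when the plot is first displayed
datashade(points)
```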
And here is an example execution graph from a fairly complex expression, computing the mean velocity for every flight callsign originating in Algeria.
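That expression isn't shown above; a hedged sketch of what it might have looked like, with `origin_country`, `callsign`, and `velocity` as assumed column names:

```python
import dask.dataframe as dd

ddf = dd.read_csv('flights.csv')  # same hypothetical file as above

# filter to flights originating in Algeria, then take the mean velocity
# per callsign; everything stays lazy until compute() is called
algeria = ddf[ddf.origin_country == 'Algeria']
mean_velocity = algeria.groupby('callsign').velocity.mean()

mean_velocity.visualize()         # draw the task graph (requires graphviz)
result = mean_velocity.compute()  # execute the graph out of core
```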
And here's a task execution plot from a datashader aggregation executed on two remote workers:
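For context, pointing dask at remote workers goes through `dask.distributed`; a minimal sketch with a placeholder scheduler address:

```python
from dask.distributed import Client

# connect to an already-running scheduler that has two workers attached;
# the address is a placeholder for this sketch
client = Client('tcp://scheduler-host:8786')

# from here on, dask collection operations (including the datashader
# aggregations above) execute on those workers
```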
Overall this will do for us what xarray/iris have done for gridded data: let us lazily load columnar data, extending our reach to datasets larger than a few gigabytes.