
Added dask data interface #974

Merged
merged 8 commits into from
Nov 16, 2016
Conversation

philippjfr
Member

@philippjfr philippjfr commented Nov 5, 2016

This PR adds an interface for Dask DataFrames, making it possible to work with very large out-of-core dataframes. The interface is nearly complete, with some notable exceptions:

  1. Dask dataframes do not support sorting, so the sort method simply warns and continues.
  2. Not all functions can easily be applied to a dask dataframe, so some functions applied via aggregate and reduce will fail.
  3. Dask does not support setting sort=False on aggregations, meaning the aggregated groups are sorted and do not preserve the same order as other interfaces.
  4. Dask does not easily support adding a new column to an existing dataframe unless it is a scalar, so add_dimension will error when supplied a non-scalar value.
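
The group-ordering difference in point 3 can be seen with plain pandas, where sort=False preserves first-appearance order, a behaviour dask's aggregations could not replicate at the time. A minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'origin':   ['Spain', 'Algeria', 'Spain', 'France'],
                   'velocity': [220, 250, 240, 230]})

# Default: groups come back sorted by key
sorted_groups = df.groupby('origin')['velocity'].mean()
print(list(sorted_groups.index))   # ['Algeria', 'France', 'Spain']

# sort=False preserves first-appearance order; dask aggregations
# always behave like the sorted case above
ordered_groups = df.groupby('origin', sort=False)['velocity'].mean()
print(list(ordered_groups.index))  # ['Spain', 'Algeria', 'France']
```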

Otherwise, the full dataset test suite is run against the interface, and everything appears to be working.

Here is an example loading a 1.1GB CSV file and generating a DynamicMap of datashaded images grouped by the origin of the flights. In this example only the flight origins have to be loaded to apply the groupby. The aggregated data is not loaded until the first datashaded plot is displayed:

%%timeit -r 1 -n 1
df = pd.read_csv('../apps/opensky.csv')
dataset = hv.Dataset(df, vdims=['velocity'])
groups = dataset.to(hv.Points, ['longitude', 'latitude'], [], ['origin'], dynamic=True)
shaded_origins = datashade(groups)

1 loop, best of 1: 22 s per loop

%%timeit -r 1 -n 1
df = dd.read_csv('../apps/opensky.csv', blocksize=50000000)
dataset = hv.Dataset(df, vdims=['velocity'])
groups = dataset.to(hv.Points, ['longitude', 'latitude'], [], ['origin'], dynamic=True)
shaded_origins = datashade(groups)

1 loop, best of 1: 8.19 s per loop
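
The speedup above comes from dask reading and processing the CSV in partitions rather than loading it whole. The idea can be sketched in plain pandas with chunked reading (toy in-memory data standing in for the real 1.1GB file):

```python
import io
import pandas as pd

# Small in-memory stand-in for the opensky.csv file
csv = io.StringIO("origin,velocity\n" + "\n".join(
    f"{o},{v}" for o, v in [('Algeria', 250), ('Spain', 220)] * 50))

# Process the file in fixed-size chunks rather than loading it whole,
# roughly analogous to dask's blocksize-based partitioning
total, count = 0.0, 0
for chunk in pd.read_csv(csv, chunksize=16):
    total += chunk['velocity'].sum()
    count += len(chunk)

print(total / count)  # mean velocity, computed without holding all rows at once
```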

And here is an example execution graph from a fairly complex expression, computing the mean velocity for every flight callsign originating in Algeria.

hv.Dataset(dd.read_csv('../apps/opensky.csv', blocksize=50000000), vdims=['velocity'])\
.select(origin='Algeria').aggregate(['icao24'], np.mean).data.visualize()

[Image: dask task execution graph]
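
For reference, a rough eager pandas equivalent of that select-then-aggregate expression (toy data; column names taken from the snippet above):

```python
import pandas as pd

df = pd.DataFrame({'origin':   ['Algeria', 'Algeria', 'Spain'],
                   'icao24':   ['abc123', 'abc123', 'def456'],
                   'velocity': [250.0, 260.0, 220.0]})

# select(origin='Algeria').aggregate(['icao24'], np.mean), computed eagerly
result = (df[df['origin'] == 'Algeria']
          .groupby('icao24')['velocity']
          .mean())
print(result['abc123'])  # 255.0
```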

And here's a task execution plot from a datashader aggregation executed on two remote workers:

[Screenshot: task execution plot, 2016-11-06]

Overall, this will do for columnar data what xarray/iris have done for gridded data, letting us lazily load columnar datasets and extending our reach to data larger than a few gigabytes.

@philippjfr
Member Author

Just have to add a docstring and update the tests:

======================================================================
FAIL: test_Columnar_Data_data_002 (__main__.NBTester)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/ioam/holoviews/doc/nbpublisher/nbtest.py", line 512, in data_comparison
    raise e
AssertionError: ['array', 'dataframe', 'dictionary', 'grid', 'ndelement', 'cube', 'xarray', 'dask'] != ['array', 'dataframe', 'dictionary', 'grid', 'ndelement', 'cube', 'xarray']

@jbednar
Member

jbednar commented Nov 5, 2016

Cool!!!!!!

@philippjfr philippjfr added the type: feature A major new feature label Nov 5, 2016
@philippjfr
Member Author

Buildbot still failing to update reference data for some reason:

You asked to amend the most recent commit, but doing so would make
it empty. You can repeat your command with --allow-empty, or you can
remove the commit entirely with "git reset HEAD^".

@jbednar
Member

jbednar commented Nov 7, 2016

Not sure what that buildbot message could be about.

I don't quite understand the task execution plot; is there some reason the core numbers keep going up? Seems like it's only ever using 8 cores, but then for some reason which 8 it is changes over time? Confusing!

@philippjfr
Member Author

> Not sure what that buildbot message could be about.

Hopefully @jlstevens can figure it out ;-)

> I don't quite understand the task execution plot; is there some reason the core numbers keep going up? Seems like it's only ever using 8 cores, but then for some reason which 8 it is changes over time? Confusing!

Tbh I don't quite understand that bit either, it's nice to watch while it's executing though.

empty.loc[0, :] = (np.NaN,) * empty.shape[1]
paths = [elem for path in paths for elem in (path, empty)][:-1]
datasets = [Dataset(p) for p in paths]
if isinstance(paths[0], dd.DataFrame):
Contributor


isinstance checks over data formats are just the sort of thing interfaces are supposed to handle for you. I am hoping we can get rid of these isinstance checks, perhaps by using the appropriate utility to select the right interface based on the data type (i.e. whatever dataframe type it happens to be)?
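
The dispatch-by-registry idea suggested here can be sketched as follows; `INTERFACES`, `PandasInterface`, and `DaskInterface` are illustrative names, not the actual HoloViews API, and plain lists stand in for the dataframe types:

```python
# Hypothetical interface classes; in practice these would wrap
# pd.DataFrame and dd.DataFrame operations
class PandasInterface:
    @staticmethod
    def concat(datasets): return 'pandas-concat'

class DaskInterface:
    @staticmethod
    def concat(datasets): return 'dask-concat'

# Registry mapping data types to interface classes
INTERFACES = {}

def register(cls, interface):
    INTERFACES[cls] = interface

def interface_for(data):
    # Walk the MRO so subclasses of registered types also dispatch correctly
    for cls in type(data).__mro__:
        if cls in INTERFACES:
            return INTERFACES[cls]
    raise TypeError(f"No interface for {type(data).__name__}")

register(list, PandasInterface)  # stand-in for registering pd.DataFrame
print(interface_for([1, 2]).concat(None))  # 'pandas-concat'
```

Callers then never test concrete types themselves; adding a new data format only requires registering one more interface.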

@jlstevens
Contributor

I've reviewed this PR and, for the most part, I am happy with it as a step towards proper dask support.

For this PR, I feel the new code in get_agg_data can probably be improved to avoid the use of isinstance. Other than that, we discussed two other changes (not to be implemented in this PR) that would complete the dask interface:

  1. Using __nonzero__ (Python 2) and __bool__ (Python 3) instead of __len__ to check 'truthiness'.
  2. Completing some of the missing methods (indicated where SkipTest has been used in the unit tests), specifically those to do with adding dimension values and boolean indexing. If the sorting methods can't be supported, I feel that would be an acceptable limitation.
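
Point 1 can be illustrated with a toy lazily sized container: defining `__bool__` (with a `__nonzero__` alias for Python 2) lets truthiness checks succeed cheaply without forcing a full count. Illustrative names only, not the actual interface code:

```python
class LazyColumns:
    """Toy container whose length is expensive to compute."""
    def __init__(self, parts):
        self.parts = parts

    def __len__(self):
        # Potentially expensive for out-of-core data: forces a full count
        return sum(len(p) for p in self.parts)

    def __bool__(self):
        # Cheap truthiness: any non-empty partition means non-empty data
        return any(len(p) > 0 for p in self.parts)

    __nonzero__ = __bool__  # Python 2 alias

data = LazyColumns([[1, 2], []])
print(bool(data), len(data))  # True 2
```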

@jlstevens
Contributor

Looks good! Merging.

@jlstevens jlstevens merged commit 3ba0b42 into master Nov 16, 2016
@philippjfr philippjfr deleted the dask_interface branch December 10, 2016 23:42
@philippjfr philippjfr added this to the v1.7.0 milestone Jan 14, 2017