Optimized datashader aggregation of NdOverlays #1430

philippjfr · 2017-05-11T00:58:45Z

This PR provides major optimizations when using the datashader operations to aggregate multiple objects in an NdOverlay using the count, sum, and mean operations. Each Element is aggregated separately and the individual aggregates are summed. A small complication is that NaNs have to be replaced by zeros and masked at the end. mean is supported by dividing sum and count aggregates. This avoids the large memory and performance overhead of concatenating multiple dataframes together. I'm still working on adding an optimization for count_cat but it should also be fairly straightforward.
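The combination step described above can be sketched in plain NumPy. This is only an illustration, not the code in this PR: the function names `combine_sum_aggregates` and `combine_mean_aggregates` are hypothetical, and it assumes each per-Element aggregate is a 2D array on the same grid, with NaN marking empty bins in sum aggregates and 0 marking empty bins in count aggregates.

```python
import numpy as np


def combine_sum_aggregates(aggregates):
    """Sum per-element aggregates, treating NaN (empty bin) as zero.

    Bins that are NaN in every input are masked back to NaN at the end.
    Hypothetical helper for illustration only.
    """
    total = None
    empty = None
    for agg in aggregates:
        agg = np.asarray(agg, dtype=float)
        nan_mask = np.isnan(agg)
        filled = np.where(nan_mask, 0.0, agg)  # replace NaNs by zeros
        if total is None:
            total, empty = filled, nan_mask
        else:
            total = total + filled
            empty = empty & nan_mask  # still empty only if empty everywhere
    if total is None:
        raise ValueError("need at least one aggregate")
    return np.where(empty, np.nan, total)  # mask at the end


def combine_mean_aggregates(sum_aggregates, count_aggregates):
    """Mean over all elements, computed as total sum divided by total count."""
    total_sum = combine_sum_aggregates(sum_aggregates)
    total_count = np.sum(
        [np.asarray(c, dtype=float) for c in count_aggregates], axis=0
    )
    with np.errstate(divide="ignore", invalid="ignore"):
        mean = total_sum / total_count
    mean[total_count == 0] = np.nan  # no samples in this bin
    return mean
```

Because only the small, fixed-size aggregate arrays are accumulated, the underlying dataframes never have to be concatenated.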

jbednar · 2017-05-11T02:05:58Z

Excellent, thanks! I'm too tired to try to parse the code, but is it using the ability of datashader to compute multiple aggregations in a single pass?

philippjfr · 2017-05-11T02:07:05Z

> Excellent, thanks! I'm too tired to try to parse the code, but is it using the ability of datashader to compute multiple aggregations in a single pass?

No, but even so it's still faster, which is perhaps a bit surprising. I'll do some more profiling tomorrow.

philippjfr · 2017-05-11T02:11:09Z

I was wrong: it's slightly slower, but once the concatenation step is included it still wins out massively on both performance and memory load.

philippjfr · 2017-05-11T02:18:03Z

That means it's perhaps still worth optimizing get_agg_data directly to ensure it only has to be done once. It's particularly worth testing how well concatenating multiple dask dataframes performs.

philippjfr · 2017-05-11T02:52:28Z

count_cat now optimized as well.

philippjfr · 2017-05-11T14:00:27Z

Here are some benchmarks. The data are 12 curves of increasing length, where 1 minute is equivalent to 60000*60 samples. The four conditions compare line aggregation of multiple curves either by summing the aggregates (the new approach) or by aggregating over concatenated curves separated by NaNs.

You can see that the new approach is generally slightly slower than aggregating over already concatenated lines, but it scales much better when using dask.
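For reference, the "concatenated curves separated by NaNs" baseline can be sketched as follows. This is an assumption-laden illustration rather than the benchmark code: `concat_with_nan_separators` is a hypothetical name, and each curve is assumed to be an `(n, 2)` array of x/y samples. The NaN row between curves is what keeps a line renderer from connecting the end of one curve to the start of the next.

```python
import numpy as np


def concat_with_nan_separators(curves):
    """Join (n_i, 2) arrays of x/y samples into one array.

    A single all-NaN row is placed between consecutive curves so that
    they are treated as separate line segments. Hypothetical helper for
    illustration; raises ValueError if `curves` is empty.
    """
    separator = np.full((1, 2), np.nan)
    pieces = []
    for i, curve in enumerate(curves):
        if i:  # no separator before the first curve
            pieces.append(separator)
        pieces.append(np.asarray(curve, dtype=float))
    return np.concatenate(pieces, axis=0)
```

This copy of every sample into one large array is the step whose memory and time cost the summed-aggregate approach avoids.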

jlstevens · 2017-05-12T12:53:13Z

@philippjfr Thanks for fixing the warning!

Is it now ready to merge or is there something else you wish to do first?

philippjfr · 2017-05-12T12:57:38Z

Yes, this is ready to merge now. Further optimizations can come in later PRs.

jlstevens · 2017-05-12T13:00:36Z

Great! Merging.

philippjfr added 4 commits May 11, 2017 15:11

Optimized datashader aggregation of NdOverlays

c50477d

Small bug fixes for optimized NdOverlay aggregation

ddd36d4

Added support for optimizing count_cat on NdOverlay

6fe8d36

Fixes for optimized count_cat aggregation

e898c0c

philippjfr force-pushed the datashader_ndoverlay_opt branch from 744548d to e898c0c May 11, 2017 14:12

Fixed invalid param warnings in optimized datashader aggregate

71201cb

jlstevens merged commit a99833f into master May 12, 2017

philippjfr deleted the datashader_ndoverlay_opt branch May 25, 2017 11:40
