Optimized datashader aggregation of NdOverlays #1430

Merged
jlstevens merged 5 commits into master from datashader_ndoverlay_opt
May 12, 2017

philippjfr
Member

This PR provides major optimizations when using the datashader operations to aggregate multiple objects in an NdOverlay with the count, sum, and mean operations. Each Element is aggregated separately and the individual aggregates are then summed. A small complication is that NaNs have to be replaced by zeros before summing and masked again at the end. mean is supported by dividing the sum aggregate by the count aggregate. This avoids the large memory and performance overhead of concatenating multiple dataframes. I'm still working on an optimization for count_cat, but it should also be fairly straightforward.
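The accumulation described above can be sketched with plain NumPy arrays. This is only an illustration under stated assumptions, not the code in this PR: the helper names `sum_aggregates` and `mean_aggregate` are hypothetical, each aggregate is assumed to be a 2D float array in which NaN marks a pixel that received no data, and count aggregates are assumed to use 0 rather than NaN for empty pixels.

```python
import numpy as np

def sum_aggregates(aggs):
    """Sum per-element 2D aggregates in which NaN marks an empty pixel.

    Hypothetical helper: NaNs are replaced by zeros for the summation, and
    pixels that were empty in *every* input are set back to NaN at the end.
    """
    total = np.zeros_like(aggs[0], dtype=float)
    empty = np.ones(aggs[0].shape, dtype=bool)
    for agg in aggs:
        nan = np.isnan(agg)
        total += np.where(nan, 0.0, agg)  # treat empty pixels as zero
        empty &= nan                      # stays True only if empty everywhere
    total[empty] = np.nan                 # restore the mask
    return total

def mean_aggregate(sums, counts):
    """Compute a mean aggregate by dividing summed sums by summed counts."""
    total_sum = sum_aggregates(sums)
    total_count = np.sum(counts, axis=0)
    # Empty pixels give NaN / 0, which is NaN; silence the warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        return total_sum / total_count

# Two small per-element sum aggregates and their matching count aggregates.
nan = np.nan
sum_a = np.array([[1.0, nan], [nan, nan]])
sum_b = np.array([[2.0, 3.0], [nan, nan]])
count_a = np.array([[1, 0], [0, 0]])
count_b = np.array([[2, 1], [0, 0]])

total = sum_aggregates([sum_a, sum_b])
mean = mean_aggregate([sum_a, sum_b], [count_a, count_b])
```

Only one output-sized accumulator and one boolean mask are held at any time, which is where the memory saving over concatenating the input dataframes would come from.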

@jbednar
Member

jbednar commented May 11, 2017

Excellent, thanks! I'm too tired to try to parse the code, but is it using the ability of datashader to compute multiple aggregations in a single pass?

@philippjfr
Member Author

> Excellent, thanks! I'm too tired to try to parse the code, but is it using the ability of datashader to compute multiple aggregations in a single pass?

No. Even so, it's still faster, which is perhaps a bit surprising. I'll do some more profiling tomorrow.

@philippjfr
Member Author

I was wrong: it's slightly slower, but once the concatenation step is included it still wins out massively on both performance and memory load.

@philippjfr
Member Author

This means it's perhaps still worth optimizing get_agg_data directly to ensure it only has to be done once. It is particularly worth testing how well concatenating multiple dask dataframes performs.

@philippjfr
Member Author

count_cat is now optimized as well.

@philippjfr
Member Author

Here are some benchmarks. The data are 12 curves of increasing length, where 1 minute is equivalent to 60000*60 (3.6 million) samples. The four conditions compare line aggregation of multiple curves either by summing the aggregates (the new approach) or by aggregating over concatenated curves separated by NaNs.

[Benchmark plots: bokeh_plot 47, bokeh_plot 53]

You can see that the new approach is generally slightly slower than aggregating over already concatenated lines, but it scales much better when using dask.
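For reference, the baseline condition in these benchmarks (aggregating over concatenated curves separated by NaNs) needs the curves joined into one long array first. A minimal sketch of that step, assuming each curve is an `(x, y)` pair of 1D arrays; the helper name `concat_with_nan_breaks` is hypothetical and is not a HoloViews or datashader function:

```python
import numpy as np

def concat_with_nan_breaks(curves):
    """Join (x, y) curves into single arrays with one NaN sample between curves.

    Hypothetical helper: the NaN sample acts as a separator so that the last
    point of one curve is not connected to the first point of the next.
    """
    xs, ys = [], []
    for x, y in curves:
        xs.extend([np.asarray(x, dtype=float), [np.nan]])
        ys.extend([np.asarray(y, dtype=float), [np.nan]])
    # Drop the trailing separator after the final curve.
    return np.concatenate(xs[:-1]), np.concatenate(ys[:-1])

curve_a = (np.arange(3), np.array([0.0, 1.0, 0.5]))
curve_b = (np.arange(2), np.array([2.0, 2.5]))
x_all, y_all = concat_with_nan_breaks([curve_a, curve_b])
```

The joined arrays are a full copy of every curve plus the separators, which is the extra memory and time that summing per-curve aggregates avoids.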

@jlstevens
Contributor

@philippjfr Thanks for fixing the warning!

Is it now ready to merge or is there something else you wish to do first?

@philippjfr
Member Author

Yes, this is ready to merge now. Further optimizations can come in later PRs.

@jlstevens
Contributor

Great! Merging.

@jlstevens jlstevens merged commit a99833f into master May 12, 2017
@philippjfr philippjfr deleted the datashader_ndoverlay_opt branch May 25, 2017 11:40