
[GOAL] Support viewing of medium, multi-channel timeseries data #6058

Closed
3 of 4 tasks
droumis opened this issue Jan 5, 2024 · 7 comments
Labels
TRIAGE Needs triaging

Comments

droumis (Member) commented Jan 5, 2024

Is your feature request related to a problem? Please describe.

Typical stacked-timeseries use cases involve enough lines and samples that the data must be aggregated or downsampled before it can be sent to the browser. Currently, due to performance limitations, the standard HoloViews+Bokeh approach to this visualization with subcoordinate_y is only usable for a small part of the typical data-size range.

Describe the solution you'd like

Let's aim to make it not just possible, but smooth and performant, to visualize and interact with:
Medium Size Dataset: too big for the browser but fits in memory. For instance, a data-size target of 100 stacked traces, each with 1,000 (16-bit) samples per second for 10,000 seconds. That's one billion samples and about 2 GB.
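As a quick back-of-the-envelope check, the stated size target works out as follows:

```python
traces = 100
rate_hz = 1_000          # samples per second per trace
duration_s = 10_000
bytes_per_sample = 2     # 16-bit samples

samples = traces * rate_hz * duration_s          # total sample count
size_gb = samples * bytes_per_sample / 1e9       # raw size in GB

print(samples, size_gb)  # 1_000_000_000 samples, 2.0 GB
```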

Task List (Updated):

Medium Size

Notes for Medium Size:

  • @philippjfr suggested that our implementation of LTTB downsampling would not scale sufficiently well for large datasets. This suggestion is backed up by comments in the plotly resampler: "for large datasets, [LTTB] can be much slower than other algorithms (e.g. MinMax) due to the higher cost of calculating the areas of triangles...LTTB doesn't scale super-well when moving to really large datasets, so when dealing with more than 1 million samples, you might consider using [MinMaxLTTB][aggregation.aggregators.MinMaxLTTB]"
  • So either we could try an implementation of MinMaxLTTB (or even rely on the Rust implementation in tsdownsample, as plotly-resampler does; note, the author of tsdownsample had offered to help!), ...
  • or we could try to get datashader to play nicely with subcoordinate_y while retaining all the niceties of standard Bokeh interactivity. If we go with Datashader, this would likely entail passing the scale and offset for each trace into the pre-Datashader rendering pipeline step.
  • Update: Implement HoloViews' use of indexing for single-shot slicing of a wide DataFrame (big performance impact) and then use MinMaxLTTB (less of a performance impact).
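For reference, the MinMax aggregation idea mentioned above is simple enough to sketch in a few lines of NumPy. This is an illustrative sketch, not tsdownsample's API; the real implementation is a heavily optimized (Rust/SIMD) version of the same principle that also handles x-values, NaNs, etc.:

```python
import numpy as np

def minmax_downsample(y, n_out):
    """Keep the min and max sample from each of n_out // 2 equal bins.

    Simplified sketch of the MinMax aggregator idea; returns indices
    into y so extreme values survive downsampling.
    """
    n_bins = n_out // 2
    bins = np.array_split(np.arange(len(y)), n_bins)
    idx = []
    for b in bins:
        lo = b[np.argmin(y[b])]   # index of bin minimum
        hi = b[np.argmax(y[b])]   # index of bin maximum
        idx.extend(sorted((lo, hi)))
    return np.array(idx)

y = np.sin(np.linspace(0, 20, 10_000))
idx = minmax_downsample(y, n_out=200)   # 10,000 samples -> 200 points
```

Because each bin contributes its extreme values, the global min and max of the series are always preserved, which is what keeps spikes visible after downsampling.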
@droumis droumis added the TRIAGE Needs triaging label Jan 5, 2024
@droumis droumis moved this to Todo in CZI R5 neuro Jan 5, 2024
droumis (Member, Author) commented Jan 5, 2024

After talking with @philippjfr, the updated task is to first try to utilize tsdownsample directly (if it is available). Then we'll have LTTB and MinMaxLTTB available to us and we can check if this is sufficiently performant for our use cases. If not, we can then explore other options with Datashader.

philippjfr (Member) commented:

Note that the precise downsampling implementation doesn't seem to matter much at all because most of the time is dominated by the slicing step, i.e. selecting the data within the viewport.

philippjfr (Member) commented:

Having played with it some more I think the only way to support this workflow better is to add in an optimization for wide dataframes. Specifically if you create an NdOverlay of Curve elements from a DataFrame with columns A, B, C we need to make sure that all three Curve elements share the same underlying DataFrame, and the downsample operation should detect that, slice the DataFrame based on the current viewport and then apply the downsampling to that pre-sliced data. This will massively speed up downsampling for large numbers of traces.
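The idea can be roughly sketched in plain pandas/NumPy (hypothetical names throughout; `nth_point` is a trivial stand-in for the real LTTB/MinMaxLTTB aggregators): slice the shared wide DataFrame once for the viewport, then downsample each column from the pre-sliced data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100_000, 3)), columns=list("ABC"))
df["x"] = np.linspace(0, 100, len(df))

x0, x1 = 20, 30                            # current viewport
view = df[(df.x >= x0) & (df.x <= x1)]     # 1 * slicing, shared by all traces

def nth_point(col, n_out=500):
    """Stand-in downsampler: take every k-th sample of the pre-sliced data."""
    step = max(1, len(col) // n_out)
    return col.iloc[::step]

curves = {c: nth_point(view[c]) for c in "ABC"}   # N * downsampling
```

Without the shared-frame detection, each of the N elements would repeat the viewport slice on its own copy, which is the N*slicing cost the comment above describes.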

philippjfr (Member) commented:

This is probably a pre-requisite to get the above mentioned workflows working well: #6061

philippjfr (Member) commented:

Okay, just to capture what I think needs to happen to support this workflow well. Currently the cost of the operation can be broken down into N*slicing + N*downsampling (where N is the number of traces).

1. Shared Data Slicing

The downsample1d operation should check if all elements in an NdOverlay share the same underlying DataFrame, if so they should slice the data once. This will reduce the cost to 1*slicing + N*downsampling.

This optimization itself is relatively easy to achieve, and I'll add it to the existing PR adding optional tsdownsample support; the bigger obstacle is in hvPlot, which will be the primary entrypoint in many/most cases. Specifically, we would want this optimization to apply when working with a wide DataFrame, e.g.:

import pandas as pd
import numpy as np
import hvplot.pandas  # noqa

# Random wide DataFrame with columns A-D
# (pd._testing.makeDataFrame was removed in pandas 2.2)
df = pd.DataFrame(np.random.randn(30, 4), columns=list("ABCD"))

df.hvplot.line(downsample=True)

The problem here is that internally hvPlot generates copies of the DataFrame, renaming each column in turn from its original name (here A, B, C, D) to the value_label (here the default, value). This means that the optimization won't bite. I have not yet figured out what the right approach should be.
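A toy illustration of why this defeats the shared-frame check (hypothetical code, not hvPlot's actual internals): after the per-column renames, no two elements reference the same DataFrame object, so an identity-based check cannot detect that the data came from one frame.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
value_label = "value"  # hvPlot's default value-dimension name

# One renamed single-column copy per column, mimicking the behavior
# described above.
per_column = {
    col: df[[col]].rename(columns={col: value_label})
    for col in df.columns
}

# Each entry is a distinct DataFrame object.
distinct = {id(v) for v in per_column.values()}
```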

2. Pandas Index Slicing

Slicing on a Pandas index is (significantly) faster than slicing on a column, therefore we should allow HoloViews to operate directly on a DataFrame with an index (instead of dropping the index as we do now). This work was started in #6061. This is likely the highest effort but also has the largest benefits beyond this particular workflow.
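The difference between the two slicing paths can be sketched with plain pandas (illustrative only; the real gain shows up at much larger sizes and with datetime indexes): a column selection builds a boolean mask over every row, while a sorted-index `.loc` slice is a binary search.

```python
import numpy as np
import pandas as pd

n = 1_000_000
df_col = pd.DataFrame({"time": np.arange(n), "y": np.random.randn(n)})
df_idx = df_col.set_index("time")  # sorted index enables binary-search slicing

# Column slicing: evaluates a boolean mask over all n rows
sel_col = df_col[(df_col.time >= 1000) & (df_col.time < 2000)]

# Index slicing: .loc on a sorted index; note label slices are inclusive
sel_idx = df_idx.loc[1000:1999]
```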

3. Optimizing the downsampling

Once we have done 1 (and 2) the cost of the operation will be dominated by the N*downsampling part of the equation. This is the simplest task and is already mostly done in #6059.

@droumis droumis assigned hoxbro and unassigned maximlt Jan 23, 2024
@droumis droumis changed the title Visualize (long and many) stacked timeseries Visualize (long and many) multi-channel timeseries Apr 2, 2024
@droumis droumis changed the title Visualize (long and many) multi-channel timeseries GOAL: Visualize (long and many) multi-channel timeseries Apr 5, 2024
@droumis droumis changed the title GOAL: Visualize (long and many) multi-channel timeseries [GOAL] Visualize (long and many) multi-channel timeseries Apr 5, 2024
@droumis droumis changed the title [GOAL] Visualize (long and many) multi-channel timeseries [GOAL] Support for viewing medium-size multi-channel timeseries data Apr 5, 2024
@droumis droumis changed the title [GOAL] Support for viewing medium-size multi-channel timeseries data [GOAL] Support viewing medium-size multi-channel timeseries data Apr 5, 2024
@droumis droumis changed the title [GOAL] Support viewing medium-size multi-channel timeseries data [GOAL] Support viewing of medium, multi-channel timeseries data Apr 5, 2024
@droumis droumis removed the status in CZI R5 neuro Apr 5, 2024
droumis (Member, Author) commented Jun 23, 2024

The remaining task has to do with hvPlot, so I'll close this since the HoloViews aspects are largely complete.

@droumis droumis closed this as completed Jun 23, 2024

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 23, 2024