
[GOAL] Support viewing of medium, multi-channel timeseries data #6058

Closed
3 of 4 tasks
droumis opened this issue Jan 5, 2024 · 7 comments
Labels
TRIAGE Needs triaging

Comments

droumis (Member) commented Jan 5, 2024

Is your feature request related to a problem? Please describe.

Typical stacked-timeseries use cases involve enough lines and samples that the data must be aggregated or downsampled before it can be sent to the browser. Currently, due to performance limitations, the standard HoloViews+Bokeh approach to this visualization with subcoordinate_y is only usable for a small part of the typical data-size range.

Describe the solution you'd like

Let's aim to make it not just possible, but smooth and performant, to visualize and interact with:
Medium Size Dataset: too big for the browser but fits in memory. For instance, a data-size target of 100 stacked traces, each with 1,000 (16-bit) samples per second for 10,000 seconds. That's one billion samples and about 2 GB.
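As a quick back-of-the-envelope check, the stated size target works out as follows:

```python
traces = 100
rate_hz = 1_000          # samples per second per trace
duration_s = 10_000
bytes_per_sample = 2     # 16-bit samples

samples = traces * rate_hz * duration_s          # total sample count
size_gb = samples * bytes_per_sample / 1e9       # raw size in GB

print(samples, size_gb)  # 1_000_000_000 samples, 2.0 GB
```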

Task List (Updated):

Medium Size

Notes for Medium Size:

  • @philippjfr suggested that our implementation of LTTB downsampling would not scale sufficiently well for large datasets. This suggestion is backed up by comments in the plotly resampler: "for large datasets, [LTTB] can be much slower than other algorithms (e.g. MinMax) due to the higher cost of calculating the areas of triangles...LTTB doesn't scale super-well when moving to really large datasets, so when dealing with more than 1 million samples, you might consider using [MinMaxLTTB][aggregation.aggregators.MinMaxLTTB]"
  • So either we could try an implementation of MinMaxLTTB (or even rely on the Rust implementation in tsdownsample, as plotly-resampler does; note, the author of tsdownsample had offered to help!), ...
  • or we could try to get datashader to play nicely with subcoordinate_y while retaining all the niceties of standard Bokeh interactivity. If we go with Datashader, this would likely entail passing the scale and offset for each trace into the pre-Datashader rendering pipeline step.
  • Update: Implement HoloViews' use of indexing for single-shot slicing of a wide DataFrame (big performance impact) and then use MinMaxLTTB (less of a performance impact).
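For reference, the MinMax aggregation idea mentioned above is simple enough to sketch in a few lines of NumPy. This is an illustrative sketch, not tsdownsample's API; the real implementation is a heavily optimized (Rust/SIMD) version of the same principle that also handles x-values, NaNs, etc.:

```python
import numpy as np

def minmax_downsample(y, n_out):
    """Keep the min and max sample from each of n_out // 2 equal bins.

    Simplified sketch of the MinMax aggregator idea; returns indices
    into y so extreme values survive downsampling.
    """
    n_bins = n_out // 2
    bins = np.array_split(np.arange(len(y)), n_bins)
    idx = []
    for b in bins:
        lo = b[np.argmin(y[b])]   # index of bin minimum
        hi = b[np.argmax(y[b])]   # index of bin maximum
        idx.extend(sorted((lo, hi)))
    return np.array(idx)

y = np.sin(np.linspace(0, 20, 10_000))
idx = minmax_downsample(y, n_out=200)   # 10,000 samples -> 200 points
```

Because each bin contributes its extreme values, the global min and max of the series are always preserved, which is what keeps spikes visible after downsampling.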
@droumis droumis added the TRIAGE Needs triaging label Jan 5, 2024
@droumis droumis moved this to Todo in CZI R5 neuro Jan 5, 2024
droumis (Member, Author) commented Jan 5, 2024

After talking with @philippjfr, the updated task is to first try to utilize tsdownsample directly (if it is available). Then we'll have LTTB and MinMaxLTTB available to us and we can check if this is sufficiently performant for our use cases. If not, we can then explore other options with Datashader.

philippjfr (Member) commented:

Note that the precise downsampling implementation doesn't seem to matter much at all because most of the time is dominated by the slicing step, i.e. selecting the data within the viewport.

philippjfr (Member) commented:

Having played with it some more I think the only way to support this workflow better is to add in an optimization for wide dataframes. Specifically if you create an NdOverlay of Curve elements from a DataFrame with columns A, B, C we need to make sure that all three Curve elements share the same underlying DataFrame, and the downsample operation should detect that, slice the DataFrame based on the current viewport and then apply the downsampling to that pre-sliced data. This will massively speed up downsampling for large numbers of traces.
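The idea can be roughly sketched in plain pandas/NumPy (hypothetical names throughout; `nth_point` is a trivial stand-in for the real LTTB/MinMaxLTTB aggregators): slice the shared wide DataFrame once for the viewport, then downsample each column from the pre-sliced data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100_000, 3)), columns=list("ABC"))
df["x"] = np.linspace(0, 100, len(df))

x0, x1 = 20, 30                            # current viewport
view = df[(df.x >= x0) & (df.x <= x1)]     # 1 * slicing, shared by all traces

def nth_point(col, n_out=500):
    """Stand-in downsampler: take every k-th sample of the pre-sliced data."""
    step = max(1, len(col) // n_out)
    return col.iloc[::step]

curves = {c: nth_point(view[c]) for c in "ABC"}   # N * downsampling
```

Without the shared-frame detection, each of the N elements would repeat the viewport slice on its own copy, which is the N*slicing cost the comment above describes.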

philippjfr (Member) commented:

This is probably a pre-requisite to get the above mentioned workflows working well: #6061

philippjfr (Member) commented:

Okay, just to capture what I think needs to happen to support this workflow well. Currently the cost of the operation can be broken down into N*slicing + N*downsampling (where N is the number of traces).

1. Shared Data Slicing

The downsample1d operation should check if all elements in an NdOverlay share the same underlying DataFrame, if so they should slice the data once. This will reduce the cost to 1*slicing + N*downsampling.

This optimization itself is relatively easy to achieve, and I'll add it to the existing PR adding optional tsdownsample support; the bigger obstacle is in hvPlot, which will be the primary entrypoint in many/most cases. Specifically, we would want this optimization to apply when working with a wide DataFrame, e.g.:

import pandas as pd
import numpy as np
import hvplot.pandas  # noqa

# Random wide DataFrame with columns A-D
# (pd._testing.makeDataFrame was removed in pandas 2.2)
df = pd.DataFrame(np.random.randn(30, 4), columns=list("ABCD"))

df.hvplot.line(downsample=True)

The problem here is that internally hvPlot generates copies of the DataFrame, renaming each column in turn from its original name (here A, B, C, D) to the value_label (here the default, value). This means that the optimization won't bite. I have not yet figured out what the right approach should be.
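A toy illustration of why this defeats the shared-frame check (hypothetical code, not hvPlot's actual internals): after the per-column renames, no two elements reference the same DataFrame object, so an identity-based check cannot detect that the data came from one frame.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
value_label = "value"  # hvPlot's default value-dimension name

# One renamed single-column copy per column, mimicking the behavior
# described above.
per_column = {
    col: df[[col]].rename(columns={col: value_label})
    for col in df.columns
}

# Each entry is a distinct DataFrame object.
distinct = {id(v) for v in per_column.values()}
```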

2. Pandas Index Slicing

Slicing on a Pandas index is (significantly) faster than slicing on a column, therefore we should allow HoloViews to operate directly on a DataFrame with an index (instead of dropping the index as we do now). This work was started in #6061. This is likely the highest effort but also has the largest benefits beyond this particular workflow.
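The difference between the two slicing paths can be sketched with plain pandas (illustrative only; the real gain shows up at much larger sizes and with datetime indexes): a column selection builds a boolean mask over every row, while a sorted-index `.loc` slice is a binary search.

```python
import numpy as np
import pandas as pd

n = 1_000_000
df_col = pd.DataFrame({"time": np.arange(n), "y": np.random.randn(n)})
df_idx = df_col.set_index("time")  # sorted index enables binary-search slicing

# Column slicing: evaluates a boolean mask over all n rows
sel_col = df_col[(df_col.time >= 1000) & (df_col.time < 2000)]

# Index slicing: .loc on a sorted index; note label slices are inclusive
sel_idx = df_idx.loc[1000:1999]
```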

3. Optimizing the downsampling

Once we have done 1 (and 2) the cost of the operation will be dominated by the N*downsampling part of the equation. This is the simplest task and is already mostly done in #6059.

@droumis droumis assigned hoxbro and unassigned maximlt Jan 23, 2024
@droumis droumis changed the title Visualize (long and many) stacked timeseries Visualize (long and many) multi-channel timeseries Apr 2, 2024
@droumis droumis changed the title Visualize (long and many) multi-channel timeseries GOAL: Visualize (long and many) multi-channel timeseries Apr 5, 2024
@droumis droumis changed the title GOAL: Visualize (long and many) multi-channel timeseries [GOAL] Visualize (long and many) multi-channel timeseries Apr 5, 2024
@droumis droumis changed the title [GOAL] Visualize (long and many) multi-channel timeseries [GOAL] Support for viewing medium-size multi-channel timeseries data Apr 5, 2024
@droumis droumis changed the title [GOAL] Support for viewing medium-size multi-channel timeseries data [GOAL] Support viewing medium-size multi-channel timeseries data Apr 5, 2024
@droumis droumis changed the title [GOAL] Support viewing medium-size multi-channel timeseries data [GOAL] Support viewing of medium, multi-channel timeseries data Apr 5, 2024
@droumis droumis removed the status in CZI R5 neuro Apr 5, 2024
droumis (Member, Author) commented Jun 23, 2024

The remaining task has to do with hvPlot, so I'll close this since the HoloViews aspects are largely complete.

@droumis droumis closed this as completed Jun 23, 2024

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 23, 2024