Add pipeline property to track data lineage #3967

jonmmease · 2019-09-18T09:54:28Z

Overview

This PR adds a new pipeline property to the Dataset class. This property holds a list of (function, args, kwargs) tuples that represent the sequence of operations needed to transform the Dataset stored in the dataset property into an element equal to the current element.

It also adds a new execute_pipeline method that can evaluate this sequence of functions on an input dataset. This makes it possible to reproduce the same sequence of operations on a new Dataset.
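The replay idea can be sketched in plain Python, independent of HoloViews. The helper name `replay_pipeline` and the toy steps `scale` and `keep_above` are hypothetical; this is a minimal illustration of applying a list of (function, args, kwargs) tuples to a new input, not the implementation in this PR.

```python
def replay_pipeline(pipeline, data):
    """Apply each (function, args, kwargs) step to the running result."""
    result = data
    for function, args, kwargs in pipeline:
        result = function(result, *args, **kwargs)
    return result


# Hypothetical steps standing in for Dataset methods and operations.
def scale(values, factor=1):
    return [v * factor for v in values]


def keep_above(values, threshold):
    return [v for v in values if v > threshold]


pipeline = [
    (scale, [], {"factor": 10}),
    (keep_above, [25], {}),
]

# The same recorded steps can be run on the original input or on a new one.
original = replay_pipeline(pipeline, [1, 2, 3, 4])
subset = replay_pipeline(pipeline, [3, 4, 5])
print(original)  # [30, 40]
print(subset)    # [30, 40, 50]
```

Because each step receives the previous result as its first argument, a constructor such as a class can also serve as a step, which matches the shape of the pipelines shown in the examples below.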

Relationship to other PRs

dataset property

The dataset property was added to the LabelledData class in #3919. This PR moves the dataset property down to the Dataset class, so there is no longer a dataset property on, for example, the Layout class. This reduces the scope of where dataset and pipeline need to be correct / consistent.

Histogram _operation_kwargs

This PR removes all special cases associated with Histogram elements. So the Histogram._operation_kwargs property added in #3921 has been removed.

select all dims

In #3924, the select method is updated to consider all dimensions in the Dataset stored in the element's dataset property. This PR does not do this, and instead provides the execute_pipeline method as a more powerful alternative for achieving the same goal. See examples below.

link_selections

This PR will become a more powerful foundation for the automatic linked selection support being added in #3951.

Example 1: Points

Create a sample 3-dimensional dataset. x and y are independently drawn from the standard normal distribution and r is calculated to be the radius of each point from the origin.

import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import dim
from holoviews.operation.datashader import rasterize, datashade, dynspread 
hv.extension('bokeh')

np.random.seed(1)
df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])

# Add radius column
df['r'] = (df.x ** 2 + df.y ** 2) ** 0.5

ds = hv.Dataset(df)
points = ds.to.points(kdims=['x', 'y'], groupby=[])
points

Display the pipeline for the new points element

points.pipeline

[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,  [],  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []})]

Next, create a new points element by running execute_pipeline on a subset of the dataset stored in points.dataset. Note that it would not be possible to compute this subset using points.select directly because it involves the r dimension, which is not a key or value dimension of points.

points * points.execute_pipeline(points.dataset.select(x=(0, None), r=(0, 1.5)))
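Why the subset has to be taken on the dataset rather than on the element can be shown with a toy stand-in that uses plain dictionaries instead of HoloViews objects. The helpers `to_points` and `select_rows` are hypothetical, and the sample coordinates are made up for illustration; this is a sketch of the idea, not of how select is implemented.

```python
import math

# A toy "dataset" with three columns: x, y, and the derived radius r.
rows = [{"x": x, "y": y, "r": math.hypot(x, y)}
        for x, y in [(0.5, 0.5), (1.0, 2.0), (-1.0, 0.2), (0.3, -1.0)]]


def to_points(dataset_rows):
    """Keep only x and y, like an element that does not carry r."""
    return [(row["x"], row["y"]) for row in dataset_rows]


def select_rows(dataset_rows, **ranges):
    """Keep rows inside each (low, high) range; None means unbounded.

    This toy treats both bounds as inclusive (an assumption of the sketch).
    """
    def inside(value, low, high):
        return (low is None or value >= low) and (high is None or value <= high)

    return [row for row in dataset_rows
            if all(inside(row[col], low, high)
                   for col, (low, high) in ranges.items())]


points = to_points(rows)  # r is no longer available on these tuples
subset_points = to_points(select_rows(rows, x=(0, None), r=(0, 1.5)))
print(subset_points)  # [(0.5, 0.5), (0.3, -1.0)]
```

Filtering first and converting second is the only order that works here, since the converted points no longer carry the r column.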

Example 2: Datashade

Create an RGB image element from points using the datashade and dynspread operations with dynamic=False.

points_rgb = dynspread(datashade(points, dynamic=False), dynamic=False, threshold=0.9)
points_rgb

Display the pipeline for points_rgb

points_rgb.pipeline

[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,  [],  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []}),
 (datashade(...),  [],  {'dynamic': False}),
 (dynspread(...),  [],  {'dynamic': False, 'threshold': 0.9})]

Next, compute a new RGB element by calling the execute_pipeline method with a subset of the original dataset. Note that this is a selection that was not possible using the approach in #3924.

points_rgb + points_rgb.execute_pipeline(points_rgb.dataset.select(x=(0, None), r=(0, 1.5)))

Example 3: Histogram

Next, repeat the same process using a Histogram element created from points.

hist1 = hv.operation.histogram(points, num_bins=10, dynamic=False, normed=False)
hist1

Display pipeline

hist1.pipeline

[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,  [],  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []}),
 (histogram(...),  [],  {'num_bins': 10, 'dynamic': False, 'normed': False})]

Create new Histogram element with execute_pipeline

hist2 = hist1.execute_pipeline(hist1.dataset.select(x=(0, None), r=(0, 1.5))) 
hist1 * hist2

Example 4: Custom aggregation

In this example, create a Bars element from the result of aggregating an original Dataset.

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [2, 1, 3, 0, 10, 4],
                   'c': [0, 0, 0, 1, 1, 1]
                  })
ds = hv.Dataset(df, kdims=['c'], vdims=['a', 'b'])
bars = ds.aggregate('c', function=np.sum).to(hv.Bars)
bars

Display the pipeline

bars.pipeline

[(holoviews.core.data.Dataset,
  [],
  {'kdims': [Dimension('c')], 'vdims': [Dimension('a'), Dimension('b')]}),
 (<function holoviews.core.data.Dataset.aggregate(...)>,
  ['c'],
  {'function': <function numpy.sum(...)>}),
 (holoviews.element.chart.Bars,
  [],
  {'label': '',
   'kdims': [Dimension('c')],
   'vdims': [Dimension('a'), Dimension('b')]})]

Create a new Bars element on a subset of the original dataset using execute_pipeline

bars * bars.execute_pipeline(bars.dataset.select(b=(3, None)))

philippjfr · 2019-09-18T10:03:14Z

This is pretty much exactly what I expected when we discussed this, so I'm very happy to see that it seems to have worked. The _in_method flag is also what I imagined, and it should handle nested method calls. I haven't yet spotted how it works for .apply(operation, ...) calls, though: does the operation get added twice in that case?

jonmmease · 2019-09-18T10:09:11Z

Yeah, thanks for working through the design with me!

In terms of apply, since this is an accessor (not a method) it doesn't cause _in_method to be set.

points.apply(hv.operation.histogram).pipeline

[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,
  [],
  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []}),
 (histogram(...),
  [],
  {'dynamic': False})]

But that does remind me that I should add some tests for apply.

And, there might be a hole here if the thing apply calls is not already an operation. I'll take a look.

jonmmease · 2019-09-18T11:28:06Z

> And, there might be a hole here if the thing apply calls is not already an operation. I'll take a look.

No, I don't think this is a problem. We only need to update the pipeline if the function passed to apply returns a Dataset object, and to do this the function would call out to an operation or call a method on the object.

points.apply(lambda p: p.select(x=(0, None))).pipeline

[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,
  [],
  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []}),
 (<function holoviews.core.data.Dataset.select(self, selection_expr=None, selection_specs=None, **selection)>,
  [],
  {'x': (0, None)})]

Hmm, there are also the opts and redim accessors. Do you think these should be captured in the pipeline?

philippjfr · 2019-09-18T11:33:13Z

> We only need to update the pipeline if the function passed to apply returns a Dataset object, and to do this the function would call out to an operation or call a method on the object.

I frequently write functions that take an object, compute something from it, and then repack a new Dataset. For example, here's an apply function I just wrote for a dashboard I'm writing:

    def get_table(ds):
        arr = ds.array()
        weights = list(zip(stocks.columns, arr[0, 2:])) if len(arr) else []
        return hv.Table(weights, 'Stock', 'Weight').opts(editable=True)

> Hmm, there are also the opts and redim accessors

redim should definitely be captured, since without it the pipeline might be invalid. I have no strong opinion on opts, but for completeness' sake I guess we should do it.

jonmmease · 2019-09-18T11:43:12Z

Ok, yeah. That's a good point regarding the apply function constructing a brand new object. So this will need to be captured separately from the PipelineMeta metaclass. Which is fine.

I'll work on apply, redim, and opts, next. Let me know if any other cases come to mind. Right now the following are covered:

Dataset methods
Operation subclasses
iloc and ndloc accessors
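How a metaclass can record method calls while skipping nested ones can be sketched as follows. The names `RecordingMeta` and `Toy` are hypothetical, and this is a simplified stand-in rather than the PipelineMeta code in this PR; it only illustrates the roles of the _in_method flag, the try/finally reset, and the copying of docstrings onto wrapped methods.

```python
import functools


class RecordingMeta(type):
    """Hypothetical metaclass: wrap public methods so calls are recorded."""

    def __new__(mcs, name, bases, namespace):
        for attr, value in list(namespace.items()):
            if callable(value) and not attr.startswith("_"):
                namespace[attr] = mcs.wrap(value)
        return super().__new__(mcs, name, bases, namespace)

    @staticmethod
    def wrap(method):
        @functools.wraps(method)  # keep the original name and docstring
        def wrapper(self, *args, **kwargs):
            outermost = not self._in_method
            self._in_method = True
            try:
                result = method(self, *args, **kwargs)
            finally:
                # Reset the flag even if the method raises.
                if outermost:
                    self._in_method = False
            # Only the outermost call is appended to the pipeline.
            if outermost and isinstance(result, Toy):
                result.pipeline = self.pipeline + [(method, list(args), kwargs)]
            return result
        return wrapper


class Toy(metaclass=RecordingMeta):
    def __init__(self, values, pipeline=None):
        self.values = values
        self.pipeline = list(pipeline or [])
        self._in_method = False

    def above(self, threshold):
        return Toy([v for v in self.values if v > threshold])

    def between(self, low, high):
        # Nested call: only the outer `between` should be recorded.
        return Toy([v for v in self.above(low).values if v < high])


result = Toy([1, 2, 3, 4, 5]).between(1, 5)
print(result.values)                                   # [2, 3, 4]
print([step[0].__name__ for step in result.pipeline])  # ['between']
```

Without the flag, the inner call to `above` would be recorded as well, and replaying the pipeline would apply that filter twice.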

jonmmease · 2019-09-18T13:01:58Z

In 702e531 I added a new meta class to support pipelines in apply, redim, and opts accessors.

points.apply(
    lambda p: hv.Points(p.select(x=(0, None)).data)
).redim.label(x="The X Dim").opts(color='green').pipeline

[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,  [],  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []}),
 (holoviews.core.accessors.Apply, [], {'mode': None}),
 (<function holoviews.core.accessors.Apply.__call__(...)>,  [<function __main__.<lambda>(p)>],  {}),
 (holoviews.core.accessors.Redim, [], {'mode': 'dataset'}),
 (<function holoviews.core.accessors.Redim.__call__(...)>,  [None],  {'x': {'label': 'The X Dim'}}),
 (holoviews.core.accessors.Opts, [], {'mode': None}),
 (<function holoviews.core.accessors.Opts.__call__(...)>,  [],  {'color': 'green'})]
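Recording at the accessor level, rather than relying on methods called inside the user's function, can be sketched as below. `ToyElement` and `ApplyAccessor` are hypothetical names, and the sketch is simplified: it stores the user function as a single step, whereas the output above shows the PR recording the accessor and its `__call__` as two steps.

```python
class ToyElement:
    def __init__(self, values, pipeline=None):
        self.values = values
        self.pipeline = list(pipeline or [])

    @property
    def apply(self):
        return ApplyAccessor(self)


class ApplyAccessor:
    """Hypothetical accessor that records the user function as one step."""

    def __init__(self, obj):
        self._obj = obj

    def __call__(self, function, *args, **kwargs):
        result = function(self._obj, *args, **kwargs)
        if isinstance(result, ToyElement):
            step = (function, list(args), kwargs)
            result.pipeline = self._obj.pipeline + [step]
        return result


def rebuild(element):
    # Builds a brand-new object, so no method of `element` would be recorded.
    return ToyElement([v * 2 for v in element.values])


doubled = ToyElement([1, 2, 3]).apply(rebuild)
print(doubled.values)         # [2, 4, 6]
print(len(doubled.pipeline))  # 1
```

Since the step is recorded by the accessor itself, it does not matter whether the function repacks its result into a new object.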

johnzzzzzzz · 2019-09-18T13:55:01Z

Jon,
This feature looks great!
I am personally interested in model-view-control links between different hvplot diagrams.
It looks like a view's (hv.Points) pipeline could be used to update a view, when the model selection (DataSet.select) changes.
Could you show an example of that working using bokeh backend?
For Example 1, suppose the two Points elements were in a Layout instead of an Overlay. Then, if the ds Dataset had rows selected, these rows would be selected in both elements. That is, all the selected points would be displayed in the first Points, but only the rows selected and then passed through the pipeline would be displayed in the second Points.


jonmmease · 2019-09-18T14:49:29Z

Hi @johnzzzzzzz,

Have you seen #3951? This is work towards creating a workflow to automatically link selections between HoloViews elements (including those produced by hvplot). The next iteration of that PR is going to build on top of this pipeline work.

johnzzzzzzz · 2019-09-18T15:57:30Z

I am excited about #3951 and would like to help create test cases.
I will try to figure out how to clone a branch that includes this and #3951
This feature may also help panel 604 not supporting linked Elements

jbednar · 2019-09-18T21:30:28Z

I'm excited too. Would it be possible for obj.pipeline to work as it does above (returning the list) while obj.pipeline() does what is currently invoked with obj.execute_pipeline()? Having words like execute in a function call slightly annoys me, because every function call executes something, so it seems sufficient to convey "calling" with the standard Python () call syntax alone. But on a quick glance at the property I can't tell if overloading it in that way would work, so I'm just proposing it here if it's possible.

jonmmease · 2019-09-18T21:45:46Z

> Would it be possible for obj.pipeline to work as it does above (returning the list) while obj.pipeline()

I don't think so, unless the thing returned by obj.pipeline isn't a standard Python list. We could return some object of our own that both represents the pipeline and evaluates it with __call__. But I'm not sure how intuitive this would be for users.
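The alternative mentioned here, an object that both represents the pipeline and evaluates it with `__call__`, could look roughly like this. `CallablePipeline` and `add` are hypothetical names used only for this sketch; nothing like this is part of the PR.

```python
class CallablePipeline(list):
    """Hypothetical list subclass whose call replays the recorded steps."""

    def __call__(self, data):
        result = data
        for function, args, kwargs in self:
            result = function(result, *args, **kwargs)
        return result


def add(values, amount):
    return [v + amount for v in values]


pipeline = CallablePipeline([(sorted, [], {"reverse": True}), (add, [1], {})])

print(isinstance(pipeline, list))  # True
print(len(pipeline))               # 2
print(pipeline([3, 1, 2]))         # [4, 3, 2]
```

The object still behaves like a list for inspection, but it is no longer a plain `list`, which is the trade-off raised in this comment.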

I'm definitely open to renaming execute_pipeline though!

jbednar · 2019-09-18T21:53:17Z

That's what I suspected. It would be easy to have something that prints like a list while being callable, but I agree that it's nicer to have it simply be a list when it's returned as a value. I don't have any suggestions for a better name, then.


philippjfr · 2019-09-22T12:11:42Z

Looks good! I don't actually much like the group handling of operations. I think we should set the group default to None in most cases and then skip setting the group if it is None, e.g. in chain

return processed if self.p.group is None else processed.clone(group=self.p.group)
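The effect of that conditional can be tried out on a stand-in class. `ToyElement` and `finish` are hypothetical; the sketch only mirrors the one-line expression above and does not reproduce the chain operation itself.

```python
class ToyElement:
    def __init__(self, data, group="Element"):
        self.data = data
        self.group = group

    def clone(self, **overrides):
        # Copy the element, replacing any parameters passed as overrides.
        params = {"data": self.data, "group": self.group}
        params.update(overrides)
        return ToyElement(**params)


def finish(processed, group=None):
    """Relabel the result only when a group was explicitly requested."""
    return processed if group is None else processed.clone(group=group)


element = ToyElement([1, 2, 3], group="Histogram")
print(finish(element).group)                  # Histogram
print(finish(element, group="Custom").group)  # Custom
```

With a default of None, the group produced by the last step is left untouched unless the caller asks for a different one.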

jonmmease · 2019-09-23T10:08:57Z

> Looks good!

Thanks! It is nice for pipeline to be a standard chain operation.

> I don't actually much like the group handling of operations. I think we should set the group default to None in most cases and then skip setting the group if it is None

That sounds good to me. I made the change in the chain operation in 50bd22a.

I think this PR is in pretty good shape now. Thanks for taking a look, and let me know if anything else comes to mind that we should do before merging.


philippjfr · 2019-09-23T16:24:38Z

Happy to see this merged. I'll give @jlstevens a chance to review though.

philippjfr · 2019-09-24T20:51:58Z

Okay, since he's on PTO for the foreseeable future I'm going to go ahead and merge.

jonmmease · 2019-09-26T13:03:32Z

Thanks!

github-actions · 2024-10-24T11:22:05Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Add pipeline dataset property to track data lineage

b32522a

jonmmease mentioned this pull request Sep 18, 2019

Support selections across all dimensions in element.dataset #3924

Closed

philippjfr added the "tag: API" and "type: feature" labels Sep 18, 2019

Add pipeline support to apply, redim, and opts accessors

702e531

jonmmease added 5 commits September 18, 2019 09:12

Guard against accessors that wrap objects without pipeline support

18d3548

Fix dataset property histogram tests now that apply is added to pipeline

8ed7c5f

Copy docstrings to Metaclass wrapping methods

e016dc9

change metaclass arg name to mcs

11d8439

Override options method for pipeline support

7404723

jonmmease added 5 commits September 18, 2019 10:03

Add pipeline support for Dataset.map

2adf704

standardize names of args to pipelined_call

7ea2ba5

Fix pipeline tests now that map is a pipeline step

25d7674

remove trailing whitespace

f9e1f73

Revert "Fix pipeline tests now that map is a pipeline step"

b7eef37

This reverts commit 25d7674

jbednar mentioned this pull request Sep 18, 2019

Vega, Datashader, and Holoviews Collaboration Quansight/omnisci#67

Open

Handle pipeline functions that return the same element

9657592

Reset the dataset property and empty pipeline when clone replaces data

cb7cf8d

jonmmease added 9 commits September 19, 2019 08:26

Propagate dataset property through clone when _in_method

931b796

support relabel in pipeline

2d1cbae

Merge branch 'master' into pipeline

388687a

Merge branch 'master' into pipeline

cfa3449

Convert pipeline to be a chain operation

fdfd538

This removes the `execute_pipeline` method

Fix tests

1ddb237

Update pipeline docstring

a610829

Remove execute_pipeline from blacklist

1ac15d3

Use try/finally when setting _in_method to avoid inconsistent state

c2a5a96

in the presence of exceptions

jonmmease added 2 commits September 23, 2019 06:09

Make chain operation default to group of element produced by last operation

50bd22a

unused import

ba9b4df

philippjfr merged commit a58216c into master Sep 24, 2019

philippjfr deleted the pipeline branch October 2, 2019 17:11

github-actions bot locked as resolved and limited conversation to collaborators Oct 24, 2024
